configpolicy

dustin

Author	SHA1	Message	Date
Dustin	3511176c31	r/gitea: Configure SMTP mailer Gitea needs SMTP configuration in order to send e-mail notifications about e.g. pull requests. The `gitea_smtp` variable can be defined to enable this feature.	2024-08-25 08:46:37 -05:00
Dustin	85da487cb8	r/dch-proxy: Define sites declaratively I've already made a couple of mistakes keeping the HTTP and HTTPS rules in sync. Let's define the sites declaratively and derive the HAProxy rules from the data, rather than manually typing the rules.	2024-08-24 11:48:45 -05:00
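A minimal sketch of the idea in the commit above: both rule sets are derived from one list of sites, so they cannot drift apart. The role itself is an Ansible role, so this Python is only an illustration; the `Site` fields, the `-tls` backend suffix, and the example host names are assumptions, not taken from the repository.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Site:
    """One proxied site. Field names are hypothetical."""
    hostname: str
    backend: str


def http_rule(site):
    # Plain HTTP: route on the Host request header.
    return f"use_backend {site.backend} if {{ hdr(host) -i {site.hostname} }}"


def https_rule(site):
    # TLS passthrough: route on the SNI name sent in the ClientHello.
    # The "-tls" backend suffix is an assumed naming convention.
    return f"use_backend {site.backend}-tls if {{ req.ssl_sni -i {site.hostname} }}"


def render_rules(sites):
    """Return (http_rules, https_rules), both derived from the same list."""
    return [http_rule(s) for s in sites], [https_rule(s) for s in sites]


# Example data; these host names are placeholders.
SITES = [
    Site("www.example.org", "web"),
    Site("git.example.org", "gitea"),
]

if __name__ == "__main__":
    http_rules, https_rules = render_rules(SITES)
    print("\n".join(http_rules + https_rules))
```

Adding a site means adding one entry to `SITES`; the HTTP and HTTPS rules follow from it.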
Dustin	2fa28dfa5f	r/dch-proxy: Update and clean up The dch-proxy role has not been used for quite some time. The web server has been handling the reverse proxy functionality, in addition to hosting websites. The drawback to using Apache as the reverse proxy, though, is that it operates in TLS-terminating mode, so it needs to have the correct certificate for every site and application it proxies for. This is becoming cumbersome, especially now that there are several sites that do not use the _pyrocufflink.net_ wildcard certificate. Notably, Tabitha's _hatchlearningcenter.org_ is problematic because although the main site is hosted by the web server, the Invoice Ninja client portal is hosted in Kubernetes. Switching back to HAProxy to provide the reverse proxy functionality will eliminate the need to have the server certificate both on the backend and on the reverse proxy, as it can operate in TLS-passthrough mode. The main reason I stopped using HAProxy in the first place was because when using TLS-passthrough mode, the original source IP address is lost. Fortunately, HAProxy and Apache can both be configured to use the PROXY protocol, which provides a mechanism for communicating the original IP address while still passing through the TLS connection unmodified. This is particularly important for Nextcloud because of its built-in intrusion prevention; without knowing the actual source IP address, it blocks _everyone_, since all connections appear to come from the reverse proxy's IP address. Combining TLS-passthrough mode with the PROXY protocol resolves both the certificate management issue and the source IP address issue. I've cleaned up the _dch-proxy_ role quite a bit in this commit. Notably, I consolidated all the backend and frontend definitions into a single file; it didn't really make sense to have them all separate, since they were managed by the same role and referred to each other.
Of course, I had to update the backends to match the currently-deployed applications as well.	2024-08-24 11:46:28 -05:00
Dustin	153b210a73	vm-hosts: Do not reboot after auto updates For obvious reasons, the VM hosts cannot automatically reboot themselves.	2024-08-23 09:33:29 -05:00
Dustin	c546f09335	smtp-relay: Rewrite dustin@hatch.name Sometimes, the mail server for hatch.name is extremely slow. While there isn't much I can do about it for external senders, I can at least ensure that email messages sent by internal services like Authelia are always delivered quickly by rewriting the recipient address to my actual email address, bypassing the hatch.name exchange entirely.	2024-08-22 16:17:00 -05:00
Dustin	a2cf78f3f5	vm-hosts: Update vm-autostart logs0.pyrocufflink.blue was replaced by loki0.pyrocufflink.blue ages ago, so I'm not sure how I hadn't updated the autostart list with it yet. unifi3.pyrocufflink.blue replaced unifi2.p.b recently, when I was testing Luci/etcd.	2024-08-14 20:26:11 -05:00
Dustin	6d65e0594f	frigate: Configure HTTPS proxy with creds Only the _frigate_ user is allowed to access the GitHub API via the proxy.	2024-08-14 20:26:11 -05:00
Dustin	d2b3b1f7b3	hosts: Deploy production Frigate on nvr2.p.b nvr2.pyrocufflink.blue originally ran Fedora CoreOS. Since I'm tired of the tedium and difficulty involved in making configuration changes to FCOS machines, I am migrating it to Fedora Linux, managed by Ansible.	2024-08-12 22:22:50 -05:00
Dustin	6c71d96f81	r/frigate-caddy: Deploy Caddy in front of Frigate Deploying Caddy as a reverse proxy for Frigate enables HTTPS with a certificate issued by the internal CA (via ACME) and authentication via Authelia. Separating the installation and base configuration of Caddy into its own role will allow us to reuse that part for other applications that use Caddy for similar reasons.	2024-08-12 18:47:04 -05:00
Dustin	7b61a7da7e	r/useproxy: Configure system-wide proxy The useproxy role configures the `http_proxy` et al. environment variables for systemd services and interactive shells. Additionally, it configures Yum repositories to use a single mirror via the `baseurl` setting, rather than a list of mirrors via `metalink`, since a) the proxy only allows access to _dl.fedoraproject.org_ and b) the proxy caches RPM files, but this is only effective if all clients use the same mirror all the time. The `useproxy.yml` playbook applies this role to servers in the needproxy group.	2024-08-12 18:47:04 -05:00
Dustin	96bc8c2c09	vm-hosts: Update autostart list k8s-amd64-n0, k8s-amd64-n1, and k8s-amd64-n2 have been replaced by k8s-amd64-n4, k8s-amd64-n5, and k8s-amd64-n6, respectively. db0 is the new database server, which needs to be up before anything in Kubernetes starts, since a lot of applications running there depend on it.	2024-07-03 08:52:15 -05:00
Dustin	4f202c55e4	r/postgres-exporter: Deploy postgres-exporter The [postgres-exporter][0] exposes PostgreSQL server statistics to Prometheus. It connects to a specified PostgreSQL server (in this case, a server on the local machine via UNIX socket) and collects data from the `pg_stat_activity`, et al. views. It needs the `pg_monitor` role in order to be allowed to read the relevant metrics. Since we're setting up the exporter to connect via UNIX socket, it needs a dedicated OS user to match the PostgreSQL user in order to authenticate via the _peer_ method. [0]: https://github.com/prometheus-community/postgres_exporter/	2024-07-02 20:44:29 -05:00
Dustin	3f5550ee6c	postgresql: wal-g: Set PGHOST By default, WAL-G tries to connect to the PostgreSQL server via TCP socket on the loopback interface. Our HBA configuration requires certificate authentication for TCP sockets, so we need to configure WAL-G to use the UNIX socket.	2024-07-02 20:44:29 -05:00
Dustin	6caf28259e	hosts: db0: Promote to primary All data have been migrated from the PostgreSQL server in Kubernetes and the three applications that used it (Firefly-III, Authelia, and Home Assistant) have been updated to point to the new server. To avoid comingling the backups from the old server with those from the new server, we're reconfiguring WAL-G to push and pull from a new S3 prefix.	2024-07-02 20:44:29 -05:00
Dustin	208fadd2ba	postgresql: Configure for dedicated DB servers I am going to use the postgresql group for the dedicated database servers. The configuration for those machines will be quite a bit different than for the one existing machine that is a member of that group already: the Nextcloud server. Rather than undefine/override all the group-level settings at the host level, I have removed the Nextcloud server from the postgresql group, and updated the `nextcloud.yml` playbook to apply the postgresql-server role itself. Eventually, I want to move the Nextcloud database to the central database servers. At that point, I will remove the postgresql-server role from the `nextcloud.yml` playbook.	2024-07-02 20:44:29 -05:00
Dustin	7201f7ed5c	vm-hosts: Expose storage VLAN to VMs To improve the performance of persistent volumes accessed directly from the Synology by Kubernetes pods, I've decided to expose the storage network to the Kubernetes worker node VMs. This way, iSCSI traffic does not have to go through the firewall. I chose not to use the physical interfaces that are already directly connected to the storage network for this for two reasons: 1) I like the physical separation of concerns and 2) it would add complexity to the setup by introducing a bridge on top of the existing bond.	2024-06-23 10:43:15 -05:00
Dustin	6520b86958	k8s-controller: Do not reboot after auto-updates I don't want the Kubernetes control plane servers rebooting themselves randomly; I need to coordinate that with other goings-on on the network.	2024-06-23 10:43:15 -05:00
Dustin	f0445ebe53	nextcloud: Do not auto-update Nextcloud Nextcloud usually (always?) wants the `occ upgrade` command to be run after an update. If the nextcloud package gets updated along with the rest of the OS, Nextcloud will be down until I manually run that command hours/days later.	2024-06-23 10:43:15 -05:00
Dustin	24bf145a34	all: Do not auto-update on weekends I don't want machines updating themselves, rebooting, and potentially breaking stuff over the weekend.	2024-06-21 22:08:03 -05:00
Dustin	88c45e22b6	vm-hosts: Update VM autostart for new DCs	2024-06-20 18:49:04 -05:00
Dustin	292ab4585c	all: promtail: Update trusted CA certificate Loki uses a certificate signed by dch-ca r2 now (actually has for quite some time...)	2024-06-12 18:57:01 -05:00
Dustin	ffe972d79b	r/samba-cert: Obtain LDAP/TLS cert via ACME The samba-cert role configures `lego` and HAProxy to obtain an X.509 certificate via the ACME HTTP-01 challenge. HAProxy is necessary because LDAP server certificates need to have the apex domain in their SAN field, and the ACME server may contact any domain controller server with an A record for that name. HAProxy will forward the challenge request on to the first available host on port 5000, where `lego` is listening to provide validation. Issuing certificates this way has a couple of advantages: 1. No need for the wildcard certificate for the pyrocufflink.blue domain any more 2. Renewals are automatic and handled by the server itself rather than Ansible via scheduled Jenkins job Item (2) is particularly interesting because it avoids the bi-monthly issue where replacing the LDAP server certificate and restarting Samba causes the Jenkins job to fail. Naturally, for this to work correctly, all LDAP client applications need to trust the certificates issued by the ACME server, in this case DCH Root CA R2.	2024-06-12 18:33:24 -05:00
Dustin	58972cf188	auto-updates: Install and configure dnf-automatic dnf-automatic is an add-on for `dnf` that performs scheduled, automatic updates. It works pretty much how I would want it to: triggered by a systemd timer, sends email reports upon completion, and only reboots for kernel et al. updates. In its default configuration, `dnf-automatic.timer` fires every day. I want machines to update weekly, but I want them to update on different days (so as to avoid issues if all the machines reboot at once). Thus, the _dnf-automatic_ role uses a systemd unit extension to change the schedule. The day-of-the-week is chosen pseudo-randomly based on the host name of the managed system.	2024-06-12 06:25:17 -05:00
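A minimal sketch of how a stable, pseudo-random update day can be derived from a host name and rendered as a systemd timer drop-in. The hashing scheme, the function names, and the `04:00` time are assumptions for illustration; the role's actual mechanism is not shown in the commit message.

```python
import hashlib

# All seven days are candidates in this sketch.
DAYS = ["Mon", "Tue", "Wed", "Thu", "Fri", "Sat", "Sun"]


def update_day(hostname):
    """Map a host name to a weekday; the same name always gives the same day."""
    digest = hashlib.sha256(hostname.encode("utf-8")).digest()
    return DAYS[int.from_bytes(digest[:4], "big") % len(DAYS)]


def timer_dropin(hostname, time="04:00"):
    """Render a drop-in for dnf-automatic.timer (the time is an assumed value).

    The empty OnCalendar= line clears the packaged daily schedule before
    the weekly one is set.
    """
    return "[Timer]\nOnCalendar=\nOnCalendar={} {}\n".format(
        update_day(hostname), time
    )


if __name__ == "__main__":
    # Placeholder host names.
    for name in ("host-a.example.org", "host-b.example.org"):
        print(name, update_day(name))
```

Because the day depends only on the host name, repeated runs leave the schedule unchanged while different hosts tend to land on different days.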
Dustin	1f86fa27b6	vm-hosts: Auto-start unifi2	2024-05-26 10:51:16 -05:00
Dustin	5a9b8b178a	hosts: Decommission unifi1 unifi1.pyrocufflink.blue is being replaced with unifi2.pyrocufflink.blue. The new server runs Fedora CoreOS.	2024-05-26 10:50:32 -05:00
Dustin	06b399994e	public-web: Add Tabitha's new SSH key We got Nicepage to work on Tabitha's Fedora Thinkpad, so now she'll do most of her website work on that machine.	2024-03-15 10:29:03 -05:00
Dustin	0578736596	unifi: Scrape logs from UniFi and device syslog The UniFi controller can act as a syslog server, receiving log messages from managed devices and writing them to files in the `logs/remote` directory under the application data directory. We can scrape these logs, in addition to the logs created by the UniFi server itself, with Promtail to get more information about what's happening on the network.	2024-02-28 19:04:30 -06:00
Dustin	19009bde1a	promtail: Role/Playbook to deploy Promtail Promtail is the log sending client for Grafana Loki. For traditional Linux systems, an RPM package is available from upstream, making installation fairly simple. Configuration is stored in a YAML file, so again, it's straightforward to configure via Ansible variables. Really, the only interesting step is adding the _promtail_ user, which is created by the RPM package, to the _systemd-journal_ group, so that Promtail can read the systemd journal files.	2024-02-22 19:23:31 -06:00
Dustin	226a9e05fa	nut: Drop group NUT is managed by _cfg.git_ now.	2024-02-22 10:24:16 -06:00
Dustin	493663e77f	frigate: Drop group Frigate is no longer managed by Ansible. Dropping the group so the file encrypted with Ansible Vault can go away.	2024-02-22 10:23:19 -06:00
Dustin	fdc59fe73b	pyrocufflink-dns: Drop group The internal DNS server for the pyrocufflink.blue et al. domains runs on the firewall now, and is thus no longer managed by Ansible. Dropping the group variables so the file encrypted with Ansible Vault can go away.	2024-02-22 10:23:19 -06:00
Dustin	19d833cc76	websites/d&t.com: drop obsolete formsubmit config The dustinandtabitha.com website no longer uses formsubmit (the time for RSVP has long passed). Removing the configuration so the file encrypted with Ansible Vault can go away.	2024-02-22 10:23:19 -06:00
Dustin	f9f8d5aa29	Remove grafana, metricspi groups With the Metrics Pi decommissioned and Victoria Metrics and Grafana running in Kubernetes now, these groups are no longer needed.	2024-02-22 10:23:19 -06:00
Dustin	f83cea50e9	r/ssu-user-ca: Configure sshd TrustedUserCAKeys The `TrustedUserCAKeys` setting for sshd(8) tells the server to accept any certificates signed by keys listed in the specified file. The authenticating username has to match one of the principals listed in the certificate, of course. This role is applied to all machines, via the `base.yml` playbook. Certificates issued by the user CA managed by SSHCA will therefore be trusted everywhere. This brings us one step closer to eliminating the dependency on Active Directory/Samba.	2024-02-01 18:46:40 -06:00
Dustin	0d30e54fd5	r/fileserver: Restrict non-administrators to SFTP Normal users do not need shell access to the file server, and certainly should not be allowed to e.g. forward ports through it. Using a `Match` block, we can apply restrictions to users who do not need administrative functionality. In this case, we restrict everyone who is not a member of the Server Admins group in the PYROCUFFLINK AD domain.	2024-02-01 10:29:32 -06:00
Dustin	4b8b5fa90b	pyrocufflink: Enable pam_ssh_agent_auth for sudo By default, `sudo` requires users to authenticate with their passwords before granting them elevated privileges. It can be configured to allow (some) users access to (some) privileged commands without prompting for a password (i.e. `NOPASSWD`), however this has a real security implication. Disabling the password requirement would effectively grant any program root privileges. Prompting for a password prevents malicious software from running privileged commands without the user knowing. Unfortunately, handling `sudo` authentication for Ansible is quite cumbersome. For interactive use, the `--ask-become-pass`/`-K` argument is useful, though entering the password for each invocation of `ansible-playbook` while iterating on configuration policy development is a bit tedious. For non-interactive use, though, the password of course needs to be stored somewhere. Encrypting it with Ansible Vault is one way to protect it, but it still ends up stored on disk somewhere and needs to be handled carefully. pam_ssh_agent_auth provides an acceptable solution to both issues. It is better than disabling `sudo` authentication entirely, but a lot more convenient than dealing with passwords. It uses the calling user's SSH agent to assert that the user has access to a private key corresponding to one of the authorized public keys. Using SSH agent forwarding, that private key can even exist on a remote machine. If the user does not have a corresponding private key, `sudo` will fall back to normal password-based authentication. The security of this solution is highly dependent on the client to store keys appropriately. FIDO2 keys are supported, though when used with Ansible, it is quite annoying to have to touch the token for _every task_ on _every machine_. Thus, I have created new FIDO2 keys for both my laptop and my desktop that have the `no-touch-required` option enabled. 
This means that in order to use `sudo` remotely, I still need to have my token plugged in to my computer, but I do not have to tap it every time it's used. For Jenkins, a hardware token is obviously impossible, but using a dedicated key stored as a Jenkins credential is probably sufficient.	2024-01-28 12:16:35 -06:00
Dustin	7b54bc4400	nut-monitor: Require both UPS to be online Unfortunately, the automatic transfer switch does not seem to work correctly. When the standby source is a UPS running on battery, it does not switch sources if the primary fails. In other words, when the power is out and both UPS are running on battery, when the first one dies, it will NOT switch to the second one. It has no trouble switching when the second source is mains power, though, which is very strange. I have tried messing with all the settings including nominal input voltage, sensitivity, and frequency tolerance, but none seem to have any effect. Since it is more important for the machines to shut down safely than it is to have an extra 10-15 minutes of runtime during an outage, the best solution for now is to configure the hosts to shut down as soon as the first UPS battery gets low. This is largely a waste of the second UPS, but at least it will help prevent data loss.	2024-01-25 21:22:04 -06:00
Dustin	236e6dced6	r/web/hlc: Add formsubmit config for summer signup And of course, Tabitha lost her SSH key so she had to get another one.	2024-01-23 22:04:29 -06:00
Dustin	07f84e7fdc	vm-hosts: Increase VM start delay after K8s Increasing the delay after starting the Kubernetes cluster to hopefully allow things to "settle down" enough that starting services on follow up VMs doesn't time out.	2024-01-22 08:35:40 -06:00
Dustin	6f4fb70baa	vm-hosts: Clean up vm-autostart list Start Kubernetes earlier. Start Synapse later (it takes a long time to start up and often times out when the VM hosts are under heavy load). Start SMTP relay later as it's not really needed.	2024-01-21 18:42:28 -06:00
Dustin	b4fcbb8095	unifi: Deploy unifi_exporter `unifi_exporter` provides Prometheus metrics for UniFi controller.	2024-01-21 16:12:29 -06:00
Dustin	6f5b400f4a	vm-hosts: Fix test network device name The network device for the test/pyrocufflink.red network is named `br1`. This needs to match in the systemd-networkd configuration or libvirt will not be able to attach virtual machines to the bridge.	2024-01-21 15:55:37 -06:00
Dustin	fb445224a0	vm-hosts: Add k8s-amd64-n3 to autostart list	2024-01-21 15:55:23 -06:00
Dustin	525f2b2a04	nut-monitor: Configure upsmon `upsmon` is the component of [NUT] that monitors (local or remote) UPS devices and reacts to changes in their state. Notably, it is responsible for powering down the system when there is insufficient power to the system.	2024-01-19 20:50:03 -06:00
Dustin	ab30fa13ca	file-servers: Set Apache ServerName Since file0.pyrocufflink.blue now hosts a couple of VirtualHosts, accessing its HTTP server by the files.pyrocufflink.blue alias no longer works, as Apache routes unknown hostnames to the first VirtualHost, rather than the global configuration. To resolve this, we must set `ServerName` to the alias.	2023-12-29 10:46:13 -06:00
Dustin	dfd828af08	r/ssh-host-certs: Manage SSH host certificates The ssh-host-certs role, which is now applied as part of the `base.yml` playbook and therefore applies to all managed nodes, is responsible for installing the sshca-cli package and using it to request signed SSH host certificates. The sshca-cli-systemd sub-package includes systemd units that automate the process of requesting and renewing host certificates. These units need to be enabled and provided the URL of the SSHCA service. Additionally, the SSH daemon needs to be configured to load the host certificates.	2023-11-07 21:27:02 -06:00
Dustin	c6f0ea9720	r/repohost: Configure Yum package repo host So it turns out Gitea's RPM package repository feature is less than stellar. Since each organization/user can only have a single repository, separating packages by OS would be extremely cumbersome. Presumably, the feature was designed for projects that only build a single RPM for each version, but most of my packages need multiple builds, as they tend to link to system libraries. Further, only the repository owner can publish to user-scoped repositories, so e.g. Jenkins cannot publish anything to a repository under my dustin account. This means I would ultimately have to create an Organization for every OS/version I need to support, and make Jenkins a member of it. That sounds tedious and annoying, so I decided against using that feature for internal packages. Instead, I decided to return to the old ways, publishing packages with `rsync` and serving them with Apache. It's fairly straightforward to set this up: just need a directory with the appropriate permissions for users to upload packages, and configure Apache to serve from it. One advantage Gitea's feature had over a plain directory is its automatic management of repository metadata. Publishers only have to upload the RPMs they want to serve, and Gitea handles generating the index, database, etc. files necessary to make the packages available to Yum/dnf. With a plain file host, the publisher would need to use `createrepo` to generate the repository metadata and upload that as well. For repositories with multiple packages, the publisher would need a copy of every RPM file locally in order for them to be included in the repository metadata. This, too, seems like it would be too much trouble to be tenable, so I created a simple automatic metadata manager for the file-based repo host. Using `inotifywatch`, the `repohost-createrepo` script watches for file modifications in the repository base directory.
Whenever a file is added or changed, the directory containing it is added to a queue. Every thirty seconds, the queue is processed; for each unique directory in the queue, repository metadata are generated. This implementation combines the flexibility of a plain file host, supporting an effectively unlimited number of repositories with fully-configurable permissions, and the ease of publishing of a simple file upload.	2023-11-07 20:51:10 -06:00
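A minimal sketch of the queue-and-batch behaviour described above, written in Python for illustration only; it is not the `repohost-createrepo` script. The class and function names are hypothetical, and the file-watching part is left out: something else is assumed to call `file_changed` for each event.

```python
import os
import subprocess


class MetadataQueue:
    """Collects directories whose repository metadata is out of date."""

    def __init__(self):
        # A set, so a directory with many changed files is queued only once.
        self._pending = set()

    def file_changed(self, path):
        # Queue the directory that contains the added or changed file.
        self._pending.add(os.path.dirname(path))

    def process(self, regenerate):
        """Call regenerate(directory) once per queued directory, then reset."""
        batch, self._pending = sorted(self._pending), set()
        for directory in batch:
            regenerate(directory)
        return batch


def run_createrepo(directory):
    # Assumed invocation; adjust to the createrepo variant actually installed.
    subprocess.run(["createrepo", "--update", directory], check=True)
```

A driver loop would call `queue.process(run_createrepo)` every thirty seconds, so a burst of uploads to one directory triggers a single metadata rebuild.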
Dustin	6955c4e7ad	hosts: Decommission dc-4k6s8e.p.b Replaced by dc-nrtxms.pyrocufflink.blue	2023-10-28 16:07:56 -05:00
Dustin	420764d795	hosts: Add dc-nrtxms.p.b New Fedora 38 Active Directory Domain Controller	2023-10-28 16:07:39 -05:00
Dustin	a8c184d68c	hosts: Decommission dc-ag62kz.p.b Replaced by dc-qi85ia.pyrocufflink.blue	2023-10-28 16:07:08 -05:00
Dustin	686817571e	smtp-relay: Switch to Fastmail AWS is going to begin charging extra for routable IPv4 addresses soon. There's really no point in having a relay in the cloud anymore anyway, since a) all outbound messages are sent via the local relay and b) no messages are sent to anyone except me.	2023-10-24 17:27:21 -05:00
Dustin	1b9543b88f	metricspi: alerts: Increase Frigate disk threshold We want the Frigate recording volume to be basically full at all times, to ensure we are keeping as much recording as possible.	2023-10-15 09:52:12 -05:00
Dustin	2f554dda72	metricspi: Scrape k8s-aarch64-n1 I've added a new Kubernetes worker node, k8s-aarch64-n1.pyrocufflink.blue. This machine is a Raspberry Pi CM4 mounted on a Waveshare CM4-IO-Base A and clipped onto the DIN rail. It's got 8 GB of RAM and 32 GB of eMMC storage. I intend to use it to build container images locally, instead of bringing up cloud instances.	2023-10-05 14:32:19 -05:00
Dustin	a74113d95f	metricspi: Scrape Zincati metrics from CoreOS hosts Zincati is the automatic update manager on Fedora CoreOS. It exposes Prometheus metrics for host/update statistics, which are useful to track the progress of automatic updates and identify update issues. Zincati actually exposes its metrics via a Unix socket on the filesystem. Another process, [local_exporter], is required to expose the metrics from this socket via HTTP so Prometheus can scrape them. [local_exporter]: https://github.com/lucab/local_exporter	2023-10-03 10:29:12 -05:00
Dustin	d7f778b01c	metricspi: Scrape metrics from k8s-aarch64-n0 collectd is now running on k8s-aarch64-n0.pyrocufflink.blue, exposing system metrics. As it is not a member of the AD domain, it has to be explicitly listed in the `scrape_collectd_extra_targets` variable.	2023-10-03 10:29:11 -05:00
Dustin	50f4b565f8	hosts: Remove nvr1.p.b as managed system nvr1.pyrocufflink.blue has been migrated to Fedora CoreOS. As such, it is no longer managed by Ansible; its configuration is done via Butane/Ignition. It is no longer a member of the Active Directory domain, but it does still run collectd and export Prometheus metrics.	2023-09-27 20:24:47 -05:00
Dustin	7a9c678ff3	burp-server: Keep more backups New retention policy: * 7 daily backups * 4 weekly backups * 12 ~monthly backups * 5 ~yearly backups	2023-07-17 16:36:37 -05:00
Dustin	06782b03bb	vm-hosts: Update VM autostart list * dc2 has been gone for a long time, replaced by two new domain controllers * unifi0 was recently replaced by unifi1	2023-07-07 10:05:22 -05:00
Dustin	71a43ccf07	unifi: Deploy Unifi Network controller Since Ubiquiti only publishes Debian packages for the Unifi Network controller software, running it on Fedora has historically been nigh impossible. Fortunately, a modern solution is available: containers. The linuxserver.io project publishes a container image for the controller software, making it fairly easy to deploy on any host with an OCI runtime. I briefly considered creating my own image, since theirs must be run as root, but I decided the maintenance burden would not be worth it. Using Podman's user namespace functionality, I was able to work around this requirement anyway.	2023-07-07 10:05:01 -05:00
Dustin	61844e8a95	pyrocufflink: Add Luma SSH keys for root Sometimes I need to connect to a machine when there is an AD issue (e.g. domain controllers are down, clocks are out of sync, etc.) but I can't do it from my desktop.	2023-07-05 16:35:57 -05:00
Dustin	0a68d84121	metricspi: Scrape hatchlearningcenter.org To monitor site availability and certificate expiration.	2023-06-21 14:31:33 -05:00
Dustin	4e608e379f	metricspi/alerts: Correct BURP archive alert query When the RAID array is being resynchronized after the archived disk has been reconnected, md changes the disk status from "missing" to "spare." Once the synchronization is complete, it changes from "spare" to "active." We only want to trigger the "disk needs archived" alert once the synchronization process is complete; otherwise, both the "disks need swapped" and "disk needs archived" alerts would be active at the same time, which makes no sense. By adjusting the query for the "disk needs archived" alert to consider disks in both "missing" and "spare" status, we can delay firing that alert until the proper time.	2023-06-20 11:58:35 -05:00
Dustin	bf4d57b5cb	frigate: Configure journal2ntfy for MD RAID The Frigate server has a RAID array that it uses to store video recordings. Since there have been a few occasions where the array has suddenly stopped functioning, probably because of the cheap SATA controller, it will be nice to get an alert as soon as the kernel detects the problem, so as to minimize data loss.	2023-06-08 10:05:36 -05:00
Dustin	87e8ec2ed4	synapse: Back up data using BURP Most of the Synapse server's state is in its SQLite database. It also has a `media_store` directory that needs to be backed up, though. In order to back up the SQLite database while the server is running, the database must be in "WAL mode." By default, Synapse leaves the database in the default "rollback journal mode," which disallows multiple processes from accessing the database, even for read-only operations. To change the journal mode:
```sh
sudo systemctl stop synapse
sudo -u synapse sqlite3 /var/lib/synapse/homeserver.db 'PRAGMA journal_mode=WAL;'
sudo systemctl start synapse
```
	2023-05-23 09:52:50 -05:00
Dustin	78296f7198	Merge branch 'journal2ntfy'	2023-05-23 08:31:52 -05:00
Dustin	347cda74fd	metrics: Scrape metrics from Kubernetes API server Kubernetes exports a lot of metrics in Prometheus format. I am not sure what all is there, yet, but apparently several thousand time series were added. To allow anonymous access to the metrics, I added this ClusterRole:
```yaml
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRole
metadata:
  name: prometheus
rules:
- apiGroups:
  - ""
  resources:
  - nodes/metrics
  verbs:
  - get
- nonResourceURLs:
  - /metrics
  verbs:
  - get
```
	2023-05-22 21:21:08 -05:00
Dustin	c0bb387b18	metricspi: Scrape metrics from MinIO backup storage MinIO exposes metrics in Prometheus exposition format. By default, it requires an authentication token to access the metrics, but I was unable to get this to work. Fortunately, it can be configured to allow anonymous access to the metrics, which is fine, in my opinion.	2023-05-22 21:19:25 -05:00
Dustin	a7319c561d	journal2ntfy: Script to send log messages via ntfy The `journal2ntfy.py` script follows the systemd journal by spawning `journalctl` as a child process and reading from its standard output stream. Any command-line arguments passed to `journal2ntfy` are passed to `journalctl`, which allows the caller to specify message filters. For any matching journal message, `journal2ntfy` sends a message via the ntfy web service. For the BURP server, we're going to use `journal2ntfy` to generate alerts about the RAID array. When I reconnect the disk that was in the fireproof safe, the kernel will log a message from the md subsystem indicating that the resynchronization process has begun. Then, when the disks are again in sync, it will log another message, which will let me know it is safe to archive the other disk.	2023-05-17 14:51:21 -05:00
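A minimal sketch of the follow-and-forward loop described above. It is not the actual `journal2ntfy.py`; the function names, the choice of `journalctl` output options, and the topic URL are assumptions for illustration.

```python
import subprocess
import sys
import urllib.request


def journalctl_command(extra_args):
    """Build the child command; caller-supplied filters are passed through."""
    # --lines=0 skips old entries and --output=cat prints only the message
    # text; both choices are assumptions made for this sketch.
    return ["journalctl", "--follow", "--lines=0", "--output=cat", *extra_args]


def build_request(url, message):
    """Build an HTTP POST whose body is the notification text."""
    return urllib.request.Request(
        url, data=message.encode("utf-8"), method="POST"
    )


def follow(url, extra_args):
    """Forward every journal line matching the filters to the ntfy topic."""
    proc = subprocess.Popen(
        journalctl_command(extra_args), stdout=subprocess.PIPE, text=True
    )
    for line in proc.stdout:
        message = line.rstrip("\n")
        if message:
            urllib.request.urlopen(build_request(url, message))


if __name__ == "__main__":
    # Hypothetical topic URL; journalctl filters come from the command line.
    follow("https://ntfy.example.org/alerts", sys.argv[1:])
```

Passing the filters straight through keeps the script generic: the same code can watch kernel messages on one host and a single unit on another.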
Dustin	2c002aa7c5	alerts: Add alert to archive BURP disk This alert will fire once the MD RAID resynchronization process has completed and both disks in the array are online. It will clear when one disk is disconnected and moved to the safe.	2023-05-16 08:33:13 -05:00
Dustin	877dcc3879	alerts: Add alerts for missed client backups When BURP fails to even start a backup, it does not trigger a notification at all. As a result, I may not notice for a few days when backups are not happening. That was the case this week, when clients' backups were failing immediately, because of a file permissions issue on the server. To hopefully avoid missing backups for too long in the future, I've added two new alerts: * The no recent backups alert fires if there have not been any BURP backups recently. This may also fire, for example, if the BURP exporter is not working, or if there is something wrong with the BURP data volume. * The missed client backup alert fires if an active BURP client (i.e. one that has had at least one backup in the past 90 days) has not been backed up in the last 24 hours.	2023-05-14 11:48:36 -05:00
Dustin	a2bcd5ccbb	alerts: Adjust BURP RAID disk swap alert Using a 30-day window for the `tlast_change_over_time` function effectively "caps out" the value at 30 days. Thus, the alert reminding me to swap the BURP backup volume will never fire, since the value will never be greater than the 30-day threshold. Using a wider window resolves that issue (though the query will still produce inaccurate results beyond the window).	2023-05-14 11:38:00 -05:00
Dustin	ad9fb6798e	samba-dc: Omit tls cafile setting The `tls cafile` setting in `smb.conf` is not necessary. It is used for verifying peer certificates for mutual TLS authentication, not to specify the intermediate certificate authority chain like I thought. The setting cannot simply be left out, though. If it is not specified, Samba will attempt to load a file from a built-in default path, which will fail, causing the server to crash. This is avoided by setting the value to the empty string.	2023-05-10 08:28:49 -05:00
Dustin	9722fed1b8	metricspi: Scrape dustinandtabitha.com	2023-05-09 21:30:11 -05:00
Dustin	f6f286ac24	alerts: Correct BURP volume swap alert The `tlast_change_over_time` function needs an interval wide enough to consider the range of time we are interested in. In this case, we want to see if the BURP volume has been swapped in the last thirty days, so the interval needs to be `30d`.	2023-05-03 11:06:34 -05:00
Dustin	5ed3ee525e	synapse: Update LDAP server URI	2023-05-01 12:36:33 -05:00
Dustin	a4cc9d0c46	metricspi: Scrape tabitha.biz	2023-04-23 20:03:43 -05:00
Dustin	6c68126a3a	grafana: Update LDAP server host name dc0.p.b has been gone for a while now. All the current domain controllers use LDAPS certificates signed by Let's Encrypt and include the pyrocufflink.blue name, so we can now use the apex domain A record to connect to the directory.	2023-04-12 14:07:51 -05:00
Dustin	78f65355fa	gitea: Back up with BURP	2023-04-12 14:07:51 -05:00
Dustin	1da4c17a8c	alerts: Add alerts for HTTPS certificates These alerts will generate notifications when websites' HTTPS certificates are not properly renewed automatically and become in danger of expiring.	2023-04-12 13:55:31 -05:00
Dustin	bf4133652c	metrics: Scrape Jenkins with blackbox exporter This is mostly to monitor the HTTPS certificate expiration.	2023-04-12 13:55:31 -05:00
Dustin	dc2a05dc8f	alerts: Add alert for BURP RAID array swap This alert counts how long it's been since the number of "active" disks in the RAID array on the BURP server has changed. The assumption is that the number will typically be `1`, but it will be `2` while the second disk is synchronized before the swap occurs.	2023-04-11 22:25:36 -05:00
Dustin	2394bf7436	metricspi: Fix vmalert links 1. Grafana 8 changed the format of the query string parameters for the Explore page. 2. vmalert no longer needs the http.pathPrefix argument when behind a reverse proxy; rather, it uses the request path like the other Victoria Metrics components.	2023-04-11 21:46:43 -05:00
Dustin	6c562c9821	alerts: Ignore missing mdraid disk for BURP The way I am handling swapping out the BURP disk now is by using the Linux MD RAID driver to manage a RAID 1 mirror array. The array normally operates with one disk missing, as it is in the fireproof safe. When it is time to swap the disks, I reattach the offline disk, let the array resync, then disconnect and store the other disk. This works considerably better than the previous method, as it does not require BURP or the NFS server to be offline during the synchronization.	2023-04-11 20:08:07 -05:00
Dustin	a59f24a8b5	metricspi: Stop scraping speedtest Running the speed test periodically was just wasting bandwidth. It failed frequently, and generally did not provide useful information.	2023-04-02 11:05:16 -05:00
Dustin	94de5d6067	samba-dc: Decrease Samba log level The default log level (3) produces too much output and quickly fills the `/var/log` volume on the domain controllers.	2023-03-08 11:26:57 -06:00
Dustin	748c432334	vaultwarden: Change Domain URL The rule is "if it is accessible on the Internet, its name ends in .net". Although Vaultwarden can be accessed by either name, the one specified in the Domain URL setting is the only one that works for WebAuthn.	2023-03-03 11:17:07 -06:00
Dustin	632e1dd906	metricspi: Update LDAP configuration All domain controllers now use the Let's Encrypt wildcard certificate for the pyrocufflink.blue domain. Further, dc2.p.b is decommissioned.	2023-01-09 12:23:54 -06:00
Dustin	90f9e5eba5	samba-dc: Manage sudoers Domain controllers only allow users in the Domain Admins AD group to use `sudo` by default. dustin and jenkins need to be able to apply configuration policy to these machines, but they are not members of said group.	2022-12-23 08:47:31 -06:00
Dustin	9408ee31c3	home-assistant: Back up Zigbee/ZWave/Mosquitto Mosquitto, Zigbee2MQTT, and ZWaveJS2MQTT all have persistent state that needs to be backed up in addition to Home Assistant's own data.	2022-12-23 06:56:52 -06:00
Dustin	77191c8b5a	Fedora37: Set collectd SELinux domain permissive collectd is broken by default on Fedora 36 and 37. Several plugins generate AVC denials.	2022-12-19 10:22:00 -06:00
Dustin	637289036a	blackbox: Update pyrocufflink DNS check I changed the naming convention for domain controller machines. They are no longer "numbered," since the plan is to rotate through them quickly. For each release of Fedora, we'll create two new domain controllers, replacing the existing ones. Their names are now randomly generated and contain letters and numbers, so the Blackbox Exporter check for DNS records needs to account for this.	2022-12-19 09:04:37 -06:00
Dustin	caef7f342b	vm-hosts: Update autostart list * Remove DC0 (decommissioned) * Remove Jenkins and its build VMs (Migrated to Kubernetes) * Add pxe0 (Required for Basement HUD)	2022-12-18 19:55:48 -06:00
Dustin	77c6408187	metricspi: Remove sensors scrape job Sensor data are retrieved via Home Assistant.	2022-12-18 19:16:10 -06:00
Dustin	244482ac52	websites: Add hatchlearningcenter.org This is the website for Tabitha's new hybrid private school! 👩‍🎓	2022-11-30 22:04:29 -06:00
Dustin	772f669ab2	r/gitea: Handle encoded / characters in HTTP paths Gitea package names (e.g. OCI images, etc.) can contain `/` characters. These are encoded as %2F in request paths. Apache needs to forward these sequences to the Gitea server without decoding them. Unfortunately, the `AllowEncodedSlashes` setting, which controls this behavior, is a per-virtualhost setting that is not inherited from the main server configuration, and therefore must be explicitly set inside the `VirtualHost` block. This means Gitea needs its own virtual host definition, and cannot rely on the default virtual host.	2022-11-27 17:21:03 -06:00
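Why decoding `%2F` in the proxy matters can be shown with a short Python sketch. The package name and the path layout below are hypothetical and are not taken from Gitea's API; the point is only that decoding before routing changes the number of path segments.

```python
from urllib.parse import quote, unquote

package = "owner/image"             # hypothetical package name containing a slash
encoded = quote(package, safe="")   # safe="" so "/" is percent-encoded too
path = f"/packages/{encoded}/tags"  # hypothetical path layout

print(path)

# Splitting the still-encoded path keeps the package name in one segment.
segments_encoded = path.strip("/").split("/")

# Decoding before splitting turns %2F into a real separator.
segments_decoded = unquote(path).strip("/").split("/")

print(segments_encoded)
print(segments_decoded)
```

A backend that expects the package name as a single segment would see an extra segment if the proxy decoded the path first, which is why the encoded sequence has to be passed through untouched.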
Dustin	4511d5447e	vm-hosts: Add missing kube.network config When I added the systemd-networkd configuration for the Kubernetes network interface on the VM hosts, I only added the `.netdev` configuration and forgot the `.network` part. Without the latter, systemd-networkd creates the interface, but does not configure or activate it, so it is not able to handle traffic for the VMs attached to the bridge.	2022-08-22 20:00:47 -05:00
Dustin	b8b8ae5798	vm-hosts: Define machines to auto start	2022-08-20 21:19:01 -05:00
Dustin	bc60451949	metricspi: Update DNS server address DNS is now handled by the border firewall.	2022-08-20 18:19:13 -05:00
Dustin	4622240c6c	r/netboot/jenkins-agent: Configure NBD exports The netboot/jenkins-agent Ansible role configures three NBD exports: * A single, shared, read-only export containing the Jenkins agent root filesystem, as a SquashFS filesystem * For each defined agent host, a writable data volume for Jenkins workspaces * For each defined agent host, a writable data volume for Docker. Agent hosts must have some kind of unique value to identify their persistent data volumes. Raspberry Pi devices, for example, can use the SoC serial number.	2022-08-15 17:14:06 -05:00
Dustin	dbc18022f2	metricspi: Increase scrape_timeout for speedtest Running the Internet speed test can often take longer than a minute.	2022-08-12 14:54:49 -05:00


301 Commits (d4d3f0ef811e8d7286546637aaf40fb6ce0bc908)