[Frigate] is an open source network video recorder with advanced
motion detection based on machine-learning object detection. It
uses `ffmpeg` to stream video from one or more RTSP-capable IP video
cameras and passes the images through an object detection process. To
improve the performance of the machine learning model, it supports using
a Coral EdgeTPU device, which requires special drivers: `gasket` and
`apex`.
Frigate is configured via a (rather complex) YAML document; the parts
of its schema that I need are modeled in `schema.cue`.
[Frigate]: https://frigate.video/
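For illustration, a fragment of such a schema might look roughly like
this in CUE; the field names follow the Frigate documentation, but this
is a sketch, not the actual contents of `schema.cue`:

```cue
package frigate

#Config: {
	// Optional object-detection accelerators, e.g. a Coral EdgeTPU.
	detectors?: {[string]: {
		type:    "edgetpu" | "cpu"
		device?: string
	}}

	// Cameras are keyed by name; each streams one or more RTSP inputs
	// through ffmpeg, with roles controlling how each stream is used.
	cameras: {[string]: {
		ffmpeg: inputs: [...{
			path: string // RTSP URL
			roles: [...("detect" | "record")]
		}]
	}}
}
```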
The serial terminal server ("serterm") is a collection of scripts that
automate launching multiple `picocom` processes, one per USB-serial
adapter connected to the system. Each `picocom` process has its own
window in a `tmux` session, which is accessible via SSH on a dedicated
port (20022). Clients connecting to that SSH server will be
automatically attached to the `tmux` session, allowing them to access
the serial terminal server quickly and easily. The SSH server only
allows public-key authentication, so the authorized keys have to be
pre-configured.
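The core of each window launch is simple; a minimal sketch (session
name, baud rate, and log path are illustrative):

```sh
#!/bin/sh
# Create a window in the shared tmux session running picocom for one
# serial adapter, logging all traffic to a per-port file.
dev=$1
name=$(basename "$dev")
tmux new-window -t serterm: -n "$name" \
    "picocom --baud 115200 --logfile /var/log/serterm/$name.log $dev"
```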
In addition to automatically launching `picocom` windows for each serial
port when the terminal server starts, ports that are added (hot-plugged)
while the server is running will have windows created for them
automatically, by way of a udev rule.
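The rule itself only needs to hand the new device node to the
window-creation script; something like this (script path and match keys
are assumptions):

```
# /etc/udev/rules.d/99-serterm.rules (illustrative)
ACTION=="add", SUBSYSTEM=="tty", ENV{ID_BUS}=="usb", \
    RUN+="/usr/local/bin/serterm-add-window %E{DEVNAME}"
```

Programs started from `RUN` must exit quickly, but that is fine here:
the script merely asks the already-running tmux server to create the
window, so `picocom` ends up as a child of the tmux server rather than
of the udev worker.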
Each `picocom` process is configured to log communications with its
respective serial port. This may be useful, for example, to find
diagnostic messages that may not be captured by the `tmux` scrollback
buffer.
I discovered today that if anonymous Grafana users have Viewer
permission, they can use the Datasource API to make arbitrary queries
to any backend, even if they cannot access the Explore page directly.
This is documented ([issue #48313][0]) as expected behavior.
I don't really mind giving anonymous access to the Victoria Metrics
datasource, but I definitely don't want anonymous users to be able to
make Loki queries and view log data. Since Grafana Datasource
Permissions is limited to Grafana Enterprise and not available in
the open source version of Grafana, the official recommendation from
upstream is to use a separate Organization for the Loki datasource.
Unfortunately, this would preclude having dashboards that have graphs
from both data sources. Although I don't have any of those right now, I
like the idea and may build some eventually.
Fortunately, I discovered the `send_user_header` Grafana configuration
option. With this enabled, Grafana will send an `X-Grafana-User` header
with the username of the user on whose behalf it is making a request to
the backend. If the user is not logged in, it does not send the header.
Thus, we can detect the presence of this header on the backend and
refuse to serve query requests if it is missing.
[0]: https://github.com/grafana/grafana/issues/48313
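The relevant pieces of `grafana.ini` end up looking roughly like this
(values illustrative):

```ini
[auth.anonymous]
enabled = true
org_role = Viewer

[dataproxy]
# Send X-Grafana-User with the username of the logged-in user on
# proxied datasource requests; anonymous requests omit the header.
send_user_header = true
```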
Grafana Loki explicitly eschews built-in authentication. In fact, its
[documentation][0] states:
> Operators are expected to run an authenticating reverse proxy in front
> of your services.
While I don't really want to require authentication for agents sending
logs, I definitely want to restrict querying and viewing logs to trusted
users.
There are _many_ reverse proxy servers available, and normally I would
choose _nginx_. In this case, though, I decided to try Caddy, mostly
because of its built-in ACME support. I wasn't really happy with how
the `fetchcert` system turned out, particularly using the Kubernetes API
token for authentication. Since the token will eventually expire, it
will require manual intervention to renew, thus mostly defeating the
purpose of having an auto-renewing certificate. So instead of using
_cert-manager_ to issue the certificate and store it in Kubernetes, and
then having `fetchcert` download it via the Kubernetes API, I set up
_step-ca_ to handle issuing the certificate directly to the server. When
Caddy starts up, it contacts _step-ca_ via ACME and handles the
challenge verification automatically. Further, it will automatically
renew the certificate as necessary, again using ACME.
I didn't spend a lot of time optimizing the Caddy configuration, so
there's some duplication there (i.e. the multiple `reverse_proxy`
statements), but the configuration works as desired. Clients may
provide a certificate, which will be verified against the trusted issuer
CA. If the certificate is valid, the client may access any Loki
resource. Clients that do not provide a certificate can only access the
ingestion path, as well as the "ready" and "metrics" resources.
[0]: https://grafana.com/docs/loki/latest/operations/authentication/
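A rough Caddyfile sketch of the approach (not the production config;
the hostname, CA bundle path, and Loki listen address are assumptions):

```
loki.pyrocufflink.blue {
	tls {
		# Ask for a client certificate and, when one is presented,
		# verify it against the trusted issuer CA.
		client_auth {
			mode verify_if_given
			trusted_ca_cert_file /etc/caddy/dch-ca.crt
		}
	}

	# Requests without a client certificate may only reach the
	# ingestion, readiness, and metrics endpoints.
	@anonymous {
		not path /loki/api/v1/push /ready /metrics
		expression `{http.request.tls.client.subject} == ""`
	}
	respond @anonymous 403

	reverse_proxy localhost:3100
}
```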
The Promtail container image is pretty big, so it takes quite some time
to pull on a slow machine like a Raspberry Pi. Let's increase the
startup timeout so the service is less likely to fail while the image is
still being pulled.
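A drop-in along these lines does the trick (unit name and value are
illustrative):

```ini
# /etc/systemd/system/promtail.service.d/timeout.conf
[Service]
# Allow plenty of time for the initial image pull on slow hardware.
TimeoutStartSec=15min
```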
All the stand-alone FCOS hosts now have Promtail running, forwarding
_systemd_ journal messages to Grafana Loki. The Kubernetes nodes will
have Promtail deployed as a Kubernetes pod.
I would really like to come up with a way to define variables for groups
of hosts, so that I do not have to add `promtail: prod.#promtail` to
every host's values file individually...
[Promtail][0] is the log collection agent for Grafana Loki. It reads
logs from various locations, including local files and the _systemd_
journal, and sends them to Loki via HTTP.
Loki configuration is a highly-structured YAML document. Thus, instead
of using Tera template syntax for loops, conditionals, etc., we can use
the full power of CUE to construct the configuration. Using the
`Marshal` function from the built-in `encoding/yaml` package, we
serialize the final configuration structure as a string and write it
verbatim to the configuration file.
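Schematically, it works like this (the fields shown are illustrative
Promtail options; unification with the schema definitions described
below is omitted):

```cue
package promtail

import "encoding/yaml"

// Build the configuration as a structured CUE value.
config: {
	server: http_listen_port: 9080
	clients: [{url: "https://loki.pyrocufflink.blue/loki/api/v1/push"}]
	scrape_configs: [{
		job_name: "journal"
		journal: labels: job: "systemd-journal"
	}]
}

// Serialize it to a YAML string to be written verbatim to the
// configuration file.
configYAML: yaml.Marshal(config)
```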
I have modeled most of the Promtail configuration schema in the
`du5t1n.me/cfg/app/promtail/schema` package. Having the schema modeled
will ensure the generated configuration is valid during development
(i.e. `cue export` will fail if it is not), which saves pushing changes
to machines only to have Promtail complain.
The `#promtail` "function" in `du5t1n.me/cfg/env/prod` makes it easy to
build our desired configuration. It accepts an optional `#scrape`
field, which can be used to provide specific log scraping definitions.
If it is unspecified, the default configuration is to scrape the systemd
journal. Hosts with additional needs can supply their own list,
probably including the `promtail.scrape.journal` object in it to get the
default journal scrape job.
[0]: https://grafana.com/docs/loki/latest/send-data/promtail/
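The general shape of the pattern, with a stand-in for the real journal
job definition:

```cue
package prod

// Stand-in for the promtail.scrape.journal object.
_journal: {
	job_name: "journal"
	journal: labels: job: "systemd-journal"
}

// The "function": callers may fill in #scrape; everything else is
// computed from it.
#promtail: {
	#scrape: [..._] | *[_journal]
	scrape_configs: #scrape
}

// A host with no special needs takes the default:
promtail: #promtail

// A host with extra logs supplies its own list, e.g.:
//   promtail: #promtail & {#scrape: [_journal, nginxAccessLog]}
```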
According to the [Grafana Loki documentation][0], sending SIGHUP to the
Loki process will instruct it to reload its configuration. This is
necessary in order for it to re-read its server certificate after it has
been renewed.
[0]: https://grafana.com/docs/loki/latest/configure/#reload-at-runtime
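A small drop-in (unit name assumed) lets `systemctl reload loki.service`
deliver the signal:

```ini
# /etc/systemd/system/loki.service.d/reload.conf
[Service]
ExecReload=/usr/bin/kill -HUP $MAINPID
```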
Before going into production with Grafana Loki, I want to set it up to
use TLS. To that end, I have configured _cert-manager_ to issue it a
certificate, signed by _DCH CA_. In order to use said certificate,
we need to configure `fetchcert` to run on the Loki server.
The `fetchcert` tool is a short shell script that fetches an X.509
certificate and corresponding private key from a Kubernetes Secret,
using the Kubernetes API. I originally wrote it for the Frigate server
so it could fetch the _pyrocufflink.blue_ wildcard certificate, which is
managed by _cert-manager_. Since then, I have adapted it to be more
generic, so it will be useful to fetch the _loki.pyrocufflink.blue_
certificate for Grafana Loki.
Although the script is rather simple, it does have several required
configuration parameters. It needs to know the URL of the Kubernetes
API server and have the certificate for the CA that signs the server
certificate, as well as an authorization token. It also needs to know
the namespace and name of the Secret from which it will fetch the
certificate and private key. Finally, it needs to know the paths to the
files where the fetched data will be written.
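Stripped of all the parameter handling, the core of the script amounts
to something like this (the API URL, Secret name, and output paths are
illustrative):

```sh
#!/bin/sh
# Fetch the Secret via the Kubernetes API and write out the PEM files.
api_url=https://kubernetes.pyrocufflink.blue:6443
secret=$(curl --silent --fail \
    --cacert /etc/fetchcert/kube-ca.crt \
    --header "Authorization: Bearer $(cat /etc/fetchcert/token)" \
    "${api_url}/api/v1/namespaces/loki/secrets/loki-cert")
printf '%s' "$secret" | jq -r '.data."tls.crt"' | base64 -d \
    > /etc/loki/tls/server.crt
printf '%s' "$secret" | jq -r '.data."tls.key"' | base64 -d \
    > /etc/loki/tls/server.key
```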
Generally, after certificates are updated, some action needs to be
performed in order to make use of them. This typically involves
restarting or reloading a daemon. Since the `fetchcert` tool runs in
a container, it can't directly perform those actions, so it simply
indicates via a special exit code that the certificate has been updated
and some further action may be needed. The
`/etc/fetchcert/postupdate.sh` script is executed by _systemd_ after
`fetchcert` finishes. If the `EXIT_STATUS` environment variable (which
is set by _systemd_ to the return code of the main service process)
matches the expected code, the configured post-update actions will be
executed.
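A sketch of what such a hook might look like (the "updated" exit code
and the reload action are illustrative):

```sh
#!/bin/sh
# /etc/fetchcert/postupdate.sh, run by systemd after the fetchcert
# service exits (e.g. via ExecStopPost=).  EXIT_STATUS holds the exit
# code of the main process; 3 stands in for the "certificate updated"
# code here.
if [ "$EXIT_STATUS" = "3" ]; then
    systemctl reload loki.service
fi
```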
When _collectd_ calls *statvfs(3)* on paths like
`/host/proc/sys/fs/binfmt_misc` which are configured for auto-mounting,
_systemd_ logs hundreds of messages like these:
```
systemd[1]: proc-sys-fs-binfmt_misc.automount: Got automount request for /proc/sys/fs/binfmt_misc, triggered by 1303 (reader#3)
systemd[1]: proc-sys-fs-binfmt_misc.automount: Automount point already active?
```
Eventually, _collectd_ logs an error:
```
collectd[1132]: statvfs(/host/proc/sys/fs/binfmt_misc) failed: Too many levels of symbolic links
```
This happens on every scrape interval.
To avoid this, we can configure _collectd_ to skip calling *statvfs(3)*
on _autofs_ mount points. Even if it did work correctly, we wouldn't
really want _collectd_ triggering automounts; that would pretty much
defeat the purpose of them.
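One way to express that in `collectd.conf` is to exclude _autofs_
filesystems in the `df` plugin, so they are skipped before *statvfs(3)*
is ever called:

```
<Plugin df>
  FSType "autofs"
  IgnoreSelected true
</Plugin>
```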
Since *systemd* starts the *reload-udev-rules.service* unit as soon as
any file in the `/run/containers/udev-rules` directory changes, the `cp`
command may start before all of the files have been copied out of the
container. If this happens, some of the rules will not get copied to
the final path, and thus will not be processed by *udev*.
To give the container a chance to finish copying all of the files before
we process them, we need a bit of a delay. Obviously, this is not a
perfect solution, as it could potentially take longer than 250ms to copy
the files in some cases, but hopefully those cases are rare enough to
not worry about.
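Concretely, the delay could be as simple as a short sleep before the
copy:

```ini
# Drop-in for reload-udev-rules.service (path illustrative)
[Service]
ExecStartPre=/usr/bin/sleep 0.25
```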
In order to simplify the process of adding new template render
instructions to all hosts, I've created a list of templates in the
`env/prod` module. This way, I only have to add templates there, and
all hosts that "inherit" from it will automatically get them.
I do not like how Fedora CoreOS configures `sudo` to allow the *core*
user to run privileged processes without authentication. Rather than
assign the user a password, which would then have to be stored
somewhere, we'll install *pam_ssh_agent_auth* and configure `sudo` to
use it for authentication. This way, only users with the private key
corresponding to one of the configured public keys can run `sudo`.
Naturally, *pam_ssh_agent_auth* has to be installed on the host system.
We achieve this by executing `rpm-ostree` via `nsenter` to escape the
container. Once it is installed, we configure the PAM stack for
`sudo` to use it and populate the authorized keys database. We also
need to configure `sudo` to keep the `SSH_AUTH_SOCK` environment
variable, so *pam_ssh_agent_auth* knows where to look for the private
keys. Finally, we disable the default NOPASSWD rule for `sudo`, if
and only if the new configuration was installed.
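The pieces involved look roughly like this (the paths are the usual
*pam_ssh_agent_auth* defaults, not necessarily exactly what gets written
here):

```
# /etc/pam.d/sudo: try SSH agent authentication first
auth       sufficient   pam_ssh_agent_auth.so file=/etc/security/authorized_keys

# /etc/sudoers.d/ssh-agent-auth: preserve the agent socket for PAM
Defaults env_keep += "SSH_AUTH_SOCK"
```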
Unfortunately, the automatic transfer switch does not seem to work
correctly. When the standby source is a UPS running on battery, it does
*not* switch sources if the primary fails. In other words, when the
power is out and both UPSes are running on battery, the switch does NOT
fail over to the second UPS when the first one dies. It has no trouble
switching
when the second source is mains power, though, which is very strange.
I have tried messing with all the settings including nominal input
voltage, sensitivity, and frequency tolerance, but none seem to have any
effect.
Since it is more important for the machines to shut down safely than it
is to have an extra 10-15 minutes of runtime during an outage, the best
solution for now is to configure the hosts to shut down as soon as the
first UPS battery gets low. This is largely a waste of the second UPS,
but at least it will help prevent data loss.
The `upsrw` command, which is used to set individual UPS configuration
parameters like low battery level, etc., needs a username and password
to authenticate to `upsd`.
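For example (variable, value, and credentials all illustrative):

```sh
upsrw -s battery.charge.low=50 -u upsadmin -p "$UPS_PASSWORD" cyberpower@localhost
```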
Setting `AutoUpdate=registry` will tell Podman to automatically fetch
an updated container image from its corresponding registry and restart
the container. The `podman-auto-update.timer` systemd unit needs to be
active for this to happen on a schedule.
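Assuming the container is defined by a Quadlet `.container` unit, that
is just:

```ini
[Container]
# Pull a newer image from the registry and restart the container when
# podman-auto-update runs.
AutoUpdate=registry
```

plus `systemctl enable --now podman-auto-update.timer` on the host.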
Since the "primary" `upsmon` is always (for our purposes) running on the
same host as `upsd`, there's no reason to specify both values.
All systems need a shutdown command; one is not set by default.
The primary system is the only one that should send notifications.
`dest` is not a valid key for the `--mount` argument to `podman`. To
specify the target path, only `target`, `destination`, and `dst` are
valid.
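For example (paths and image are illustrative):

```sh
podman run --rm \
    --mount type=bind,source=/run/udev,target=/run/udev \
    registry.fedoraproject.org/fedora:40 ls /run/udev
```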
`upsmon` is the component of NUT that tracks the status of UPSes and
reacts to changes in their status by sending notifications and/or
shutting down
the system. It is a networked application that can run on any system;
it can run on a different system than `upsd`, and indeed can run on
multiple systems simultaneously.
Each system that runs `upsmon` will need a username and password for
each UPS it will monitor. Using the CUE [function pattern][0], I've
made it pretty simple to declare the necessary values under
`nut.monitor`.
[0]: https://cuetorials.com/patterns/functions/
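The result in `upsmon.conf` is one `MONITOR` line per UPS, roughly
(names and credentials made up):

```
MONITOR cyberpower@localhost 1 monuser s3cret primary
MONITOR tripplite@ups2.pyrocufflink.blue 1 monuser s3cret secondary
SHUTDOWNCMD "/usr/sbin/shutdown -h +0"
```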
*collectd* logs to syslog, so its output is lost when it's running in a
container. We can capture messages from it by mounting the journald
syslog socket into the container.
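That boils down to a bind mount like this on the container (only the
relevant flag is shown; the exact socket path and where it appears
inside the container are assumptions):

```sh
--volume /run/systemd/journal/dev-log:/dev/log
```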
The `/run/udev/rules.d` directory may not always exist, especially at
boot. We need to ensure that it does before we try to copy rules
exported by containers into it, or the unit will fail.
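One way to do that is an `ExecStartPre=` in the unit that copies the
rules:

```ini
[Service]
ExecStartPre=/usr/bin/mkdir -p /run/udev/rules.d
```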
Even with *collectd* configured to report filesystem usage by device, it
still only reports filesystems that are mounted (in its namespace).
Thus, in order for it to report filesystems like `/boot`, these need to
be mounted in the container.
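e.g. (the mount path inside the container is an assumption):

```sh
--volume /boot:/boot:ro
```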