The Unifi controller consists of three containerized processes:
* Unifi Network itself
* unifi_exporter for monitoring and metrics via Prometheus
* Caddy for HTTPS
_unifi_exporter_ is really the only component with any configuration.
Unfortunately, it mixes secret and non-secret data in a single YAML
file, which makes it impossible to use `yaml.Marshal` to render the
configuration directly from the CUE source.
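As a sketch of how this plays out, the rendered file has to be a Tera
template rather than pure `yaml.Marshal` output: the non-secret values
can still come from CUE, while the password is spliced in from the
secret store at render time. The field names here are illustrative,
not necessarily the exact _unifi_exporter_ schema:

```yaml
# Illustrative unifi_exporter configuration template: everything except
# the password placeholder could be generated from the CUE source.
listen:
  address: ":9130"
unifi:
  address: "https://localhost:8443"
  username: "exporter"
  # Filled in by the template renderer, so the secret never has to
  # live in the CUE source.
  password: "{{ unifi_password }}"
  site: "Default"
```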
The serial terminal server ("serterm") is a collection of scripts that
automate launching multiple `picocom` processes, one per USB-serial
adapter connected to the system. Each `picocom` process has its own
window in a `tmux` session, which is accessible via SSH on a dedicated
port (20022). Clients connecting to that SSH server will be
automatically attached to the `tmux` session, allowing them to access
the serial terminal server quickly and easily. The SSH server only
allows public-key authentication, so the authorized keys have to be
pre-configured.
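A minimal sketch of the dedicated SSH daemon configuration, assuming a
second `sshd` instance and a session named `serterm` (both assumptions
for illustration):

```
Port 20022
PasswordAuthentication no
PubkeyAuthentication yes
AuthorizedKeysFile /etc/serterm/authorized_keys
# Every successful login lands directly in the shared tmux session.
ForceCommand tmux attach-session -t serterm
```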
In addition to automatically launching `picocom` windows for each serial
port when the terminal server starts, ports that are added (hot-plugged)
while the server is running will have windows created for them
automatically, by way of a udev rule.
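Roughly, the rule and its helper might look like the following; the
script path, baud rate, and log location are assumptions, and in
practice the helper may need to address the tmux server's socket
explicitly (e.g. `tmux -S`):

```
# /etc/udev/rules.d/99-serterm.rules (sketch)
ACTION=="add", SUBSYSTEM=="tty", KERNEL=="ttyUSB*", \
  RUN+="/usr/local/bin/serterm-add %k"
```

```sh
#!/bin/sh
# serterm-add (sketch): %k arrives as e.g. "ttyUSB0". udev RUN
# processes are short-lived, but this only needs to send a command
# to the already-running tmux server.
dev="$1"
tmux new-window -d -t serterm -n "$dev" \
  "picocom --baud 115200 --logfile /var/log/serterm/$dev.log /dev/$dev"
```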
Each `picocom` process is configured to log communications with its
respective serial port. This may be useful, for example, to find
diagnostic messages that may not be captured by the `tmux` scrollback
buffer.
Grafana Loki explicitly eschews built-in authentication. In fact, its
[documentation][0] states:
> Operators are expected to run an authenticating reverse proxy in front
> of your services.
While I don't really want to require authentication for agents sending
logs, I definitely want to restrict querying and viewing logs to trusted
users.
There are _many_ reverse proxy servers available, and normally I would
choose _nginx_. In this case, though, I decided to try Caddy, mostly
because of its built-in ACME support. I wasn't really happy with how
the `fetchcert` system turned out, particularly its use of a Kubernetes
API token for authentication. Since the token will eventually expire, it
will require manual intervention to renew, thus mostly defeating the
purpose of having an auto-renewing certificate. So instead of using
_cert-manager_ to issue the certificate and store it in Kubernetes, and
then having `fetchcert` download it via the Kubernetes API, I set up
_step-ca_ to handle issuing the certificate directly to the server. When
Caddy starts up, it contacts _step-ca_ via ACME and handles the
challenge verification automatically. Further, it will automatically
renew the certificate as necessary, again using ACME.
I didn't spend a lot of time optimizing the Caddy configuration, so
there's some duplication there (i.e. the multiple `reverse_proxy`
statements), but the configuration works as desired. Clients may
provide a certificate, which will be verified against the trusted issuer
CA. If the certificate is valid, the client may access any Loki
resource. Clients that do not provide a certificate can only access the
ingestion path, as well as the "ready" and "metrics" resources.
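Condensed to a single `reverse_proxy` for illustration (my actual
configuration duplicates it, as noted), the policy looks roughly like
this; the hostnames, CA URL, and file paths are assumptions, and the
exact matcher syntax may differ:

```
{
	# Caddy obtains and renews its own certificate from step-ca via ACME.
	acme_ca https://step-ca.example.com/acme/acme/directory
}

loki.example.com {
	tls {
		client_auth {
			# Client certificates are optional, but are validated
			# against the trusted issuer CA when presented.
			mode                 verify_if_given
			trusted_ca_cert_file /etc/caddy/issuer-ca.pem
		}
	}

	# Without a verified client certificate, only ingestion and the
	# ready/metrics endpoints are reachable.
	@restricted {
		expression {http.request.tls.client.subject} == ""
		not path /loki/api/v1/push /ready /metrics
	}
	respond @restricted 403

	reverse_proxy localhost:3100
}
```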
[0]: https://grafana.com/docs/loki/latest/operations/authentication/
All the stand-alone FCOS hosts now have Promtail running, forwarding
_systemd_ journal messages to Grafana Loki. The Kubernetes nodes will
have Promtail deployed as a Kubernetes pod.
I would really like to come up with a way to define variables for groups
of hosts, so that I do not have to add `promtail: prod.#promtail` to
every host's values file individually...
[Promtail][0] is the log collection agent for Grafana Loki. It reads
logs from various locations, including local files and the _systemd_
journal, and sends them to Loki via HTTP.
Promtail configuration is a highly structured YAML document. Thus, instead
of using Tera template syntax for loops, conditionals, etc., we can use
the full power of CUE to construct the configuration. Using the
`Marshal` function from the built-in `encoding/yaml` package, we
serialize the final configuration structure as a string and write it
verbatim to the configuration file.
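The core of the pattern is just this; the package layout, file path,
and client URL are illustrative:

```cue
package promtail

import "encoding/yaml"

config: {
	server: http_listen_port: 9080
	clients: [{url: "https://loki.example.com/loki/api/v1/push"}]
}

// No Tera logic at all: the "template" is the finished YAML
// document, serialized in one step.
files: "/etc/promtail/config.yaml": contents: yaml.Marshal(config)
```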
I have modeled most of the Promtail configuration schema in the
`du5t1n.me/cfg/app/promtail/schema` package. Having the schema modeled
will ensure the generated configuration is valid during development
(i.e. `cue export` will fail if it is not), which saves the round trip
of pushing a broken configuration to a machine only to have Promtail
complain.
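The schema boils down to definitions along these lines (coverage
abbreviated to a few fields):

```cue
package schema

#Config: {
	server?: {
		http_listen_port?: int & >0 & <65536
		...
	}
	clients: [...#Client]
	scrape_configs?: [...#ScrapeConfig]
}

#Client: {
	url: string
	...
}

#ScrapeConfig: {
	job_name: string
	...
}
```

Unifying the generated configuration with `#Config` means a typo or a
wrong type fails at `cue export` time, before anything reaches a
machine.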
The `#promtail` "function" in `du5t1n.me/cfg/env/prod` makes it easy to
build our desired configuration. It accepts an optional `#scrape`
field, which can be used to provide specific log scraping definitions.
If it is unspecified, the default configuration is to scrape the systemd
journal. Hosts with additional needs can supply their own list,
probably including the `promtail.scrape.journal` object in it to get the
default journal scrape job.
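In rough strokes (the journal job is abbreviated and the client URL is
illustrative):

```cue
scrape: journal: {
	job_name: "journal"
	journal: labels: job: "systemd-journal"
	// relabeling and other details omitted
}

#promtail: {
	// Optional input: callers may supply their own scrape jobs; the
	// default is to scrape only the systemd journal.
	#scrape: [...] | *[scrape.journal]

	clients: [{url: "https://loki.example.com/loki/api/v1/push"}]
	scrape_configs: #scrape
}

// A host with extra needs keeps the journal job alongside its own:
somehost: #promtail & {
	#scrape: [scrape.journal, {job_name: "nginx"}]
}
```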
[0]: https://grafana.com/docs/loki/latest/send-data/promtail/
Before going into production with Grafana Loki, I want to set it up to
use TLS. To that end, I have configured _cert-manager_ to issue it a
certificate, signed by _DCH CA_. In order to use said certificate,
we need to configure `fetchcert` to run on the Loki server.
In order to simplify the process of adding new template render
instructions to all hosts, I've created a list of templates in the
`env/prod` module. This way, I only have to add templates there, and
all hosts that "inherit" from it will automatically get them.
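Something along these lines; the field names and paths are invented
for illustration:

```cue
package prod

// Every host that unifies with this environment picks up the full
// list of render instructions automatically.
templates: [
	{src: "promtail/config.yaml.tmpl", dest: "/etc/promtail/config.yaml"},
	{src: "fetchcert/env.tmpl", dest: "/etc/fetchcert/env"},
]
```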
I do not like how Fedora CoreOS configures `sudo` to allow the *core*
user to run privileged processes without authentication. Rather than
assign the user a password, which would then have to be stored
somewhere, we'll install *pam_ssh_agent_auth* and configure `sudo` to
use it for authentication. This way, only users with the private key
corresponding to one of the configured public keys can run `sudo`.
Naturally, *pam_ssh_agent_auth* has to be installed on the host system.
We achieve this by executing `rpm-ostree` via `nsenter` to escape the
container. Once it is installed, we configure the PAM stack for
`sudo` to use it and populate the authorized keys database. We also
need to configure `sudo` to keep the `SSH_AUTH_SOCK` environment
variable, so *pam_ssh_agent_auth* knows where to find the user's SSH
agent. Finally, we disable the default NOPASSWD rule for `sudo`, if
and only if the new configuration was installed.
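Pieced together, the moving parts look roughly like this; the paths
and exact PAM placement are assumptions:

```sh
# Install the module on the host from inside the container.
nsenter -t 1 -m -- rpm-ostree install --apply-live pam_ssh_agent_auth

# /etc/pam.d/sudo gains a line like this, ahead of the usual stack:
#   auth  sufficient  pam_ssh_agent_auth.so file=/etc/security/authorized_keys

# A sudoers drop-in preserves the agent socket for the module:
#   Defaults env_keep += "SSH_AUTH_SOCK"
```

With that in place, `sudo` only succeeds when the invoking user's
forwarded agent (hence `ssh -A`) holds a key matching one of the
configured public keys.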
`upsmon` is the component of NUT that tracks the status of UPSs and
reacts to status changes by sending notifications and/or shutting down
the system. It is a networked application: it can run on a different
system than `upsd`, and indeed on multiple systems simultaneously.
Each system that runs `upsmon` will need a username and password for
each UPS it will monitor. Using the CUE [function pattern][0], I've
made it pretty simple to declare the necessary values under
`nut.monitor`.
[0]: https://cuetorials.com/patterns/functions/
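Declaring a monitored UPS then looks something like this; the names
and values are illustrative, mirroring what `upsmon` needs for its
`MONITOR` directive:

```cue
nut: monitor: "ups@nut-server.example.com": {
	user:     "monuser"
	password: "s3cret" // in practice, pulled from the secret store
	// How many power supplies this UPS feeds on this host, and whether
	// this host is the primary or a secondary monitor.
	powervalue: 1
	type:       "secondary"
}
```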
A bunch of stuff that wasn't schema definitions ended up in the `schema`
package. Rather than splitting values across a bunch of top-level
packages, I think it would be better to have a package-per-app model.
Although KCL is unquestionably a more powerful language, and maps more
closely to my mental model of how host/environment/application
configuration is defined, the fact that it doesn't work on ARM (issue
982) makes it a non-starter. It's also quite slow (owing to how it
compiles a program to evaluate the code) and cumbersome to distribute.
Fortunately, `tmpl` doesn't care how the values it uses were computed,
so we can freely change configuration languages, so long as whatever we use
generates JSON/YAML.
CUE is probably a lot more popular than KCL, and is quite a bit simpler.
It's more restrictive (values cannot be overridden once defined), but
still expressive enough for what I am trying to do (so far).
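The restriction is easy to demonstrate: CUE unifies values rather than
overriding them, so redefining a field is only legal when the values
are compatible.

```cue
a: 1
a: 1  // fine: identical values unify
b: int
b: 42 // fine: a concrete value refines a type
c: 1
// c: 2 would be an error: conflicting values 1 and 2
```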