Using domain names in the "blackbox" probe makes it difficult to tell
the difference between a complete Internet outage and DNS issues. I
switched to using these names when I changed how the firewall routed
traffic to the public DNS servers, since those were the IP addresses
I was using to determine if the Internet was "up." I think it makes
sense, though, to just ping the upstream gateway for that check. If
EverFast changes their routing or numbering, we'll just have to update
our checks to match.
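For reference, the gateway check could look roughly like this as a Blackbox ICMP probe; the gateway and exporter addresses below are placeholders, not the real values, and the `icmp` module name assumes one is defined in the exporter's configuration.

```yaml
# Hypothetical sketch: ping the upstream gateway through the Blackbox
# exporter's ICMP module. Addresses are placeholders.
- job_name: blackbox-icmp
  metrics_path: /probe
  params:
    module: [icmp]
  static_configs:
    - targets:
        - 192.0.2.1                       # placeholder for the upstream gateway
  relabel_configs:
    # The probe target becomes the "target" URL parameter...
    - source_labels: [__address__]
      target_label: __param_target
    # ...and the instance label, so results are attributed to the gateway.
    - source_labels: [__param_target]
      target_label: instance
    # The actual scrape goes to the Blackbox exporter itself.
    - target_label: __address__
      replacement: blackbox-exporter:9115 # placeholder exporter address
```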
The VM hosts are now managed by the "main" Ansible inventory and thus
appear in the host list ConfigMap. As such, they do not need to be
listed explicitly in the static targets list.
I'm not using Matrix for anything anymore, and it seems to have gone
offline. I haven't fully decommissioned it yet, but the Blackbox scrape
is failing, so I'll just disable that bit for now.
This machine never worked correctly; the USB-RS232 adapters would stop
working randomly (and of course it would be whenever I needed to
actually use them). I thought it was something wrong with the server
itself (a Raspberry Pi 3), but the same thing happened when I tried
using a Pi 4.
The new backup server has a plethora of on-board RS-232 ports, so I'm
going to use it as the serial console server, too.
I've rebuilt the Unifi Network controller machine (again);
*unifi3.pyrocufflink.blue* has replaced *unifi2.p.b*. The
`unifi_exporter` no longer works with the latest version of Unifi
Network, so it's not deployed on the new machine.
The [postgres exporter][0] exposes metrics about the operation and
performance of a PostgreSQL server. It's currently deployed on
_db0.pyrocufflink.blue_, the primary server of the main PostgreSQL
cluster.
[0]: https://github.com/prometheus-community/postgres_exporter
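For reference, a minimal scrape job for it could look something like this; 9187 is the exporter's default listen port, and the job name is arbitrary.

```yaml
# Sketch: scrape postgres_exporter on the primary database server.
- job_name: postgres
  static_configs:
    - targets:
        - db0.pyrocufflink.blue:9187   # postgres_exporter default port
```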
All the Kubernetes nodes (except *k8s-ctrl0*) are now running Fedora
CoreOS. We can therefore use the Kubernetes API to discover scrape
targets for the Zincati job.
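Roughly, the job could discover the nodes like this; the port is a placeholder for however Zincati's metrics end up exposed over the network on each node, which is site-specific.

```yaml
# Sketch: let the Kubernetes API enumerate the nodes for the Zincati job.
- job_name: zincati
  kubernetes_sd_configs:
    - role: node
  relabel_configs:
    # Scrape each node by name on the (assumed) Zincati metrics port.
    - source_labels: [__meta_kubernetes_node_name]
      target_label: __address__
      replacement: '${1}:9099'   # placeholder port
```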
We don't need to list every host individually.
Domain controllers, for example, are registered in DNS with SRV records.
Kubernetes nodes, of course, can be discovered using the Kubernetes API.
Both of these classes of nodes change frequently, so discovering them
dynamically is convenient.
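As an illustration, the domain controllers could be discovered from their SRV record like this; the record name and the exporter port (collectd's *write_prometheus* default, 9103) are assumptions.

```yaml
# Sketch: discover domain controllers via DNS SRV instead of static targets.
- job_name: collectd-dc
  dns_sd_configs:
    - names:
        - _ldap._tcp.pyrocufflink.blue   # assumed SRV record for the DCs
      type: SRV
  relabel_configs:
    # The SRV record points at the LDAP port, so swap in the exporter's port.
    - source_labels: [__address__]
      regex: '([^:]+):\d+'
      target_label: __address__
      replacement: '${1}:9103'
```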
I don't like having alerts sent by e-mail. Since I don't get e-mail
notifications on my watch, I often do not see alerts for quite some
time. They are also much harder to read in an e-mail client (the Fastmail
web interface and K-9 Mail both display them poorly). I would much rather have
them delivered via _ntfy_, just like all the rest of the ephemeral
notifications I receive.
Fortunately, it is easy enough to integrate Alertmanager and _ntfy_
using the webhook notifier in Alertmanager. Since _ntfy_ does not
natively support the Alertmanager webhook API, though, a bridge is
necessary to translate from one data format to the other. There are a
few options for this bridge, but I chose
[alexbakker/alertmanager-ntfy][0] because it looked the most complete
while also having the simplest configuration format. Sadly, it does not
expose any Prometheus metrics itself, and since it's deployed in the
_victoria-metrics_ namespace, it needs to be explicitly excluded from
the VMAgent scrape configuration.
[0]: https://github.com/alexbakker/alertmanager-ntfy
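On the Alertmanager side, the wiring is just a webhook receiver pointed at the bridge; a rough sketch (the Service name, port, and path here are placeholders for however the bridge is exposed in the cluster):

```yaml
# Sketch: route all notifications to the alertmanager-ntfy bridge.
route:
  receiver: ntfy
receivers:
  - name: ntfy
    webhook_configs:
      - url: http://alertmanager-ntfy.victoria-metrics.svc:8000/hook  # placeholder URL
        send_resolved: true
```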
Patroni, a component of the *postgres operator*, exports metrics about
the PostgreSQL database servers it manages. Notably, it provides
information about the current transaction log location for each server.
This allows us to monitor and alert on the health of database replicas.
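A minimal scrape job for it might look like this; 8008 is Patroni's default REST API port, and the target list is just illustrative (the real configuration may discover the database servers differently).

```yaml
# Sketch: scrape Patroni's REST API metrics endpoint on each database server.
- job_name: patroni
  static_configs:
    - targets:
        - db0.pyrocufflink.blue:8008   # Patroni REST API default port
```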
The *promtail* job scrapes metrics from all the hosts running Promtail.
The static targets are Fedora CoreOS nodes that are not part of the
Kubernetes cluster.
The relabeling rules ensure that both the static targets and the
targets discovered via the Kubernetes Node API use the FQDN of the host
as the value of the *instance* label.
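A fragment showing the idea, assuming the Kubernetes node names are already FQDNs:

```yaml
relabel_configs:
  # Static targets: the address is "fqdn:port", so strip the port off to
  # get the instance label.
  - source_labels: [__address__]
    regex: '([^:]+):\d+'
    target_label: instance
  # Targets discovered via the Kubernetes node API: overwrite that with
  # the node name, which is already the FQDN.
  - source_labels: [__meta_kubernetes_node_name]
    regex: '(.+)'
    target_label: instance
```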
Grafana Loki is hosted on a VM named *loki0.pyrocufflink.blue*. It runs
Fedora CoreOS, so in addition to scraping Loki itself, we need to scrape
_collectd_ and _Zincati_ as well.
Graylog is down because Elasticsearch corrupted itself again, and this
time, I'm just not going to bother fixing it. I practically never use
it anymore anyway, and I want to migrate to Grafana Loki, so now seems
like a good time to just get rid of it.
The *virt* plugin for *collectd* sets `instance` to the name of the
libvirt domain the metric refers to. As a result, there is no label
identifying which host the VM is running on. Thus, if we want to
classify metrics by VM host, we need to add that label explicitly.
Since the `__address__` label is not available during metric relabeling,
we need to store it in a temporary label, which gets dropped at the end
of the relabeling phase. We copy the value of that label into a new
label, but only for metrics that match the desired metric name.
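The relevant pieces of the scrape job could look roughly like this; the label names, metric prefix, and layout are illustrative, not the exact values in use.

```yaml
relabel_configs:
  # Stash the scrape address (the VM host) in an ordinary label so it
  # survives into the metric relabeling phase.
  - source_labels: [__address__]
    regex: '([^:]+)(?::\d+)?'
    target_label: tmp_vmhost
metric_relabel_configs:
  # Copy the stashed value into a permanent label, but only for the
  # libvirt metrics (assumed prefix), which carry the domain name in
  # their instance label.
  - source_labels: [__name__, tmp_vmhost]
    regex: 'collectd_virt_.*;(.+)'
    target_label: vmhost
  # Drop the temporary label from every scraped series.
  - action: labeldrop
    regex: tmp_vmhost
```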
Scraping the public DNS servers doesn't work anymore since the firewall
routes traffic through Mullvad. Pinging public cloud providers should
give a pretty decent indication of Internet connectivity. It will also
serve as a benchmark for the local DNS performance, since the names will
have to be resolved.
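A rule built on these probes might look something like this; the job name and threshold are assumptions, and only an all-targets failure is treated as an outage so a single provider being down doesn't count.

```yaml
# Sketch: treat "every public probe failing" as an Internet outage.
groups:
  - name: internet
    rules:
      - alert: InternetDown
        # max() == 0 means every probe_success series is 0, i.e. no public
        # target is reachable.
        expr: max(probe_success{job="blackbox-icmp"}) == 0
        for: 5m
        labels:
          severity: critical
        annotations:
          summary: No public probe targets are reachable
```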
By default, the `instance` label for discovered metrics targets is set
to the scrape address. For Kubernetes pods, that is the IP address and
port of the pod, which naturally changes every time the pod is recreated
or moved. This would cause a lot of time series churn for the Longhorn manager metrics.
To avoid this, we set the `instance` label to the name of the node the
pod is running on, which will not change because the Longhorn manager
pods are managed by a DaemonSet.
Each Longhorn manager pod exports metrics about the node on which it is
running. Thus, we have to scrape every pod to get the metrics about the
whole ecosystem.
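Putting the two together, the job could look roughly like this; the namespace, pod label, and port assume Longhorn's defaults and may need adjusting.

```yaml
# Sketch: scrape every longhorn-manager pod, but label the series by node.
- job_name: longhorn
  kubernetes_sd_configs:
    - role: pod
      namespaces:
        names: [longhorn-system]
  relabel_configs:
    # Keep only the manager pods.
    - source_labels: [__meta_kubernetes_pod_label_app]
      regex: longhorn-manager
      action: keep
    # Use the node name instead of the pod IP:port so the series don't
    # churn every time a pod is recreated.
    - source_labels: [__meta_kubernetes_pod_node_name]
      target_label: instance
```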
Since *mtrcs0.pyrocufflink.blue* (the Metrics Pi) seems to be dying,
I decided to move monitoring and alerting into Kubernetes.
I was originally planning to have a single, dedicated virtual machine
for Victoria Metrics and Grafana, similar to how the Metrics Pi was set
up, but running Fedora CoreOS instead of a custom Buildroot-based OS.
While I was working on the Ignition configuration for the VM, it
occurred to me that monitoring would be interrupted frequently, since
FCOS updates weekly and all updates require a reboot. I would rather
not have that many gaps in the data. Ultimately I decided that
deploying a cluster with Kubernetes would probably be more robust and
reliable, as updates can be performed without any downtime at all.
I chose not to use the Victoria Metrics Operator, but rather handle
the resource definitions myself. Victoria Metrics components are not
particularly difficult to deploy, so the overhead of running the
operator and using its custom resources would not be worth the minor
convenience it provides.