The _updatebot_ has been running with an old configuration for a while,
so while it was correctly identifying updates to ZWaveJS UI and
Zigbee2MQTT, it was generating overrides for the incorrect OCI image
names.
Buildroot jobs really benefit from having a persistent workspace volume
instead of an ephemeral one. This way, only the packages, etc. that
have changed since the last build need to be built, instead of the whole
toolchain and operating system.
As with AlertManager, the point of having multiple replicas of `vmagent`
is so that one is always running, even if the other fails. Thus, we
want to start the pods in parallel so that if the first one does not
come up, the second one at least has a chance.
If something prevents the first AlertManager instance from starting, we
don't want to wait forever for it before starting the second. That
pretty much defeats the purpose of having two instances. Fortunately,
we can configure Kubernetes to bring up both instances simultaneously by
setting the pod management policy to `Parallel`.
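A minimal sketch of the relevant field, assuming AlertManager is deployed as a plain StatefulSet. The resource name, labels, and image reference are placeholders, not values taken from this repository:

```yaml
apiVersion: apps/v1
kind: StatefulSet
metadata:
  name: alertmanager            # placeholder name
spec:
  serviceName: alertmanager     # placeholder headless Service
  replicas: 2
  # Create all pods at once instead of waiting for each ordinal
  # to become Ready before starting the next one.
  podManagementPolicy: Parallel
  selector:
    matchLabels:
      app: alertmanager
  template:
    metadata:
      labels:
        app: alertmanager
    spec:
      containers:
        - name: alertmanager
          image: quay.io/prometheus/alertmanager   # tag intentionally omitted
```

Note that `podManagementPolicy` cannot be changed on an existing StatefulSet, so the resource has to be deleted and recreated for this to take effect.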
We also don't need a 4 GB volume for AlertManager; even 500 MB is
way too big for the tiny amount of data it stores, but that's about the
smallest size a filesystem can be.
The `cert-exporter` is no longer needed. All websites manage their own
certificates with _mod_md_ now, and all internal applications that use
the wildcard certificate fetch it directly from the Kubernetes Secret.
_bw0.pyrocufflink.blue_ was decommissioned some time ago, so it
doesn't get backed up any more. We want to keep its previous backups
around, though, in case we ever need to restore something. This
triggers the "no recent backups" alert, since the last snapshot is over
a week old. Let's ignore that hostname when generating this alert.
The `vmagent` needs a place to spool data it has not yet sent to
Victoria Metrics, but it doesn't really need to be persistent. As long
as all of the `vmagent` nodes _and_ all of the `vminsert` nodes do not
go down simultaneously, there shouldn't be any data loss. If they are
all down at the same time, there's probably something else going on and
lost metrics are the least concerning problem.
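Roughly, the pod template ends up looking like this. The container name and mount path are assumptions; `-remoteWrite.tmpDataPath` is the `vmagent` flag that controls where unsent data is buffered:

```yaml
# Pod template fragment only
spec:
  containers:
    - name: vmagent                      # assumed container name
      args:
        - -remoteWrite.tmpDataPath=/spool   # assumed path
      volumeMounts:
        - name: spool
          mountPath: /spool
  volumes:
    - name: spool
      # Node-local scratch space; discarded when the pod is deleted
      emptyDir: {}
```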
The _dynk8s-provisioner_ only needs writable storage to store copies of
the AWS SNS notifications it receives for debugging purposes. We don't
need to keep these around indefinitely, so using ephemeral node-local
storage is sufficient. I actually want to get rid of that "feature"
anyway...
Although Firefly III works on a Raspberry Pi, a few things are pretty
slow. Notably, the search feature takes a really long time to return
any results, which is particularly annoying when trying to add a receipt
via the Receipts app. Adding a node affinity rule to prefer running on
an x86_64 machine will ensure that it runs fast whenever possible, but
can fall back to running on a Rasperry Pi if necessary.
The "cron" container has not been working correctly for some time. No
background tasks are getting run, and this error is printed in the log
every minute:
> `Target class [db.schema] does not exist`
It turns out, this is because of the way the PHP `artisan` tool works.
It MUST be able to write to the code directory, apparently to build some
kind of cache. There may be a way to cache the data ahead of time, but
I haven't found it yet. For now, it seems the only way to make
Laravel-based applications run in a container is to make the container
filesystem mutable.
Music Assistant doesn't expose any metrics natively. Since we really
only care about whether or not it's accessible, scraping it with the
blackbox exporter is fine.
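For reference, the usual shape of a blackbox scrape job looks something like the following. The target URL, module name, and exporter address are placeholders, not the values used here:

```yaml
scrape_configs:
  - job_name: blackbox
    metrics_path: /probe
    params:
      module: [http_2xx]                 # assumed module name
    static_configs:
      - targets:
          - http://music-assistant.example:8095/   # placeholder URL
    relabel_configs:
      # Pass the original target to the exporter as ?target=...
      - source_labels: [__address__]
        target_label: __param_target
      # Keep the probed URL as the instance label
      - source_labels: [__param_target]
        target_label: instance
      # Send the actual scrape to the blackbox exporter itself
      - target_label: __address__
        replacement: blackbox-exporter:9115        # placeholder address
```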
In order to allow access to Authelia from outside the LAN, it needs to
be able to handle the _pyrocufflink.net_ domain in addition to
_pyrocufflink.blue_. Originally, this was not possible, as Authelia
only supported a single cookie/domain. Now that it supports multiple
cookies, we can expose both domains.
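The multi-cookie configuration is a list under `session.cookies`, one entry per domain. The `auth.` host names below are assumptions for illustration:

```yaml
session:
  cookies:
    - domain: pyrocufflink.blue
      authelia_url: https://auth.pyrocufflink.blue   # assumed host name
    - domain: pyrocufflink.net
      authelia_url: https://auth.pyrocufflink.net    # assumed host name
```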
The main reason for doing this now is to use Authelia's password reset
capability for Mom, since she didn't have a password for her Nextcloud
account that she's just begun using.
I wrote a Thunderbird add-on for my work computer that periodically
exports my entire DTEX calendar to a file. Unfortunately, the file it
creates is not directly usable by the kitchen screen server currently;
it seems to use a time zone identifier that `tzinfo` doesn't understand:
```
Error in background update:
Traceback (most recent call last):
  File "/usr/local/kitchen/lib64/python3.12/site-packages/kitchen/service/agenda.py", line 19, in _background_update
    await self._update()
  File "/usr/local/kitchen/lib64/python3.12/site-packages/kitchen/service/agenda.py", line 34, in _update
    calendar = await self.fetch_calendar(
               ^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/kitchen/lib64/python3.12/site-packages/kitchen/service/caldav.py", line 39, in fetch_calendar
    return icalendar.Calendar.from_ical(r.text)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/kitchen/lib64/python3.12/site-packages/icalendar/cal.py", line 369, in from_ical
    _timezone_cache[component['TZID']] = component.to_tz()
                                         ^^^^^^^^^^^^^^^^^
  File "/usr/local/kitchen/lib64/python3.12/site-packages/icalendar/cal.py", line 659, in to_tz
    return cls()
           ^^^^^
  File "/usr/local/kitchen/lib64/python3.12/site-packages/pytz/tzinfo.py", line 190, in __init__
    self._transition_info[0])
    ~~~~~~~~~~~~~~~~~~~~~^^^
IndexError: list index out of range
```
It seems to work fine in Nextcloud, though, so the work-around is to
import it as a subscription in Nextcloud and then read it from there,
using Nextcloud as a sort of proxy.
There is not (currently) an aarch64 build of the kitchen screen server,
so we need to force the pod to run on an x86_64 node. This seems a good
candidate for running on a Raspberry Pi, so I should go ahead and build
a multi-arch image.
_democratic-csi_ can also dynamically resize Synology iSCSI LUNs when
PVC resource requests increase. This requires enabling the external
resizer in the controller pod and marking the StorageClass as supporting
resize.
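On the StorageClass side, this amounts to a single field. The class name and CSI driver name below are assumptions; the driver name must match whatever the _democratic-csi_ deployment registers:

```yaml
apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: synology-iscsi                            # placeholder name
provisioner: org.democratic-csi.iscsi-synology    # assumed driver name
# Allows PVCs of this class to be grown by raising
# spec.resources.requests.storage
allowVolumeExpansion: true
```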
The _democratic-csi_ controller can create Synology LUN snapshots based
on VolumeSnapshot resources. This feature can be used to e.g. create
data snapshots before upgrades, etc.
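A snapshot request would look roughly like this. All names (snapshot, namespace, class, and PVC) are hypothetical:

```yaml
apiVersion: snapshot.storage.k8s.io/v1
kind: VolumeSnapshot
metadata:
  name: pre-upgrade                  # hypothetical snapshot name
  namespace: example                 # hypothetical namespace
spec:
  volumeSnapshotClassName: synology-iscsi   # hypothetical class
  source:
    # The PVC must live in the same namespace as the snapshot
    persistentVolumeClaimName: example-data # hypothetical PVC
```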
Deploying _democratic-csi_ to manage PersistentVolumeClaim resources,
mapping them to iSCSI volumes on the Synology.
Eventually, all Longhorn-managed PVCs will be replaced with Synology
iSCSI volumes. Getting rid of Longhorn should free up a lot of
resources and remove a point of failure from the cluster.
This hacky work-around is no longer necessary, as I've figured out why
the players don't (always) get rediscovered when the server restarts.
It turns out, Avahi on the firewall was caching responses to the mDNS PTR
requests Music Assistant makes. Rather than forward the requests to the
other VLANs, it would respond with its cached information, but in a way
that Music Assistant didn't understand. Setting `cache-entries-max` to
`0` in `avahi-daemon.conf` on the firewall resolved the issue.
This reverts commit 42a7964991.
I haven't fully determined why, but when the Music Assistant server
restarts, it marks the _shairport-sync_ players as offline and will not
allow playing to them. The only way I have found to work around this is
to restart the players after the server restarts. As that's pretty
cumbersome and annoying, I naturally want to automate it, so I've
created this rudimentary synchronization technique using _ntfy_: each
player listens for notifications on a specific topic, and upon receiving
one, tells _shairport-sync_ to exit. With the `Restart=` property
configured on the _shairport-sync.service_ unit, _systemd_ will restart
the service, which causes Music Assistant to discover the player again.
_Music Assistant_ is pretty straightforward to deploy, despite
upstream's apparent opinion otherwise. It just needs a small persistent
volume for its media index and customization. It does need to use the
host network namespace, though, in order to receive multicast
announcements from e.g. AirPlay players, as it doesn't have any way of
statically configuring them.
Jenkins needs to be able to patch the Deployment to trigger a restart
after it builds a new container image for _dch-webhooks_.
Note that this manifest must be applied on its own **without
Kustomize**. Kustomize seems to think the `dch-webhooks` in
`resourceNames` refers to the ConfigMap it manages and "helpfully"
renames it with the name suffix hash. It's _not_ the ConfigMap, though,
but there's not really any way to tell it this.
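A sketch of the kind of Role involved, to show where `resourceNames` comes in. The Role name and namespace are assumptions, and the matching RoleBinding for the Jenkins service account is omitted:

```yaml
apiVersion: rbac.authorization.k8s.io/v1
kind: Role
metadata:
  name: dch-webhooks-restart     # hypothetical name
  namespace: default             # assumed namespace
rules:
  - apiGroups: [apps]
    resources: [deployments]
    # This refers to the Deployment, not the ConfigMap that
    # Kustomize generates under the same base name.
    resourceNames: [dch-webhooks]
    verbs: [get, patch]
```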
Without a node affinity rule, Kubernetes applies equal weight to the
"big" x86_64 nodes and the "small" aarch64 ones. Since we would really
rather Piper and Whisper _not_ run on a Raspberry Pi, we need the rule
to express this.
As it turns out, although Home Assistant itself works perfectly fine on
a Raspberry Pi, Piper and Whisper do not. They are _much_ too slow to
respond to voice commands.
This reverts commit 32666aa628.
With the introduction of the two new Raspberry Pi nodes that I intend to
be used for anything that supports running on aarch64, I'm eliminating
the `du5t1n.me/machine=raspberrypi` taint. It no longer makes sense, as
the only node that has it is the Zigbee/ZWave controller. Having
dedicated taints for those roles is much more clear.
As it turns out, it's not possible to reuse a YAML anchor. At least in
Rust's `serde_yaml`, only the final definition is used. All references,
even those that appear before the final definition, use the same
definition. Thus, each application that refers to its own URL in its
match criteria needs a unique anchor.
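To illustrate with a made-up schema (the field names here are hypothetical and not the real configuration format), each application gets its own anchor name rather than sharing one like `&url`:

```yaml
applications:
  - name: first-app
    url: &first_app_url https://first.example.org/
    match:
      instance: *first_app_url
  - name: second-app
    # A distinct anchor name, so this alias cannot pick up
    # another application's URL
    url: &second_app_url https://second.example.org/
    match:
      instance: *second_app_url
```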
_Firefly III_ and _phpipam_ don't export any Prometheus metrics, so we
have to scrape them via the Blackbox Exporter.
Paperless-ngx only exposes metrics via Flower, but since it runs in the
same container as the main application, we can assume that if the former
is unavailable, the latter is as well.
The Kubernetes root CA certificate is stored in a ConfigMap named
`kube-root-ca.crt` in every namespace. The _host-provisioner_ needs to
be able to read this ConfigMap in order to prepare control plane nodes,
as it is used by HAProxy to check the health of the API servers running
on each node.
We don't want to re-pull public container images that already exist on
the node. Doing so prevents pods from starting if there is any
connectivity issue with the upstream registry.
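In manifest terms, this is the `imagePullPolicy` field on each container. The container name and image below are placeholders:

```yaml
# Pod template fragment only
spec:
  containers:
    - name: app                              # placeholder
      image: docker.io/library/nginx:1.27    # placeholder public image
      # Only contact the registry if the image is missing locally
      imagePullPolicy: IfNotPresent
```

Setting it explicitly matters mostly for images referenced with the `latest` tag or no tag at all, since Kubernetes defaults those to `Always`.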
Home Assistant has started sending the full sensor values for weather
metrics to Prometheus, even though their precision is way beyond their
accuracy. We don't need to see 4+ decimal points for these on the
Kitchen display, so let's round the values when we query.
The `scrape-collectd` ConfigMap in the `default` namespace is used by
Victoria Metrics to identify the hosts from which it should scrape
collectd metrics. When deploying new machines that are _not_ part of
the Kubernetes cluster, we need to explicitly add them to this list.
The _host-provisioner_ can do this with an Ansible task, but it needs
the appropriate permissions to do so.
Ansible playbooks running as Jenkins jobs need to be able to access the
Secret resources containing certificates issued by _cert-manager_ in
order to install them on managed nodes. Although not all jobs do this
yet, eventually, the _cert-exporter_ will no longer be necessary, as the
_certs.git_ repository will not be used anymore.
We don't want to hard-code a namespace for the `ssh-known-hosts`
ConfigMap because that makes it less useful for other projects besides
Jenkins. Instead, we omit the namespace specification and allow
consumers to specify their own.
The _jenkins_ project doesn't have a default namespace, since it
specifies resources in both the `jenkins` and `jenkins-jobs` namespaces.
Thus, we need to create a sub-project to set the namespace for the
`ssh-known-hosts` ConfigMap.
Docker Hub has blocked ("rate limited") my IP address. Moving as much
as I can to use images from other sources. Hopefully they'll unblock me
soon and I can deploy a caching proxy.
The _k8s-worker_ Ansible role in the configuration policy now uses the
Kubernetes API to create bootstrap tokens for adding worker nodes to the
cluster. For this to work, the pod running the host-provisioner must be
associated with a service account that has the correct permissions to
create secrets and access the `cluster-info` ConfigMap.
Whisper now needs a writable location for downloading models from
Hugging Face Hub. The default location is `~/.cache/huggingface/hub`,
but this is not writable in our container. The path can be controlled
via one of several environment variables, but we're setting `HF_HOME` as
it sets the top-level directory for several related paths.
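Something along these lines, where the path and the choice of an `emptyDir` volume are assumptions (a persistent volume would avoid downloading models again after every restart):

```yaml
# Pod template fragment only
spec:
  containers:
    - name: whisper
      env:
        - name: HF_HOME
          value: /data/huggingface     # assumed writable path
      volumeMounts:
        - name: hf-home
          mountPath: /data/huggingface
  volumes:
    - name: hf-home
      emptyDir: {}                     # assumption; could be a PVC
```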
Scraping metrics from the Kubernetes API server has started taking 20+
seconds recently. Until I figure out the underlying cause, I'm
increasing the scrape timeout so that the _vmagent_ doesn't give up and
report the API server as "down."
I've completely blocked all outgoing unencrypted DNS traffic at the
firewall now, which prevents _cert-manager_ from using its default
behavior of using the authoritative name servers for its managed domains
to poll for ACME challenge DNS TXT record availability.
Fortunately, it has an option to use a recursive resolver (i.e. the
network-provided DNS server) instead.
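The relevant _cert-manager_ controller flags look like this. The container name and resolver address are placeholders; the second flag is only needed to name a specific resolver instead of the one from `/etc/resolv.conf`:

```yaml
# Fragment of the cert-manager controller Deployment
spec:
  template:
    spec:
      containers:
        - name: cert-manager-controller     # assumed container name
          args:
            # Never query the authoritative servers directly
            - --dns01-recursive-nameservers-only
            # Optional: explicit resolver (placeholder address)
            - --dns01-recursive-nameservers=192.0.2.53:53
```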
`mqtt2vl` is a relatively simple service I developed to read log
messages from an MQTT topic (i.e. those published by ESPHome devices)
and stream them to Victoria Logs over HTTPS.
The legacy alerting feature (which we never used) has been deprecated
for a long time and was removed in Grafana 11. The corresponding
configuration block must be removed from the config file or Grafana will
not start.
Authelia made breaking changes to the OIDC issuer configuration in 4.39,
specifically around what claims are present in identity tokens. Without
a claims policy set, clients will _not_ get the correct claims, which
breaks authentication and authorization in many cases (including
Kubernetes).
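As a rough sketch of what a claims policy looks like in 4.39 (the policy name, client ID, and the exact list of claims are assumptions for illustration, and the rest of each client definition is omitted):

```yaml
identity_providers:
  oidc:
    claims_policies:
      default:                     # hypothetical policy name
        # Claims to include in the ID token itself
        id_token:
          - groups
          - email
          - email_verified
          - preferred_username
          - name
    clients:
      - client_id: kubernetes      # hypothetical client ID
        claims_policy: default
```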
While I was fixing that, I went ahead and fixed a few of the other
deprecation warnings. There are still two that show up at startup, but
fixing them will be a bit more involved, it seems.
This CronJob schedules a periodic run of `restic forget`, which deletes
snapshots according to the specified retention period (14 daily, 4
weekly, 12 monthly).
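In outline, the CronJob looks something like this. The schedule, image reference, and Secret name are placeholders; only the retention flags come from the description above:

```yaml
apiVersion: batch/v1
kind: CronJob
metadata:
  name: restic-forget                # placeholder name
spec:
  schedule: "0 3 * * *"              # placeholder schedule
  concurrencyPolicy: Forbid
  jobTemplate:
    spec:
      template:
        spec:
          restartPolicy: OnFailure
          containers:
            - name: restic
              image: docker.io/restic/restic   # assumed image
              args:
                - forget
                - --keep-daily=14
                - --keep-weekly=4
                - --keep-monthly=12
              envFrom:
                # Hypothetical Secret providing RESTIC_REPOSITORY,
                # RESTIC_PASSWORD, and any storage credentials
                - secretRef:
                    name: restic-repository
```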
This task used to run on my workstation, scheduled by a systemd timer
unit. I've kept the same schedule and retention period as before. Now,
instead of relying on my PC to be on and awake, the cleanup will occur
more regularly. There's also the added benefit of getting the logs into
Loki.
Occasionally, some documents may have odd rendering errors that
prevent the archival process from working correctly. I'm less concerned
about the archive document than simply having a centralized storage for
paperwork, so enabling this "continue on soft render error" feature is
appropriate. As far as I can tell, it has no visible effect for the
documents that could not be imported at all without it.
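Paperless-ngx passes extra OCRmyPDF options through a JSON-valued environment variable, so the change is a single entry in the container's environment:

```yaml
# Container fragment only
env:
  - name: PAPERLESS_OCR_USER_ARGS
    # JSON object of additional OCRmyPDF arguments
    value: '{"continue_on_soft_render_error": true}'
```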
*unifi3.pyrocufflink.blue* has been replaced by
*unifi-nuptials.host.pyrocufflink.black*. The former was the last
Fedora CoreOS machine in use, so the entire Zincati scrape job is no
longer needed.
This is a custom-built application for managing purchase receipts. It
integrates with Firefly III to fill some of the gaps that `xactmon`
cannot handle, such as restaurant bills with tips, gas station
purchases, purchases with the HSA debit card, refunds, and deposits.
Photos of receipts can be taken directly within the application using
the User Media Web API, or uploaded as existing files. Each photo is
associated with transaction data, including date, vendor, amount, and
general notes. These data are also synchronized with Firefly whenever
possible.
By default, the _pyrocufflink_ Ansible inventory plugin ignores VMs
whose names begin with `test-`. This prevents Jenkins from failing to
apply policy to machines that it should not be managing. The host
provisioner job, though, should apply policy to those machines, so we
need to disable that filter.
The *dch-webhooks* user is used by *dch-webhooks* in order to publish
host information when a new machine triggers its _POST /host/online_
webhook. It therefore needs to be able to write to the
_host-provisioner_ queue (via the default exchange).
The *host-provisioner* user is used by the corresponding consumer to
receive the host information and initiate the provisioning process.
The *dch-webhooks* server now has a _POST /host/online_ hook that can
be triggered by a new machine when it first comes online. This hook
starts an automatic provisioning process by creating a Kubernetes Job
to run Ansible and publishing information about the host to provision
via AMQP. Thus, the server now needs access to the Kubernetes API in
order to create the Job and access to RabbitMQ in order to publish the
task parameters.
The contents of the DCH Root CA will not change, so it does not make
sense to enable the hash suffix feature for this ConfigMap. Without it,
the ConfigMap name is predictable and can be used outside of a Kustomize
project.
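In `kustomization.yaml`, the suffix can be turned off for a single generator. The ConfigMap and file names here are assumptions:

```yaml
configMapGenerator:
  - name: dch-root-ca            # assumed ConfigMap name
    files:
      - dch-root-ca.crt          # assumed file name
    options:
      # Emit the ConfigMap under exactly the name given above
      disableNameSuffixHash: true
```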
The `pg_stat_archiver_failed_count` metric is a counter, so once a WAL
archival has failed, it will increase and never return to `0`. To
ensure the alert is resolved once the WAL archival process recovers, we
need to use the `increase` function to turn it into a gauge. Finally,
we aggregate that gauge with `max_over_time` to keep the alert from
flapping if the WAL archive occurs less frequently than the scrape
interval.
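A sketch of the resulting rule, using a subquery to apply `max_over_time` to the output of `increase`. The alert name and both window sizes are assumptions:

```yaml
groups:
  - name: postgresql
    rules:
      - alert: PostgresWALArchiveFailing    # hypothetical alert name
        # increase() is nonzero only while failures are still
        # happening; max_over_time() holds that signal across the
        # outer window so the alert does not flap.
        expr: >-
          max_over_time(
            increase(pg_stat_archiver_failed_count[5m])[1h:]
          ) > 0
```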
We're using the Alpine variant of the Vaultwarden container images,
since the default variant is significantly larger and we do not need any
of the extra stuff it includes.
[ARA Records Ansible][0] is a results storage system for Ansible. It
provides a convenient UI for tracking Ansible playbooks and tasks. The
data are populated by an Ansible callback plugin.
ARA is a fairly simple Python+Django application. It needs a database
to store Ansible results, so we've connected it to the main PostgreSQL
database and configured it to connect and authenticate using mTLS.
Rather than mess with managing and distributing a static password for
ARA clients, I've configured Authelia to allow anonymous access to
post data to the ARA API from within the private network or the
Kubernetes cluster. Access to the web UI does require authentication.
[0]: https://ara.recordsansible.org/
At some point this week, the front porch camera stopped sending video.
I'm not sure exactly what happened to it, but Frigate kept logging
"Unable to read frames from ffmpeg process." I power-cycled the camera,
which resolved the issue.
Unfortunately, no alerts were generated about this situation. Home
Assistant did not consider the camera entity unavailable, presumably
because Frigate was still reporting stats about it. Thus, I missed
several important notifications. To avoid this in the future, I have
enabled the "Camera FPS" sensors for all of the cameras in Home
Assistant, and added this alert to trigger when the reported framerate
is 0.
I really also need to get alerts for log events configured, as that
would also have indicated there was an issue.
Zigbee2MQTT needs to be able to read and write to the serial device for
the ConBee II USB controller. I'm not exactly sure what changed, or how
it was able to access it before the recent update.
The _dialout_ group has GID 18 on Fedora.
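One way to grant that group to the container process is through the pod security context. This is a fragment only; mounting the device itself is not shown:

```yaml
# Pod template fragment only
spec:
  securityContext:
    supplementalGroups:
      - 18        # dialout on Fedora, which owns the serial device
```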
Vaultwarden requires basically no configuration anymore. Older versions
needed some environment variables for configuring the WebSocket server,
but as of 1.31, WebSockets are handled by the same server as HTTP, so
even that is not necessary now. The only other option that could
potentially be useful is `ADMIN_TOKEN`, but it's optional. For added
security, we can leave it unset, which disables the administration
console; we can set it later if/when we actually need that feature.
Migrating data from the old server was pretty simple. The database is
pretty small, and even the attachments and site icons don't take up much
space. All-in-all, there was only about 20 MB to move, so the copy only
took a few seconds.
Aside from moving the Vaultwarden server itself, we will also need to
adjust the HAProxy configuration to proxy requests to the Kubernetes
ingress controller.
Jenkins jobs that build Gentoo-based systems, like Aimee OS, need a
persistent storage volume for the Gentoo ebuild repository. The Job
initially populates the repository using `emerge-webrsync`, and then the
CronJob keeps it up-to-date by running `emaint sync` daily.
In addition to the Portage repository, we also need a volume to store
built binary packages. Jenkins job pods can mount this volume to make
binary packages they build available for subsequent runs.
Both of these volumes are exposed to use cases outside the cluster using
`rsync` in daemon mode. This can be useful for e.g. local builds.
The Raspberry Pi in the kitchen now has Firefox installed so we can use
it to control Home Assistant. By listing its IP address as a trusted
network, and assigning it a trusted user, it can access the Home
Assistant UI without anyone having to type a password. This is
particularly important since there's no keyboard (not even an on-screen
virtual one).
Moving the `trusted_networks` auth provider _before_ the `homeassistant`
provider changes the login screen to show a "log in as ..." dialog by
default on trusted devices. It does not affect other devices at all,
but it does make the initial login a bit easier on kiosks.
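For illustration, the provider list ends up ordered like this in `configuration.yaml`. The address and user ID are placeholders, and `allow_bypass_login` is an optional extra that is not necessarily enabled here:

```yaml
homeassistant:
  auth_providers:
    # Listed first so trusted devices get the "log in as ..." dialog
    - type: trusted_networks
      trusted_networks:
        - 192.0.2.10/32            # placeholder for the kiosk address
      trusted_users:
        192.0.2.10/32:
          - 0123456789abcdef0123456789abcdef   # placeholder user ID
      allow_bypass_login: true     # optional
    - type: homeassistant
```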
We don't need a notification about paperless not scheduling email tasks
every time there is a gap in the metric. This can happen in some
innocuous situations like when the pod restarts or if there is a brief
disruption of service. Using the `absent_over_time` function with a
range vector, we can have the alert fire only if there have been no
email tasks scheduled within the last 12 hours.
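A sketch of the rule. The alert name, metric selector, and label values are assumptions for illustration, since the real expression is not shown here:

```yaml
groups:
  - name: paperless
    rules:
      - alert: PaperlessEmailTasksMissing    # hypothetical alert name
        # Fires only if the series has had no samples at all during
        # the last 12 hours, so brief gaps are ignored.
        expr: >-
          absent_over_time(
            flower_events_total{
              task="paperless_mail.tasks.process_mail_accounts",
              type="task-received"
            }[12h]
          )
```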
It turns out this alert is not very useful, and indeed quite annoying.
Many servers can go for days or even weeks with no changes, which is
completely normal.
Since transitioning to externalIPs for TCP services, it is no longer
possible to use the HTTP-01 ACME challenge to issue certificates for
services hosted in the cluster, because the ingress controller does not
listen on those addresses. Thus, we have to switch to using the DNS-01
challenge. I had avoided using it before because of the complexity of
managing dynamic DNS records with the Samba AD server, but this was
actually pretty easy to work around. I created a new DNS zone on the
firewall specifically for ACME challenges. Names in the AD-managed zone
have CNAME records for their corresponding *_acme-challenge* labels
pointing to this new zone. The new zone has dynamic updates enabled,
which _cert-manager_ supports using the RFC2136 plugin.
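A sketch of an issuer using the RFC2136 solver. The issuer name, ACME server, name server address, TSIG key name, and Secret names are all assumptions; `cnameStrategy: Follow` is what lets the solver follow the `_acme-challenge` CNAME into the dynamic zone:

```yaml
apiVersion: cert-manager.io/v1
kind: ClusterIssuer
metadata:
  name: acme-dns01                   # hypothetical name
spec:
  acme:
    # Assuming Let's Encrypt; substitute the actual ACME directory
    server: https://acme-v02.api.letsencrypt.org/directory
    privateKeySecretRef:
      name: acme-account-key         # hypothetical Secret
    solvers:
      - dns01:
          cnameStrategy: Follow
          rfc2136:
            nameserver: 192.0.2.1:53       # placeholder for the firewall
            tsigKeyName: cert-manager      # hypothetical key name
            tsigAlgorithm: HMACSHA256      # assumed algorithm
            tsigSecretSecretRef:
              name: rfc2136-tsig           # hypothetical Secret
              key: tsig-secret-key
```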
For now, this is only enabled for _rabbitmq.pyrocufflink.blue_. I will
transition the other names soon.
Since the IP address assigned to the ingress controller is now managed
by keepalived and known to Kubernetes, the network policy needs to allow
access to it by pod namespace rather than IP address. It seems that the
former takes precedence over the latter, so even though the IP address
was explicitly allowed, traffic was not permitted because it was
destined for a Kubernetes service that was not.
Home Assistant supports unauthenticated access for certain clients using
its _trusted_networks_ auth provider. With this configuration, we allow
the desk panel to automatically sign in as the _kiosk_ user, but all
other clients must authenticate normally.
The new machines have names in the _pyrocufflink.black_ zone. We need
to trust the SSHCA certificate to sign keys for these names in order to
connect to them and manage them with Ansible.
Since _ingress-nginx_ no longer runs in the host network namespace,
traffic will appear to come from pods' internal IP addresses now.
Similarly, the network policy for Invoice Ninja needs to be updated to
allow traffic _to_ the ingress controllers' new addresses.
Clients outside the cluster can now communicate with RabbitMQ directly
on port 5671 by using its dedicated external IP address. This address
is automatically assigned to the node where RabbitMQ is running by
`keepalived`.
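On the Kubernetes side, this is just a Service with `externalIPs` set. The selector and address below are placeholders:

```yaml
apiVersion: v1
kind: Service
metadata:
  name: rabbitmq
spec:
  selector:
    app: rabbitmq                # assumed pod label
  ports:
    - name: amqps
      port: 5671
      targetPort: 5671
  externalIPs:
    # Kubernetes does not assign this address to any node itself;
    # that is the part keepalived takes care of.
    - 192.0.2.20                 # placeholder address
```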
Clients outside the cluster can now communicate with Mosquitto directly
on port 8883 by using its dedicated external IP address. This address
is automatically assigned to the node where Mosquitto is running by
`keepalived`.
Now that we have `keepalived` managing the "virtual" IP address for the
ingress controller, we can change _ingress-nginx_ to run as a Deployment
rather than a DaemonSet. It no longer needs to use the host network
namespace, as `kube-proxy` will route all traffic sent to the configured
external IP address to the controller pods. Using the _Local_ external
traffic policy disables source NAT, so nginx sees incoming traffic with
its original client address.
Running `keepalived` as a DaemonSet will allow managing floating
"virtual" IP addresses for Kubernetes services with configured external
IP addresses. The main services we want to expose outside the cluster
are _ingress-nginx_, Mosquitto, and RabbitMQ. The `keepalived` cluster
will negotiate using the VRRP protocol to determine which node should
have each external address. Using the process tracking feature of
`keepalived`, we can steer traffic directly to the node where the target
service is running.
I've created new worker nodes that are dedicated to running Longhorn
replicas. These nodes are tainted with the
`node-role.kubernetes.io/longhorn` taint, so no regular pods will be
scheduled there by default. Longhorn pods thus need to be configured
to tolerate that taint, and to be scheduled on nodes with the
similarly-named label.
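In plain pod-spec terms, the two pieces look like this. The taint effect and the empty label value are assumptions, and Longhorn's own settings or chart values may express them differently:

```yaml
# Pod template fragment only
spec:
  tolerations:
    - key: node-role.kubernetes.io/longhorn
      operator: Exists
      effect: NoSchedule                     # assumed taint effect
  nodeSelector:
    node-role.kubernetes.io/longhorn: ""     # assumed label value
```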
This will make it easier to "blow away" the RabbitMQ data volume on the
occasions when it gets into a weird state. Simply scale the StatefulSet
down to 0 replicas, delete the PVC, then scale back up. Kubernetes will
handle creating a new PVC automatically.
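The automatic re-creation comes from declaring the volume as a claim template on the StatefulSet rather than as a standalone PVC. The volume name and size here are placeholders:

```yaml
# StatefulSet fragment only
spec:
  volumeClaimTemplates:
    - metadata:
        name: data               # placeholder volume name
      spec:
        accessModes:
          - ReadWriteOnce
        resources:
          requests:
            storage: 1Gi         # placeholder size
```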
Nextcloud uses a _client-side_ (JavaScript) redirect to navigate the
browser to its `index.php`. The page it serves with this redirect is
static and will often load successfully, even if there is a problem with
the application. This causes the Blackbox exporter to record the site
as "up," even when it definitely is not. To avoid this, we can
scrape the `index.php` page explicitly, ensuring that the application is
loaded.
The _fleetlock_ server drains all pods from a node before allocating the
reboot lock to that node. Unfortunately, it doesn't actually wait for
those pods to be completely evicted. If some pods take too long to shut
down, they may get stuck in `Terminating` state once the machine starts
rebooting. This makes it so those pods cannot be replaced on another
node while the original one is offline, which pretty much defeats the
purpose of using Fleetlock in the first place.
It seems upstream has abandoned this project, as there is an open [Pull
Request][0] to fix this issue that has so far been ignored.
Fortunately, building a new container image containing the patch is easy
enough, so we can run our own patched build.
[0]: https://github.com/poseidon/fleetlock/pull/271
Just like I did with the RAID-1 array in the old BURP server, I will
keep one member active and one in the fireproof safe, swapping them each
month. We can use the same metrics queries to alert on when the swap
should happen that we used with the BURP server.
The ephemeral Jenkins worker nodes that run in AWS don't have collectd,
promtail, or Zincati. We don't need to get three alerts every time a
worker starts up to handle an ARM build job, so we drop these discovered
targets for these scrape jobs.
Paperless-ngx uses a Celery task to process uploaded files, converting
them to PDF, running OCR, etc. This task can be marked as "failed" for
various reasons, most of which are more about the document itself than
the health of the application. The GUI displays the results of failed
tasks when they occur. It doesn't really make sense to have an alert
about this scenario, especially since there's nothing to do to directly
clear the alert anyway.
https://20125.home/ is the URL the Status Android application loads in
its main WebView. This site is powered by a server that generates a
custom page showing the status of our self-hosted applications, based on
alerts retrieved from the AlertManager API.
Android WebView does not allow cleartext HTTP connections. It does,
however, allow connecting to an HTTPS server and ignoring the certificate
it presents, which is effectively the same thing. Thus, we generate a
self-signed certificate for the Ingress for this site.
Fedora CoreOS fills `/boot` beyond the 75% alert threshold under normal
circumstances on aarch64 machines. This is not a problem, because it
cleans up old files on its own, so we do not need to alert on it.
Unfortunately, the _DiskUsage_ alert is already quite complex, and
adding in exclusions for these devices would make it even worse.
To simplify the logic, we can use a recording rule to precompute the
used/free space ratio. By using `sum(...) without (type)` instead of
`sum(...) by (df, instance)`, we keep the other labels, which we can
then use to identify the metrics coming from machines we don't care to
monitor.
Instead of having different thresholds for different volumes
encoded in the same expression, we can use multiple alerts to alert on
"low" vs "very low" thresholds. Since this will of course cause
duplicate alerts for most volumes, we can use AlertManager inhibition
rules to disable the "low" alert once the metric crosses the "very low"
threshold.
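The inhibition side can be sketched like this in the AlertManager configuration. Both alert names are hypothetical, and the `equal` labels assume the alerts carry `instance` and `df`:

```yaml
inhibit_rules:
  # While the "very low" alert is firing for a volume, suppress
  # the "low" alert for that same volume.
  - source_matchers:
      - 'alertname = "DiskUsageVeryLow"'   # hypothetical alert name
    target_matchers:
      - 'alertname = "DiskUsageLow"'       # hypothetical alert name
    equal:
      - instance
      - df
```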
*loki1.pyrocufflink.blue* is a regular Fedora machine, a member of the
AD domain, and managed by Ansible. Thus, it does not need to be
explicitly listed as a scrape target.
For scraping metrics from Loki itself, I've changed the job to use
DNS-SD because it seems like `vmagent` does _not_ re-resolve host names
from static configuration.
The `flower_events_total` metric is a counter, so its value only ever
increases (discounting restarts of the server process). As such,
nonzero values do not necessarily indicate a _current_ problem, but
rather that there was one at some point in the past. To identify
current issues, we need to use the `increase` function, and then apply
the `max_over_time` function so that the alert doesn't immediately reset
itself.
The Gotenberg container image uses UID 1001 for the _gotenberg_ user.
Using any other UID number, even when the home directory is set and
owned by that UID, results in random issues, especially when using
LibreOffice conversions.
The Paperless-ngx ecosystem consists of several services. Defining the
resources for each service in separate manifest files will make
maintenance a little bit easier.
Longhorn uses a special Secret resource to configure the backup target.
This secret includes the credentials and CA certificate for accessing
the MinIO S3 service.
Longhorn must be configured to use this Secret by setting the
`backup-target-credential-secret` setting to
`minio-backups-credentials`.
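The Secret uses the key names Longhorn expects for S3-compatible targets. The namespace, endpoint, and all values below are placeholders:

```yaml
apiVersion: v1
kind: Secret
metadata:
  name: minio-backups-credentials
  namespace: longhorn-system             # assumed namespace
type: Opaque
stringData:
  AWS_ACCESS_KEY_ID: REPLACE_ME
  AWS_SECRET_ACCESS_KEY: REPLACE_ME
  AWS_ENDPOINTS: https://minio.example.org:9000   # placeholder URL
  # PEM-encoded CA certificate used to verify the endpoint
  AWS_CERT: REPLACE_WITH_PEM_CERTIFICATE
```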
I was doing this to monitor Jenkins's certificate, but since that's
managed by _cert-manager_, there's practically no risk of it
expiring without warning anymore. Since Jenkins is already being
scraped directly, having this extra check just generates extra
notifications when there is an issue without adding any real value.
Using domain names in the "blackbox" probe makes it difficult to tell
the difference between a complete Internet outage and DNS issues. I
switched to using these names when I changed how the firewall routed
traffic to the public DNS servers, since those were the IP addresses
I was using to determine if the Internet was "up." I think it makes
sense, though, to just ping the upstream gateway for that check. If
EverFast changes their routing or numbering, we'll just have to update
our checks to match.
The alerts for Z-Wave device batteries in particular are pretty
annoying, as they tend to "flap" for some reason. I like having the
alerts show up on Alertmanager/Grafana dashboards, but I don't
necessarily need notifications about them. Fortunately, we can create a
special "none" receiver and route notifications there, which does
exactly what we want here.
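A minimal sketch of the idea, assuming a hypothetical alert name
`ZWaveBatteryLow` and a default receiver called `default`:
```yaml
# Alertmanager configuration fragment (sketch).
route:
  receiver: default                 # hypothetical default receiver
  routes:
    - matchers:
        - 'alertname = "ZWaveBatteryLow"'   # hypothetical alert name
      receiver: none
receivers:
  - name: default
  # A receiver with no notifier configurations sends nothing; the
  # alert still shows up in the Alertmanager UI and API.
  - name: none
```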
Using Kustomize, we can define the configuration file separately from
the Kubernetes resources, and use `configMapGenerator` to generate the
ConfigMap for it. Additionally, this will make it possible to update
_ntfy_ using `updatebot`.
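As a sketch, the Kustomization could look something like this. The
file names are assumptions:
```yaml
# kustomization.yaml (sketch); file names are hypothetical.
apiVersion: kustomize.config.k8s.io/v1beta1
kind: Kustomization
resources:
  - ntfy.yaml
configMapGenerator:
  - name: ntfy
    files:
      - server.yml    # the ntfy configuration file, kept as-is on disk
```
Kustomize appends a content hash to the generated ConfigMap name, so
changing the file also rolls the pods that reference it.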
Tabitha wants to be able to accept Apple Pay payments via Stripe, but
this requires an additional "domain verification" step. Apple needs to
make an HTTP request to the domain owned by the vendor, which in the
case of Invoice Ninja, must be the "app URL." Unfortunately, there
does not appear to be a way to tell Apple/Stripe/IN to use the client
portal domain or any other domain besides the app URL. Therefore, we
need to expose Invoice Ninja to the Internet under the public
_pyrocufflink.net_ domain, rather than the internal _pyrocufflink.blue_.
Let's run `updatebot` on Saturday morning, so I can apply the changes
over the weekend if I have time. If I don't, there's no harm in having
the PRs open for a few days until I can get to it during the week.
Restic backups are now stored in MinIO on _chromie.pyrocufflink.blue_.
All data have been migrated from _burp1.p.b_, which is being
decommissioned.
The instance of MinIO on _chromie_ uses a certificate signed by DCH CA,
rather than the _pyrocufflink.blue_ wildcard certificate signed by
ZeroSSL. As such, we need to configure `restic` to trust the DCH Root
CA certificate in order to use the MinIO S3 API.
The latest version of `updatebot` has two major changes:
1. Projects can encompass multiple images, eliminating the need for
   multiple configuration files and CronJobs. Projects are now defined
   in a YAML document, since the data structure is very nested and is
   cumbersome to express in TOML.
2. Pull requests can now include a diff of the resources that will
   change if the PR is merged. This requires the `kubectl` and `diff`
   programs (which are not currently included in the _updatebot_
   container image, so we bind-mount them from the host) and permission
   to compare the local manifests using the Kubernetes API. Oddly,
   computing the diff requires permission to use the PATCH method, even
   though the client is not requesting any changes. This is apparently
   a long-standing bug ([issue #981][0]) that may or may not ever be
   fixed.
[0]: https://github.com/kubernetes/kubectl/issues/981
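To illustrate the permissions involved, here is a sketch of a role
that would let `kubectl diff` work. The role name and the list of
resources are assumptions; the real list depends on which manifests
are being compared:
```yaml
# RBAC fragment (sketch); name and resource list are hypothetical.
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRole
metadata:
  name: updatebot-diff
rules:
  - apiGroups: ["", apps, batch]
    resources: [configmaps, deployments, statefulsets, cronjobs]
    # "patch" is needed even though the diff is only a dry run.
    verbs: [get, list, patch]
```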
`updatebot` is a script I wrote that automatically opens Gitea Pull
Requests to update container image references in Kubernetes resource
manifests. It checks GitHub or Docker Hub for the latest release and
updates manifests or Kustomization configuration files to point to the
current version. It then commits the changes and opens a pull request
in Gitea. When combined with ArgoCD automatic synchronization, this
makes updating Kubernetes-deployed applications as simple as clicking
the merge button in the Gitea PR.
To start with, we'll automate Home Assistant upgrades this way.
This template sensor will be migrated to a helper, since Home Assistant
removed the `forecast` attribute of weather sensors and now requires
calling an action (service) to get those data.
Now that the reverse proxy that handles requests from the Internet uses
TLS pass-through, the Ingress for _ntfy_ needs to recognize both the
internal and external name.
Now that the reverse proxy for Internet-facing sites uses TLS
passthrough, the certificate for the _darkchestofwonders.us_ Ingress
needs to be correct. Since Ingress resources can only use either the
default certificate (_*.pyrocufflink.blue_) or a certificate from their
same namespace, we have to move the Certificate and its corresponding
Secret into the _websites_ namespace. Fortunately, this is easy enough
to do, by setting the appropriate annotations on the Ingress.
To keep the existing certificate (until it expires), I moved the Secret
manually:
```sh
kubectl get secret dcow-cert -o yaml | grep -v namespace | kubectl create -n websites -f -
```
The VM hosts are now managed by the "main" Ansible inventory and thus
appear in the host list ConfigMap. As such, they do not need to be
listed explicitly in the static targets list.
There's obviously a bug or something in `mqttmarionette` because it
occasionally gets "stuck" in a state where it is running but does
not reconnect to the MQTT broker. In such situations, it has to be
restarted (and even then it doesn't shut down correctly but has to
be killed with SIGKILL, usually). I have been doing this manually, but
with this shell script and a corresponding "shell command" integration
in Home Assistant, it can be done automatically. This is similar to
how Home Assistant restarts Mopidy on the living room stereo when it
gets into the same kind of state.
Some machines have the same volume mounted multiple times (e.g.
container hosts, BURP). Alerts will fire for all of these
simultaneously when the filesystem usage passes the threshold. To avoid
getting spammed with a bunch of messages about the same filesystem,
we'll group alerts from the same machine.
I'm not using Matrix for anything anymore, and it seems to have gone
offline. I haven't fully decommissioned it yet, but the Blackbox scrape
is failing, so I'll just disable that bit for now.
This machine never worked correctly; the USB-RS232 adapters would stop
working randomly (and of course it would be whenever I needed to
actually use them). I thought it was something wrong with the server
itself (a Raspberry Pi 3), but the same thing happened when I tried
using a Pi 4.
The new backup server has a plethora of on-board RS-232 ports, so I'm
going to use it as the serial console server, too.
I've rebuilt the Unifi Network controller machine (again);
*unifi3.pyrocufflink.blue* has replaced *unifi2.p.b*. The
`unifi_exporter` no longer works with the latest version of Unifi
Network, so it's not deployed on the new machine.
Zigbee2MQTT commits the cardinal sin of storing state in its
configuration file. This means the file has to be writable and thus
stored in persistent storage rather than in a ConfigMap. As a
consequence, making changes to the configuration when the application is
not running is rather difficult. Case in point: when I added the
internal alias for _mqtt.pyrocufflink.blue_ pointing to the in-cluster
service, Zigbee2MQTT became unable to connect to the broker because it
was using the node port instead of the internal port. Since it could
not connect to the broker, it refused to start, and thus the container
would not stay running long enough to fix the configuration to point
to the correct port.
Fortunately, Zigbee2MQTT also allows configuring settings via
environment variables, which can be managed with a ConfigMap. Luckily,
the values read from environment variables override those from the
configuration file, so pointing to the correct broker port with the
environment variable was sufficient to allow the application to start.
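Zigbee2MQTT maps environment variables named
`ZIGBEE2MQTT_CONFIG_<SECTION>_<KEY>` onto its settings. A sketch of
the ConfigMap, where the resource name, URL scheme, and port are
assumptions:
```yaml
# Sketch only; name, scheme, and port are hypothetical.
apiVersion: v1
kind: ConfigMap
metadata:
  name: zigbee2mqtt-env
data:
  # Overrides mqtt.server from configuration.yaml
  ZIGBEE2MQTT_CONFIG_MQTT_SERVER: mqtts://mqtt.pyrocufflink.blue:8883
```
The container then picks it up with an `envFrom` entry that
references the ConfigMap.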
Having name overrides for in-cluster services breaks ACME challenges,
because the server tries to connect to the Service instead of the
Ingress. To fix this, we need to configure both _cert-manager_ and
_step-ca_ to *only* resolve names using the network-wide DNS server.
It turns out, `step ca renew` _can_ renew certificates without mTLS; it
has a `--mtls=false` command-line argument that configures it to use
a JWT signed by the certificate, instead of using the certificate at
the transport layer. This allows clients to renew their certificates
without needing another authentication mechanism, even with the
TLS-terminating proxy.
Invoice Ninja allows attaching documents to invoices, payments,
expenses, etc. Tabitha wants to use this feature to attach receipts for
her expenses, but the photos her phone takes of them are too large for
the default nginx client body limit. We can raise this limit on the
ingress, but we also need to raise it on the "inner" nginx.
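On the ingress side, this is a single annotation. The 50m value here
is only an example, not necessarily the limit that was chosen:
```yaml
# Ingress metadata fragment (sketch); the size is an example value.
metadata:
  annotations:
    # Sets client_max_body_size for this Ingress in ingress-nginx.
    nginx.ingress.kubernetes.io/proxy-body-size: 50m
```
The "inner" nginx needs the equivalent `client_max_body_size`
directive in its own configuration.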
The Invoice Ninja container is not designed to be immutable at all; it
makes a bunch of changes to its own contents when it starts up.
Notably, it copies the contents of the `public` and `storage`
directories from the container image to the persistent volume _and then
deletes the source_. Additionally, being a Laravel application, it
needs write access to its own code for caching, etc. Previously, the
`init.sh` script copied the entire `app` directory to a temporary
directory, and then the runtime container mounted that volume over the
top of the original location. This allowed the root filesystem of the
container to be read-only, while the `app` directory was still mutable.
Unfortunately, this makes the startup process incredibly slow, as it
takes a couple of minutes to copy the whole application. It's also
pretty pointless, because the application runs as an unprivileged
process, so it wouldn't have write access to the rest of the filesystem
anyway. As such, I've decided to remove the `readOnlyRootFilesystem`
restriction, and allow the container to run as upstream intends, albeit
begrudgingly.
In-cluster services can now get certificates signed by the DCH CA via
`step-ca`. This issuer uses ACME with the HTTP-01 challenge, so it
can only issue certificates for names in the _pyrocufflink.blue_ zone
that point to the ingress controllers.
Passing port 5671 through the ingress-nginx proxy to the `rabbitmq`
service will allow clients outside the cluster to connect to it.
While we're at it, we'll move the definition of the `tcp-services`
ConfigMap to its own file to make it easier to maintain.
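The ConfigMap maps an external port to a
`<namespace>/<service>:<port>` target. A sketch, assuming both
namespaces shown here:
```yaml
# Sketch only; both namespaces are assumptions.
apiVersion: v1
kind: ConfigMap
metadata:
  name: tcp-services
  namespace: ingress-nginx
data:
  # external port: namespace/service:port
  "5671": rabbitmq/rabbitmq:5671
```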
RabbitMQ is an AMQP message broker. It will be used by `xactmon` to
pass messages between the components.
Although RabbitMQ can be deployed in a high-availability cluster, we
don't really need that level of robustness for `xactmon`, so we will
just run a single instance. Deploying a single-host RabbitMQ server
is pretty straightforward.
We're using mTLS authentication; clients need to have a certificate
issued by the *RabbitMQ CA* in order to connect to the message broker.
The `rabbitmq-ca` _cert-manager_ ClusterIssuer issues these certificates
for in-cluster services like `xactmon`.
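A client certificate request for `xactmon` might look roughly like
this. The resource, Secret, and common names are assumptions:
```yaml
# cert-manager Certificate (sketch); names are hypothetical.
apiVersion: cert-manager.io/v1
kind: Certificate
metadata:
  name: xactmon-rabbitmq
spec:
  secretName: xactmon-rabbitmq-cert
  commonName: xactmon
  usages:
    - client auth
  issuerRef:
    kind: ClusterIssuer
    name: rabbitmq-ca
```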
`xactmon` is a new tool I developed to parse transaction notifications
from banks and automatically import them into my personal finance
tracker. It is designed in a modular fashion, composed of three main
components:
* Receiver
* Processor
* Importer
Components communicate with one another using an AMQP exchange.
Hypothetically, there could be multiple implementations of the receiver
and importer components. Right now, there is only a JMAP receiver,
which fetches email messages (from Fastmail), and a Firefly III
importer. The processor is a singleton, handling notifications from the
receiver, parsing them into a normalized format, and passing them on to
the importer. It uses a set of rules to decide how to parse the
messages, and supports using either a regular expression with named
capture groups or an Awk script to extract the relevant information.
The `xactfetch` script now uses a helper tool, `secretsocket`, to
handle looking up secrets. This tool supports various secret source
types, including files, environment variables, and external commands.
Separating this functionality out of the main script makes it a lot
more flexible and pluggable. Its main purpose, though, was actually
to allow `xactfetch` to run in a container while communicating with
`rbw` outside that container, specifically for development purposes.
The `secretsocket` tool reads its configuration from a TOML document.
This document defines the secrets the tool handles, and how to look
them up.
Note that the `xactfetch` container image no longer defines the
`XDG_CONFIG_HOME` environment variable, as it uses Chromium instead of
Firefox now, and the former does not work with a read-only config
directory. As such, we have to mount the `rbw` configuration in the
default location.
Usually, `xactfetch` will only fail for one bank or the other. Rarely
do we want to redownload the data from both banks just because one
failed. The latest version of `xactfetch` supports specifying a bank
name as a CLI argument, so now we can define separate jobs for each
bank. Then, when one Job fails, only that one will be retried later.
It's kind of a bummer that it's so repetitive to define two CronJobs
that differ by only a single command-line argument. I suppose that's
a good argument for using one of the preprocessor tools like Jsonnet
or KCL.
When the `xactfetch` CronJob is triggered manually, it will now skip
the `sleep` step. Presumably, whoever triggered it wants the script
to run _right now_, probably to diagnose a problem.
After the incident this week with the CPU overheating on _vmhost1_, I
want to make sure I know as soon as possible when anything is starting
to get too hot.
When Frigate is down, multiple alerts are generated for each camera, as
Home Assistant creates camera entities for each tracked object. This is
extremely annoying, not to mention unnecessary. To address this, we'll
configure AlertManager to send a single notification for alerts in the
group.
The [postgres exporter][0] exposes metrics about the operation and
performance of a PostgreSQL server. It's currently deployed on
_db0.pyrocufflink.blue_, the primary server of the main PostgreSQL
cluster.
[0]: https://github.com/prometheus-community/postgres_exporter
Home Assistant uses PostgreSQL for recording the history of entity
states. Since we had been using the in-cluster database server for
this, the data were migrated to the new external PostgreSQL server
automatically when the backup from the former was restored on the
latter. It follows, then, that we can point Home Assistant to the
new server as well.
Home Assistant uses SQLAlchemy, which in turn uses _libpq_ via
_psycopg_, as a client for PostgreSQL. It doesn't expose any
configuration parameters beyond the "database URL" directly, but we
can use the standard environment variables to specify the certificate
and private key for authentication. In fact, the empty `postgresql://`
URL is sufficient, and indicates that _all_ of the connection parameters
should be taken from environment variables. This makes specifying the
parameters for both the `wait-for-db` init container and the main
container take the exact same environment variables, so we can use
YAML anchors to share their definitions.
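A sketch of the anchor trick, using the standard _libpq_ environment
variables. The host name, database/user names, and file paths are
assumptions:
```yaml
# Pod spec fragment (sketch); values and paths are hypothetical.
initContainers:
  - name: wait-for-db
    env: &pgenv
      - name: PGHOST
        value: db.example.com
      - name: PGDATABASE
        value: homeassistant
      - name: PGUSER
        value: homeassistant
      - name: PGSSLMODE
        value: verify-full
      - name: PGSSLROOTCERT
        value: /run/dch-ca/dch-root-ca.crt
      - name: PGSSLCERT
        value: /run/secrets/postgresql/tls.crt
      - name: PGSSLKEY
        value: /run/secrets/postgresql/tls.key
containers:
  - name: home-assistant
    # Same variables as the init container, via the YAML alias.
    env: *pgenv
```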
Since the new database server outside the Kubernetes cluster, created
for Authelia, was seeded from a backup of the in-cluster server, it
already contained the data from Firefly-III as well. Thus, we can
switch Firefly-III to using it, too.
The documentation for Firefly-III does not mention anything about how
to configure it to use certificate-based authentication for PostgreSQL,
as is required by the new server. Fortunately, it ultimately uses
_libpq_, so the standard `PG...` environment variables work fine. We
just need a certificate issued by the _postgresql-ca_ ClusterIssuer and
the _DCH Root CA_ certificate mounted in the Firefly-III container.
If there is an issue with the in-cluster database server, accessing the
Kubernetes API becomes impossible by normal means. This is because the
Kubernetes API uses Authelia for authentication and authorization, and
Authelia relies on the in-cluster database server. To solve this
chicken-and-egg scenario, I've set up a dedicated PostgreSQL database
server on a virtual machine, totally external to the Kubernetes cluster.
With this commit, I have changed the Authelia configuration to point at
this new database server. The contents of the new database server were
restored from a backup from the in-cluster server, so all of Authelia's
state was migrated automatically. Thus, updating the configuration is
all that is necessary to switch to using it.
The new server uses certificate-based authentication. In order for
Authelia to access it, it needs a certificate issued by the
_postgresql-ca_ ClusterIssuer, managed by _cert-manager_. Although the
environment variables for pointing to the certificate and private key
are not listed explicitly in the Authelia documentation, their names
can be inferred from the configuration document schema and work as
expected.
All the Kubernetes nodes (except *k8s-ctrl0*) are now running Fedora
CoreOS. We can therefore use the Kubernetes API to discover scrape
targets for the Zincati job.
I've created a _Pool Time_ calendar in Nextcloud that we can use to
mark when people are expected to be in the pool. Using this, we can
configure the "someone is in the pool" alert not to fire during times
when we know people will be in the pool. This will make it much less
annoying on HLC pool days.
One of the reasons for moving to 4 `vmstorage` replicas was to ensure
that the load was spread evenly between the physical VM host machines.
To ensure that is the case as much as possible, we need to keep one
pod per Kubernetes node.
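One way to express this is a pod anti-affinity rule keyed on the
node's hostname. This is a sketch; the label is an assumption, and a
`preferredDuringSchedulingIgnoredDuringExecution` rule would be the
softer variant:
```yaml
# StatefulSet fragment (sketch); the label selector is hypothetical.
spec:
  template:
    spec:
      affinity:
        podAntiAffinity:
          requiredDuringSchedulingIgnoredDuringExecution:
            - labelSelector:
                matchLabels:
                  app.kubernetes.io/name: vmstorage
              # At most one matching pod per node.
              topologyKey: kubernetes.io/hostname
```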
Longhorn does not work well for very large volumes. It takes ages to
synchronize/rebuild them when migrating between nodes, which happens
all too frequently. This consumes a lot of resources, which impacts
the operation of the rest of the cluster, and can cause a cascading
failure in some circumstances.
Now that the cluster is set up to be able to mount storage directly from
the Synology, it makes sense to move the Victoria Metrics data there as
well. Similar to how I did this with Jenkins, I created
PersistentVolume resources that map to iSCSI volumes, and patched the
PersistentVolumeClaims (or rather the template for them defined by the
StatefulSet) to use these. Each `vmstorage` pod then gets an iSCSI
LUN, bypassing both Longhorn and QEMU to write directly to the NAS.
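A sketch of one such PersistentVolume. The name, size, portal
address, IQN, and LUN number are all placeholders:
```yaml
# Sketch only; every value below is a placeholder.
apiVersion: v1
kind: PersistentVolume
metadata:
  name: vmstorage-0
spec:
  capacity:
    storage: 50Gi
  accessModes:
    - ReadWriteOnce
  persistentVolumeReclaimPolicy: Retain
  iscsi:
    targetPortal: nas.example.com:3260
    iqn: iqn.2000-01.com.synology:nas.vmstorage-0
    lun: 1
    fsType: ext4
```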
The migration process was relatively straightforward. I started by
scaling down the `vminsert` Deployment so the `vmagent` pods would
queue the metrics they had collected while the storage layer was down.
Next, I created a [native][0] export of all the time series in the
database. Then, I deleted the `vmstorage` StatefulSet and its
associated PVCs. Finally, I applied the updated configuration,
including the new PVs and patched PVCs, and brought the `vminsert`
pods back online. Once everything was up and running, I re-imported
the exported data.
[0]: https://docs.victoriametrics.com/Single-server-VictoriaMetrics.html#how-to-export-data-in-native-format
Since all the nodes in the cluster run Fedora CoreOS now, we can
deploy collectd as a container, managed by a DaemonSet.
Note that while _collectd_ has to run as _root_ in order to collect
a lot of metrics, it should not run with all privileges. It does need
to run as a "super-privileged container" (`spc_t` SELinux domain), but
it does _not_ need most kernel capabilities.
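In container terms, that translates to something like the following
sketch. Dropping `ALL` is an assumption here; the real manifest may
add back the few capabilities that specific plugins need:
```yaml
# Container securityContext fragment (sketch).
securityContext:
  runAsUser: 0
  privileged: false
  seLinuxOptions:
    type: spc_t          # "super-privileged container" SELinux domain
  capabilities:
    drop:
      - ALL              # assumed; add back only what plugins require
```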
By default, Kubernetes waits for each pod in a StatefulSet to become
"ready" before starting the next one. If there is a problem starting
that pod, e.g. data corruption, then the others will never start. This
sort of defeats the purpose of having multiple replicas. Fortunately,
we can configure the pod management policy to start all the pods at
once, regardless of the status of any individual pod. This way, if
there is a problem with the first pod, the others will still come up
and serve whatever data they have.
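The relevant setting is a single field on the StatefulSet spec:
```yaml
# StatefulSet fragment; only the relevant field is shown.
apiVersion: apps/v1
kind: StatefulSet
spec:
  # The default is OrderedReady, which starts pods one at a time.
  podManagementPolicy: Parallel
```
Note that this field cannot be changed on an existing StatefulSet, so
the resource has to be deleted and recreated for it to take effect.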
The [restic-exporter][0] exposes metrics about Restic snapshots as
Prometheus metrics. This gives us data similar to what we have for
BURP backups. Chiefly important among the metrics are last backup time
and size, which we can use to determine if backups are working
correctly.
[0]: https://github.com/ngosang/restic-exporter
The digital photo frame in the kitchen is powered by a server service,
which exposes a minimal HTTP API. Using this API, we can e.g. advance
or backtrack the displayed photo. Exposing `rest_command` services
for these operations allows us to add buttons to dashboards to control
the frame.
We don't need to explicitly specify every single host individually.
Domain controllers, for example, are registered in DNS with SRV records.
Kubernetes nodes, of course, can be discovered using the Kubernetes API.
Both of these classes of nodes change frequently, so discovering them
dynamically is convenient.
Instead of routing iSCSI traffic from the Kubernetes network, through
the firewall, to the storage network, nodes now have a second network
adapter connected directly to the storage network. The nodes with
such an adapter are labelled `network.du5t1n.me/storage`, so we can pin
the Jenkins PersistentVolume to them via a node affinity rule.
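A sketch of the affinity rule on the PersistentVolume. Since only the
presence of the label matters here, the `Exists` operator is used:
```yaml
# PersistentVolume fragment (sketch).
spec:
  nodeAffinity:
    required:
      nodeSelectorTerms:
        - matchExpressions:
            - key: network.du5t1n.me/storage
              operator: Exists
```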
Using a volume claim template to define the persistent volume claim for
the Redis pod has two advantages: first, it enables using clustered
Redis, if we decide that becomes necessary, and second, it makes
deleting and recreating the volume easier in the case of data
corruption. Simply scale down the StatefulSet to 0, delete the PVC, and
scale the StatefulSet back up.
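A sketch of the template, assuming a StatefulSet named `redis`, a
claim named `data`, and a 1 GiB volume:
```yaml
# StatefulSet fragment (sketch); names and size are hypothetical.
apiVersion: apps/v1
kind: StatefulSet
metadata:
  name: redis
spec:
  volumeClaimTemplates:
    - metadata:
        name: data
      spec:
        accessModes:
          - ReadWriteOnce
        resources:
          requests:
            storage: 1Gi
```
With these names, the first pod's claim would be `data-redis-0`,
which is the PVC to delete while the StatefulSet is scaled to 0.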
By default, step-ca issues certificates that are valid for only one day.
This means that clients need to have multiple renew attempts scheduled
throughout the day; otherwise, missing one could mean having their
certificates expire. This is unnecessary, and not even possible in all
cases, so let's make the default validity period longer and avoid the
issue.
Since I added an IPv6 ULA prefix to the "main" VLAN (to allow
communicating with the Synology directly), the domain controllers now
have AAAA records. This causes the `sambadc` scrape job to fail because
Blackbox Exporter prefers IPv6 by default, but Kubernetes pods do not
have IPv6 addresses.
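The fix is to tell the Blackbox Exporter module to prefer IPv4. A
sketch, assuming a TCP prober and a made-up module name:
```yaml
# Blackbox Exporter configuration fragment (sketch).
modules:
  tcp_connect_v4:              # hypothetical module name
    prober: tcp                # assumed prober type
    tcp:
      preferred_ip_protocol: ip4
      ip_protocol_fallback: false
```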
Managing the Jenkins volume with Longhorn has become increasingly
problematic. Because of its large size, whenever Longhorn needs to
rebuild/replicate it (which happens often for no apparent reason), it
can take several hours. While the synchronization is happening, the
entire cluster suffers from degraded performance.
Instead of using Longhorn, I've decided to try storing the data directly
on the Synology NAS and expose it to Kubernetes via iSCSI. The Synology
offers many of the same features as Longhorn, including
snapshots/rollbacks and backups. Using the NAS allows the volume to be
available to any Kubernetes node, without keeping multiple copies of
the data.
In order to expose the iSCSI service on the NAS to the Kubernetes nodes,
I had to make the storage VLAN routable. I kept it as IPv6-only,
though, as an extra precaution against unauthorized access. The
firewall only allows nodes on the Kubernetes network to access the NAS
via iSCSI.
I originally tried proxying the iSCSI connection via the VM hosts;
however, this failed because of how iSCSI target discovery works. The
provided "target host" is really only used to identify available LUNs;
follow-up communication is done with the IP address returned by the
discovery process. Since the NAS would return its IP address, which
differed from the proxy address, the connection would fail. Thus, I
resorted to reconfiguring the storage network and connecting directly
to the NAS.
To migrate the contents of the volume, I temporarily created a PVC with
a different name and bound it to the iSCSI PersistentVolume. Using a
pod with both the original PVC and the new PVC mounted, I used `rsync`
to copy the data. Once the copy completed, I deleted the Pod and both
PVCs, then created a new PVC with the original name (i.e. `jenkins`),
bound to the iSCSI PV. While doing this, Longhorn, for some reason,
kept re-creating the PVC whenever I would delete it, no matter how I
requested the deletion. Whether I deleted the PV, the PVC, or the
Volume, using either the Kubernetes API or the Longhorn UI, it would get
recreated almost immediately. Fortunately, there was actually enough of
a delay after deleting it before Longhorn would recreate it that I was
able to create the new PVC manually. Once I did that, Longhorn seemed
to give up.
Kitchen v0.5 has a few changes that affect the deployment:
* The Bored Board is now backed by MQTT
* The pool temperature is now displayed in the weather pane
* The container image is now based on Fedora and includes its own time
  zone database and root CA bundle
* The websocket server prevents the process from stopping correctly
  unless the graceful shutdown feature of `uvicorn` is disabled
[fleetlock] is an implementation of the Zincati FleetLock reboot
coordination protocol. It only works for machines that are Kubernetes
nodes, but it does enable safe rolling updates for those machines.
Specifically, when a node acquires a lock (backed by a Kubernetes
Lease), it cordons that node and evicts pods from it. After the node
has rebooted into the new version of Fedora CoreOS, it uncordons the
node and releases the lock.
[fleetlock]: https://github.com/poseidon/fleetlock
Vaultwarden has started prompting for the master password occasionally
when syncing the vault. Thus, we need to make sure it is available in
the _sync_ container, by mounting the secret and providing the
`PINENTRY_PASSWORD_FILE` environment variable.
Just having the alert name and group name in the ntfy notification is
not enough to really indicate what the problem is, as some alerts can
generate notifications for many reasons. In the email notifications
AlertManager sends by default, the values (but not the keys) of all
labels are included in the subject, so we will reproduce that here.
I don't like having alerts sent by e-mail. Since I don't get e-mail
notifications on my watch, I often do not see alerts for quite some
time. They are also much harder to read in an e-mail client (Fastmail
web and K-9 Mail both display them poorly). I would much rather have
them delivered via _ntfy_, just like all the rest of the ephemeral
notifications I receive.
Fortunately, it is easy enough to integrate Alertmanager and _ntfy_
using the webhook notifier in Alertmanager. Since _ntfy_ does not
natively support the Alertmanager webhook API, though, a bridge is
necessary to translate from one data format to the other. There are a
few options for this bridge, but I chose
[alexbakker/alertmanager-ntfy][0] because it looked the most complete
while also having the simplest configuration format. Sadly, it does not
expose any Prometheus metrics itself, and since it's deployed in the
_victoria-metrics_ namespace, it needs to be explicitly excluded from
the VMAgent scrape configuration.
[0]: https://github.com/alexbakker/alertmanager-ntfy
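On the Alertmanager side, the integration is just a webhook receiver
pointed at the bridge. In this sketch, the receiver name and the
Service URL (host, port, and path) are assumptions:
```yaml
# Alertmanager configuration fragment (sketch).
receivers:
  - name: ntfy
    webhook_configs:
      # Hypothetical in-cluster URL of the alertmanager-ntfy bridge.
      - url: http://alertmanager-ntfy:8000/hook
route:
  receiver: ntfy
```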
Although most libraries support ED25519 signatures for X.509
certificates, Firefox does not. This means that any certificate signed
by DCH CA R3 cannot be verified by the browser and thus will always
present a certificate error.
I want to migrate internal services that do not need certificates
that are trusted by default (i.e. they are only accessed programmatically
or only I use them in the browser) back to using an internal CA instead
of the public *pyrocufflink.net* wildcard certificate. For applications
like Frigate and UniFi Network, these need to be signed by a CA that
the browser will trust, so the ED25519 certificate is inappropriate.
Thus, I've decided to migrate back to DCH CA R2, which uses an ECDSA
signature, and can therefore be trusted by Firefox, etc.
The *hlcforms* application handles form submissions for the Hatch
Learning Center website. It has various features for Tabitha that are
only accessible internally, but the form submission handler itself of
course needs to be accessible anonymously.
A recent version of *Authelia* added a dark theme. Setting the `theme`
option to `auto` enables it when the user agent has the "prefers dark
mode" hint enabled.
Patroni, a component of the *postgres operator*, exports metrics about
the PostgreSQL database servers it manages. Notably, it provides
information about the current transaction log location for each server.
This allows us to monitor and alert on the health of database replicas.
The *promtail* job scrapes metrics from all the hosts running Promtail.
The static targets are Fedora CoreOS nodes that are not part of the
Kubernetes cluster.
The relabeling rules ensure that both the static targets and the
targets discovered via the Kubernetes Node API use the FQDN of the host
as the value of the *instance* label.
Running Promtail in a pod controlled by a DaemonSet allows it to access
the Kubernetes API via a ServiceAccount token. Since it needs the API
to discover the Pods running on the current node and find their log
files, this makes the authentication process a lot simpler.
I discovered today that if anonymous Grafana users have Viewer
permission, they can use the Datasource API to make arbitrary queries
to any backend, even if they cannot access the Explore page directly.
This is documented ([issue #48313][0]) as expected behavior.
I don't really mind giving anonymous access to the Victoria Metrics
datasource, but I definitely don't want anonymous users to be able to
make Loki queries and view log data. Since Grafana Datasource
Permissions is limited to Grafana Enterprise and not available in
the open source version of Grafana, the official recommendation from
upstream is to use a separate Organization for the Loki datasource.
Unfortunately, this would preclude having dashboards that have graphs
from both data sources. Although I don't have any of those right now, I
like the idea and may build some eventually.
Fortunately, I discovered the `send_user_header` Grafana configuration
option. With this enabled, Grafana will send an `X-Grafana-User` header
with the username of the user on whose behalf it is making a request to
the backend. If the user is not logged in, it does not send the header.
Thus, we can detect the presence of this header on the backend and
refuse to serve query requests if it is missing.
[0]: https://github.com/grafana/grafana/issues/48313
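Since Grafana maps `GF_<SECTION>_<KEY>` environment variables onto
its configuration file, the option (which lives in the `[dataproxy]`
section) can be enabled from the container spec. A sketch:
```yaml
# Grafana container fragment (sketch).
env:
  # Equivalent to send_user_header = true under [dataproxy].
  - name: GF_DATAPROXY_SEND_USER_HEADER
    value: "true"
```
The check for the `X-Grafana-User` header itself happens on the
backend side, e.g. in the reverse proxy in front of Loki.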
Usually, Grafana data sources are configured using its web GUI. When
setting up a data source that requires TLS client authentication, the
client certificate and private key have to be pasted into the form.
For certificates that renew frequently, this method would require
frequent manual effort. Fortunately, Grafana supports defining
data sources via its "provisioning" mechanism, reading the configuration
from YAML files on the filesystem.
The Loki CA is used to issue client certificates for Grafana Loki. This
_cert-manager_ ClusterIssuer will allow applications running in
Kubernetes (e.g. Grafana) to request a Certificate that they can use to
access the Loki HTTP API.
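A CA-backed ClusterIssuer is quite small. In this sketch, the issuer
and Secret names are assumptions:
```yaml
# cert-manager ClusterIssuer (sketch); names are hypothetical.
apiVersion: cert-manager.io/v1
kind: ClusterIssuer
metadata:
  name: loki-ca
spec:
  ca:
    # Secret holding the CA certificate and private key; it must
    # live in cert-manager's cluster resource namespace.
    secretName: loki-ca
```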
I never ended up using _Step CA_ for anything, since I was initially
focused on the SSH CA feature and I was unhappy with how it worked
(which led me to write _SSHCA_). I didn't think about it much until I
was working on deploying Grafana Loki. For that project, I wanted to
use a certificate signed by a private CA instead of the wildcard
certificate for _pyrocufflink.blue_. So, I created *DCH CA R3* for that
purpose. Then, for some reason, I used the exact same procedure to
fetch the certificate from Kubernetes as I had set up for the
_pyrocufflink.blue_ wildcard certificate, as used by Frigate. This of
course defeated the purpose, since I could have just as easily used
the wildcard certificate in that case.
When I discovered that Grafana Loki expects to be deployed behind a
reverse proxy in order to implement access control, I took the
opportunity to reevaluate the certificate issuance process. Since a
reverse proxy is required to implement the access control I want (anyone
can push logs but only authenticated users can query them), it made
sense to choose one with native support for requesting certificates via
ACME. This would eliminate the need for `fetchcert` and the
corresponding Kubernetes API token. Thus, I ended up deciding to
redeploy _Step CA_ with the new _DCH CA R3_ for this purpose.
Grafana Loki is hosted on a VM named *loki0.pyrocufflink.blue*. It runs
Fedora CoreOS, so in addition to scraping Loki itself, we need to scrape
_collectd_ and _Zincati_ as well.
Apparently, I never bothered to check that the Kitchen HUD server was
actually fetching data from Victoria Metrics when I updated it before; I
only verified that the Unauthorized errors in the `vmselect` log
went away. They did, but only because now the Kitchen server was
failing to contact `vmselect` at all.
I did not realize the batteries on the garage door tilt sensors had
died. Adding alerts for various sensor batteries should help keep me
better informed.
Sometimes, I want to be able to look at active alerts without logging
in. This rule allows read-only access to the AlertManager UI and API.
Unfortunately, the user experience when attempting to create a new
Silence using the UI without first logging in is suboptimal, but I think
that's worth the trade-off.
The Longhorn volume for the *invoice-ninja* PVC got into a strange state
following an unexpected shutdown this morning. One of its replicas
seemed to have disappeared, and it also thought that the size had
changed. As such, it got stuck in the "expanding" state, but it was not
actually being expanded. This issue is described in detail in the
Longhorn documentation: [Troubleshooting: Unexpected expansion leads to
degradation or attach failure][0]. Unfortunately, there is no way to
recover a volume from that state, and it must be deleted and recreated
from backup. This changes some of the properties of the PVC, so they
need to be updated in the manifest.
[0]: https://longhorn.io/kb/troubleshooting-unexpected-expansion-leads-to-degradation-or-attach-failure/
Jenkins jobs that build container images need access to `/dev/fuse`.
Thus, we have to allow Pods managed by the *fuse-device-plugin*
DaemonSet to be scheduled on nodes that are tainted for use exclusively
by Jenkins jobs.
Members of the *Server Admins* group need to be able to log in to
machines using their respective privileged accounts for e.g.
provisioning or emergencies.
Graylog is down because Elasticsearch corrupted itself again, and this
time, I'm just not going to bother fixing it. I practically never use
it anymore anyway, and I want to migrate to Grafana Loki, so now seems
like a good time to just get rid of it.
The configuration file for the kitchen HUD server has credentials
embedded in it. Until I get around to refactoring it to read these from
separate locations, we'll make use of the template feature of
SealedSecrets. With this feature, fields can refer to the (decrypted)
value of other fields using Go template syntax. This makes it possible
to have most of the `config.yaml` document unencrypted and easily
modifiable, while still protecting the secrets.
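
The idea can be illustrated with a small Python analogue. This is not
how the SealedSecrets controller works internally, and it uses Python's
`string.Template` in place of Go templates. The configuration keys, the
URL, and the `metrics_password` field name are all made up for this
example.

```python
from string import Template

# Mostly plaintext document; only the secret value is a placeholder.
CONFIG_TEMPLATE = Template(
    "metrics:\n"
    "  url: https://metrics.example.invalid\n"
    "  username: kitchen\n"
    "  password: ${metrics_password}\n"
)


def render_config(decrypted: dict) -> str:
    """Fill placeholders with decrypted values; raises KeyError if one is missing."""
    return CONFIG_TEMPLATE.substitute(decrypted)
```

Only the value behind `metrics_password` would need to be encrypted.
The rest of the document stays readable and can be edited without
resealing anything.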
Now that Victoria Metrics is hosted in Kubernetes, it only makes sense
to host Grafana there as well. I chose to use a single-instance
deployment for simplicity; I don't really need high availability for
Grafana. Its configuration does not change enough to worry about the
downtime associated with restarting it. Migrating the existing data
from SQLite to PostgreSQL, while possible, is just not worth the hassle.
Invoice Ninja is a small business management tool. Tabitha wants to
use it for HLC.
I am a bit concerned about the code quality of this application, and
definitely alarmed at the data it sends upstream, so I have tried to be
extra careful with it. All privileges are revoked, including access to
the Internet.
The `update-machine-ids.sh` shell script helps update the `sshca-data`
SealedSecret with the current contents of the `machine-ids.json` file
(stored locally, not tracked in Git).
*vmalert* has been generating alerts and triggering notifications, but
not writing any `ALERTS`/`ALERTS_FOR_STATE` metrics. It turns out this
is because I had not correctly configured the remote read/write
URLs.
If Frigate is running but not connected to the MQTT broker, the
`sensor.frigate_status` entity will be available, but the
`update.frigate_server` entity will not.
# You can configure the database connection by specifying type, host, name, user and password
# as separate properties or as one string using the url property.
# Either "mysql", "postgres" or "sqlite3", it's your choice
type=sqlite3
host=127.0.0.1:3306
name=grafana
user=root
# If the password contains # or ; you have to wrap it with triple quotes. Ex """#password;"""
password=
# Use either URL or the previous fields to configure the database
# Example: mysql://user:secret@host:port/database
url=
# Max idle conn setting default is 2
max_idle_conn=2
# Max conn setting default is 0 (means not set)
max_open_conn=
# Connection Max Lifetime default is 14400 (means 14400 seconds or 4 hours)
conn_max_lifetime=14400
# Set to true to log the sql calls and execution times.
log_queries=
# For "postgres", use either "disable", "require" or "verify-full"
# For "mysql", use either "true", "false", or "skip-verify".
ssl_mode=disable
ca_cert_path=
client_key_path=
client_cert_path=
server_cert_name=
# For "sqlite3" only, path relative to data_path setting
path=grafana.db
# For "sqlite3" only. cache mode setting used for connecting to the database
cache_mode=private
#################################### Cache server #############################
[remote_cache]
# Either "redis", "memcached" or "database" default is "database"
type=database
# cache connectionstring options
# database: will use Grafana primary database.
# redis: config like redis server e.g. `addr=127.0.0.1:6379,pool_size=100,db=0,ssl=false`. Only addr is required. ssl may be 'true', 'false', or 'insecure'.
# memcache: 127.0.0.1:11211
connstr=
#################################### Data proxy ###########################
[dataproxy]
# This enables data proxy logging, default is false
logging=false
# How long the data proxy waits before timing out, default is 30 seconds.
# This setting also applies to core backend HTTP data sources where query requests use an HTTP client with timeout set.
timeout=30
# How many seconds the data proxy waits before sending a keepalive request.
keep_alive_seconds=30
# How many seconds the data proxy waits for a successful TLS Handshake before timing out.
tls_handshake_timeout_seconds=10
# How many seconds the data proxy will wait for a server's first response headers after
# fully writing the request headers if the request has an "Expect: 100-continue"
# header. A value of 0 will result in the body being sent immediately, without
# waiting for the server to approve.
expect_continue_timeout_seconds=1
# The maximum number of idle connections that Grafana will keep alive.
max_idle_connections=100
# How many seconds the data proxy keeps an idle connection open before timing out.
idle_conn_timeout_seconds=90
# If enabled and user is not anonymous, data proxy will add X-Grafana-User header with username into the request.
# Number of dashboard versions to keep (per dashboard). Default: 20, Minimum: 1
versions_to_keep=20
# Minimum dashboard refresh interval. When set, this prevents users from setting a dashboard's refresh interval lower than the given interval. The default is 5 seconds.
# The interval string is a possibly signed sequence of decimal numbers, followed by a unit suffix (ms, s, m, h, d), e.g. 30s or 1m.
min_refresh_interval=1s
# Path to the default home dashboard. If this value is empty, then Grafana uses StaticRootPath + "dashboards/home.json"
# Set to true to automatically assign new users to the default organization (id 1)
auto_assign_org=true
# Set this value to automatically add new users to the provided organization (if auto_assign_org above is set to true)
auto_assign_org_id=1
# Default role new users will be automatically assigned (if auto_assign_org above is set to true)
auto_assign_org_role=Viewer
# Require email validation before sign up completes
verify_email_enabled=false
# Background text for the user field on the login page
login_hint=email or username
password_hint=password
# Default UI theme ("dark" or "light")
default_theme=dark
# External user management
external_manage_link_url=
external_manage_link_name=
external_manage_info=
# Viewers can edit/inspect dashboard settings in the browser. But not save the dashboard.
viewers_can_edit=false
# Editors can administrate dashboard, folders and teams they create
editors_can_admin=false
# The duration in time a user invitation remains valid before expiring. This setting should be expressed as a duration. Examples: 6h (hours), 2d (days), 1w (week). Default is 24h (24 hours). The minimum supported duration is 15m (15 minutes).
user_invite_max_lifetime_duration=24h
[auth]
# Login cookie name
login_cookie_name=grafana_session
# The maximum lifetime (duration) an authenticated user can be inactive before being required to login at next visit. Default is 7 days (7d). This setting should be expressed as a duration, e.g. 5m (minutes), 6h (hours), 10d (days), 2w (weeks), 1M (month). The lifetime resets at each successful token rotation (token_rotation_interval_minutes).
login_maximum_inactive_lifetime_duration=
# The maximum lifetime (duration) an authenticated user can be logged in since login time before being required to login. Default is 30 days (30d). This setting should be expressed as a duration, e.g. 5m (minutes), 6h (hours), 10d (days), 2w (weeks), 1M (month).
login_maximum_lifetime_duration=
# How often should auth tokens be rotated for authenticated users when being active. The default is each 10 minutes.
token_rotation_interval_minutes=10
# Set to true to disable (hide) the login form, useful if you use OAuth
disable_login_form=false
# Set to true to disable the signout link in the side menu. Useful if you use auth.proxy
disable_signout_menu=false
# URL to redirect the user to after sign out
signout_redirect_url=
# Set to true to attempt login with OAuth automatically, skipping the login screen.
# This setting is ignored if multiple OAuth providers are configured.
oauth_auto_login=false
# OAuth state max age cookie duration in seconds. Defaults to 600 seconds.
oauth_state_cookie_max_age=600
# Limit on how many seconds an API key may live before expiring
api_key_max_seconds_to_live=-1
# Set to true to enable SigV4 authentication option for HTTP-based datasources
# Used for uploading images to public servers so they can be included in slack/email messages.
# You can choose between (s3, webdav, gcs, azure_blob, local)
provider=
[external_image_storage.s3]
endpoint=
path_style_access=
bucket_url=
bucket=
region=
path=
access_key=
secret_key=
[external_image_storage.webdav]
url=
username=
password=
public_url=
[external_image_storage.gcs]
key_file=
bucket=
path=
enable_signed_urls=false
signed_url_expiration=
[external_image_storage.azure_blob]
account_name=
account_key=
container_name=
[external_image_storage.local]
# does not require any configuration
[rendering]
# Options to configure a remote HTTP image rendering service, e.g. using https://github.com/grafana/grafana-image-renderer.
# URL to a remote HTTP image renderer service, e.g. http://localhost:8081/render, will enable Grafana to render panels and dashboards to PNG-images using HTTP requests to an external service.
server_url=
# If the remote HTTP image renderer service runs on a different server than the Grafana server you may have to configure this to a URL where Grafana is reachable, e.g. http://grafana.domain/.
callback_url=
# Concurrent render request limit affects when the /render HTTP endpoint is used. Rendering many images at the same time can overload the server,
# which this setting can help protect against by only allowing a certain amount of concurrent requests.
concurrent_render_request_limit=30
[panels]
# here to support old env variables, can remove after a few months
enable_alpha=false
disable_sanitize_html=false
[plugins]
enable_alpha=false
app_tls_skip_verify_insecure=false
# Enter a comma-separated list of plugin identifiers to identify plugins that are allowed to be loaded even if they lack a valid signature.
# Instruct headless browser instance to use a default timezone when not provided by Grafana, e.g. when rendering panel image of alert.
# See ICU’s metaZones.txt (https://cs.chromium.org/chromium/src/third_party/icu/source/data/misc/metaZones.txt) for a list of supported
# timezone IDs. Falls back to the TZ environment variable if not set.
rendering_timezone=
# Instruct headless browser instance to use a default language when not provided by Grafana, e.g. when rendering panel image of alert.
# Please refer to the HTTP header Accept-Language to understand how to format this value, e.g. 'fr-CH, fr;q=0.9, en;q=0.8, de;q=0.7, *;q=0.5'.
rendering_language=
# Instruct headless browser instance to use a default device scale factor when not provided by Grafana, e.g. when rendering panel image of alert.
# Default is 1. Using a higher value will produce more detailed images (higher DPI), but will require more disk space to store an image.
rendering_viewport_device_scale_factor=
# Instruct headless browser instance whether to ignore HTTPS errors during navigation. Per default HTTPS errors are not ignored. Due to
# the security risk it's not recommended to ignore HTTPS errors.
rendering_ignore_https_errors=
# Instruct headless browser instance whether to capture and log verbose information when rendering an image. Default is false and will
# only capture and log error messages. When enabled, debug messages are captured and logged as well.
# For the verbose information to be included in the Grafana server log you have to adjust the rendering log level to debug, configure
# [log].filter = rendering:debug.
rendering_verbose_logging=
# Instruct headless browser instance whether to output its debug and error messages into running process of remote rendering service.
# Default is false. This can be useful to enable (true) when troubleshooting.
rendering_dumpio=
# Additional arguments to pass to the headless browser instance. Default is --no-sandbox. The list of Chromium flags can be found
# here (https://peter.sh/experiments/chromium-command-line-switches/). Multiple arguments are separated by commas.
rendering_args=
# You can configure the plugin to use a different browser binary instead of the pre-packaged version of Chromium.
# Please note that this is not recommended, since you may encounter problems if the installed version of Chrome/Chromium is not
# compatible with the plugin.
rendering_chrome_bin=
# Instruct how headless browser instances are created. Default is 'default' and will create a new browser instance on each request.
# Mode 'clustered' will make sure that only a maximum of browsers/incognito pages can execute concurrently.
# Mode 'reusable' will have one browser instance and will create a new incognito page on each request.
rendering_mode=
# When rendering_mode = clustered you can instruct how many browsers or incognito pages can execute concurrently. Default is 'browser'
# and will cluster using browser instances.
# Mode 'context' will cluster using incognito pages.
rendering_clustering_mode=
# When rendering_mode = clustered you can define the maximum number of browser instances/incognito pages that can execute concurrently.
rendering_clustering_max_concurrency=
# Limit the maximum viewport width, height and device scale factor that can be requested.
rendering_viewport_max_width=
rendering_viewport_max_height=
rendering_viewport_max_device_scale_factor=
# Change the listening host and port of the gRPC server. Default host is 127.0.0.1 and default port is 0 and will automatically assign
# a port not in use.
grpc_host=
grpc_port=
[enterprise]
license_path=
[feature_toggles]
# enable features, separated by spaces
enable=
[date_formats]
# For information on which formatting patterns are supported, see https://momentjs.com/docs/#/displaying/
# Default system date format used in time range picker and other places where full time is displayed
full_date=YYYY-MM-DD HH:mm:ss
# Used by graph and other places where we only show small intervals
interval_second=HH:mm:ss
interval_minute=HH:mm
interval_hour=MM/DD HH:mm
interval_day=MM/DD
interval_month=YYYY-MM
interval_year=YYYY
# Experimental feature
use_browser_locale=false
# Default timezone for user preferences. Options are 'browser' for the browser local timezone or a timezone name from IANA Time Zone database, e.g. 'UTC' or 'Europe/Amsterdam' etc.