This alert counts how long its been since the number of "active" disks
in the RAID array on the BURP server has changed. The assumption is
that the number will typically be `1`, but it will be `2` when the
second disk synchronized before the swap occurs.
1. Grafana 8 changed the format of the query string parameters for the
Explore page.
2. vmalert no longer needs the http.pathPrefix argument when behind a
reverse proxy, rather it uses the request path like the other
Victoria Metrics components.
The way I am handling swapping out the BURP disk now is by using the
Linux MD RAID driver to manage a RAID 1 mirror array. The array
normally operates with one disk missing, as it is in the fireproof safe.
When it is time to swap the disks, I reattach the offline disk, let the
array resync, then disconnect and store the other disk.
This works considerably better than the previous method, as it does not
require BURP or the NFS server to be offline during the synchronization.
The rule is "if it is accessible on the Internet, its name ends in .net"
Although Vaultwarden can be accessed by either name, the one specified
in the Domain URL setting is the only one that works for WebAuthn.
Domain controllers only allow users in the *Domain Admins* AD group to
use `sudo` by default. *dustin* and *jenkins* need to be able to apply
configuration policy to these machines, but they are not members of said
group.
I changed the naming convention for domain controller machines. They
are no longer "numbered," since the plan is to rotate through them
quickly. For each release of Fedora, we'll create two new domain
controllers, replacing the existing ones. Their names are now randomly
generated and contain letters and numbers, so the Blackbox Exporter
check for DNS records needs to account for this.
Gitea package names (e.g. OCI images, etc.) can contain `/` charactres.
These are encoded as %2F in request paths. Apache needs to forward
these sequences to the Gitea server without decoding them.
Unfortunately, the `AllowEncodedSlashes` setting, which controls this
behavior, is a per-virtualhost setting that is *not* inherited from the
main server configuration, and therefore must be explicitly set inside
the `VirtualHost` block. This means Gitea needs its own virtual host
definition, and cannot rely on the default virtual host.
When I added the *systemd-networkd* configuration for the Kubernetes
network interface on the VM hosts, I only added the `.netdev`
configuration and forgot the `.network` part. Without the latter,
*systemd-networkd* creates the interface, but does not configure or
activate it, so it is not able to handle traffic for the VMs attached to
the bridge.
The *netboot/jenkins-agent* Ansible role configures three NBD exports:
* A single, shared, read-only export containing the Jenkins agent root
filesystem, as a SquashFS filesystem
* For each defined agent host, a writable data volume for Jenkins
workspaces
* For each defined agent host, a writable data volume for Docker
Agent hosts must have some kind of unique value to identify their
persistent data volumes. Raspberry Pi devices, for example, can use the
SoC serial number.
The `-external.url` and `-external.alert.source` command line arguments
and their corresponding environment variables can be used to configure
the "Source" links associated with alerts created by `vmalert`.
The firewall hardware is too slow to run the *prometheus_speedtest*
program. It always showed *way* lower speeds than were actually
available. I've moved the service to the Kubernetes cluster and it
works a lot better there.
*mtrcs0.pyrocufflink.red* is a Raspberry Pi CM4 on a Waveshare
CM4-IO-BASE-B carrier board with a NVMe SSD. It runs a custom OS built
using Buildroot, and is not a member of the *pyrocufflink.blue* AD
domain.
*mtrcs0.p.r* hosts Victoria Metrics/`vmagent`, `vmalert`, AlertManager,
and Grafana. I've created a unique group and playbook for it,
*metricspi*, to manage all these applications together.
In addition to ignoring particular types of filesystems, e.g. OverlayFS,
we can also ignore filesystems by their mount point. This could be
useful, for example, for bind-mounted directories, such as those used on
Kubernetes nodes.
There is no specific playbook or role for Kubernetes. All OS
configuration is done at install time via kickstart scripts, and
deploying Kubernetes itself is done (manually) using `kubeadm init` and
`kubeadm join`.
Adding a second camera to the back yard, on the North side of the porch,
to try and figure out how the possums keep getting under the porch even
with the chicken wire around it!
We're trying to discover how the possums are getting into and out of the
house. Let's enable continuous video recording from the back yard
camera so we can observe them and come up with a plan to get rid of
them.
To handle the RSVP form on *dustinandtabitha.com*, we are going to use
*formsubmit*. It runs on the same machine that hosts the website, so
there's no dealing with CORS. The */submit/rsvp* path, which is proxied
to the backend, is the RSVP form's target.
The state history database is entirely too big. It takes over an hour
to create a backup of it, which usually causes BURP to time out. The
data it stores isn't particularly interesting anyway. Instead of trying
to back it up and ultimately not getting any backup at all, we'll just
skip it altogether to ensure we have a consistent backup of everything
else that is actually important.
If Nextcloud does not have the Internet-facing reverse proxy listed in
its "trusted proxies" setting, it will mark all traffic as being from
the proxy itself. This breaks brute force detection, etc.
Transitioning from push-based to pull-based monitoring with
Prometheus/collectd. The *write_prometheus* plugin will be installed on
all hosts, and Prometheus will be configured to scrape them directly.
Having this option enabled dramatically improves the reliability of
collectd multicast traffic from physical machines and VMs on a separate
VM host from the receiving machine.
1. Set a password for *root* on all machines (useful for logging in via
serial console if network is down)
2. Set an authorized SSH key for root on all machines:
* For Fedora 34, use my FIDO2 security token key
* For all other hosts, use my ED25519 key
Filesystems like NFS and CIFS require "helper" utilities (i.e.
`mount.nfs` and `mount.cifs`, respectively). These need to be installed
in order for a system to be able to mount those filesystems.
The current shared storage system uses NFSv4, and as such, the
*nfs-utils* package needs to be installed on the VM hosts.
Originally, the network configuration for the VM networks and the
storage network was configured using the *netifaces* role. This has
effectively stopped working in recent versions of Fedora, as it sort of
relied on `dhcpcd`, which has not been maintained in Fedora for a while
and no longer behaves correctly. After evaluating *NetworkManager* as a
replacement, I decided that *systemd-networkd* is a more appropriate
solution.
There are effectively two "layers" of network configuration needed for
the VM hosts: the host-specific settings, and the common settings. The
host-specific settings include such properties as the IP address of the
management interface and the names of the physical ports that make up
the bonded interfaces. The common settings are the bonded interfaces,
the VLAN interfaces created on top of the bond, and the bridges that
provide access to VMs.
To configure the host-specific settings, each host simply needs the
appropriate `networkd_*` variables in its `host_vars` file. For the
common settings, we apply the *systemd-networkd* role again in the
`vmhost.yml` with different values for these variables. Thus,
effectively, `systemd-networkd.yml` manages the host-specific settings,
while `vmhost.yml` manages the common settings.
I couldn't get RTMP to work on the Back Yard camera because the `ffmpeg`
process kept crashing:
```
ffmpeg.back_yard.clips_rtmp ERROR : av_interleaved_write_frame(): Connection reset by peer
ffmpeg.back_yard.clips_rtmp ERROR : [flv @ 0x5562090c8ec0] Failed to update header with correct duration.
ffmpeg.back_yard.clips_rtmp ERROR : [flv @ 0x5562090c8ec0] Failed to update header with correct filesize.
ffmpeg.back_yard.clips_rtmp ERROR : Error writing trailer of rtmp://127.0.0.1/live/back_yard: Connection reset by peer
watchdog.back_yard INFO : Terminating the existing ffmpeg process...
watchdog.back_yard INFO : Waiting for ffmpeg to exit gracefully...
```
I thought increasing the value of `--shm-size` argument for `podman`
would help, but even going as high as 1024 mebibytes did not resolve the
problem.
Ultimately, I decided that it is not really necessary to view the full
4k stream in real time. The back yard camera supports three streams, so
I set them all up for different roles. I briefly considered using a
single 1080p stream for both object detection and RTMP streaming, but
this consumed considerable CPU time, so I decided against it for now. I
may re-evaluate that option if I decide to purchase a TPU.
*nvr0.pyrocufflink.blue* hosts Frigate. It is deployed on a separate
subnet, for two reasons:
* To avoid streaming video from the cameras through the firewall
* To prevent any hosts on the LAN except Home Assistant from
communicating with Frigate, since it does not have any kind of
authentication or access control
For hosts that cannot send metrics via multicast (e.g. because they are
on a different subnet), *collectd* needs to listen on the all-hosts
unicast address.
ZwaveJS2Mqtt includes a very powerful web-based UI for configuring and
controlling the Z-Wave network. This functionality is no longer
available within Home Assistant itself, so being able to access the
ZwaveJS2Mqtt UI is crucial to operating the network.
I wanted to make the UI available at */zwave/*, which requires using
*mod_rewrite* to conditionally proxy requests based on the `Connection`
HTTP header, since the UI passes both HTTP and WebSocket requests to the
same paths. *mod_rewrite* configuration is not inherited from the main
server configuration to virtual hosts, so the
`RewriteRule`/`RewriteCond` directives have to be specified within the
`<VirtualHost>` block. This means that the Home Assistant proxy
configuration has to be within its own virtual host, and the
Zwavejs2Mqtt configuration has to be there as well.
Mosquitto 2.x included two significant changes from 1.6:
* There is no longer a "default" listener; all listeners are configured
in the same way
* The daemon drops privileges *before* reading TLS certificates and
private keys
Although configuration policy is not yet available for Prometheus
itself, the `collectd.yml` playbook also uses the *prometheus* host
group. Specifically, hosts in this group are configured to receive
collectd data from other hosts and expose those data through the
`write_prometheus` plugin.
This commit introduces the *grafana* role and the corresponding
`grafana.yml` playbook. The role installs Grafana using the system
package manager, and configures the server (including LDAP
authentication).
Since the Nextcloud configuration file is managed by the configuration
policy, all of the settings configurable through the web UI need to be
templated. One important group of settings is the outbound email
configuration. This can now be configured using the `nextcloud_smtp`
Ansible variable.
Fedora now includes a packaged version of Nextcloud. This will be
_much_ easier to maintain than the tarball-based distribution method.
There are some minor differences in how the Fedora package works,
compared to the upstream tarball. Notably, it puts the configuration
file in `/etc/` and makes it read-only, and it stores persistent data
separate from the application. These differences require modifications
to the Apache and PHP-FPM configuration, but the package also included
examples to make this easier. Since the `config.php` is read-only now,
it has to be managed by the configuration policy; it cannot be modified
by the Administration web UI.
*Mosquitto* implements an MQTT server. It is the recommended
implementation for using MQTT with Home Assistant.
I have added this role to deploy Mosquitto on the Home Assistant server.
It will be used to send data from custom sensors, such as the
temperature/pressure/humidity sensor connected to the living room wall
display.
When there is a network issue that prevents DNS names from being
resolved, it can be difficult to troubleshoot. For example, last night,
the Samba domain controller crashed, so *pyrocufflink.blue* names were
unavailable. Furthermore, the domain controller VM was apparently
locked up, so I could not SSH into it directly, and it needed to be
rebooted. Since the VM host's name did not resolve, I could not find
its address to log into it and reboot the VM. I resorted to scanning
the SSH keys of every IP address on the network until I found the one
that matched the cached key in ~/.ssh/known_hosts. This was cumbersome
and annoying.
Assigning DHCP reservations to the VM hosts will ensure that when a
situation like this arises again, I can quickly connect to the correct
VM host and manage its virtual machines, as its address is recorded in
the configuration policy.
The *synapse* role and the corresponding `synapse.yml` playbook deploy
Synapse, the reference Matrix homeserver implementation.
Deploying Synapse itself is fairly straightforward: it is packaged by
Fedora and therefore can simply be installed via `dnf` and started by
`systemd`. Making the service available on the Internet, however, is
more involved. The Matrix protocol mostly works over HTTPS on the
standard port (443), so a typical reverse proxy deployment is mostly
sufficient. Some parts of the Matrix protocol, however, involve
communication over an alternate port (8448). This could be handled by a
reverse proxy as well, but since it is a fairly unique port, it could
also be handled by NAT/port forwarding. In order to support both
deployment scenarios (as well as the hypothetical scenario wherein the
Synapse machine is directly accessible from the Internet), the *synapse*
role supports specifying an optional `matrix_tls_cert` variable. If
this variable is set, it should contain the path to a certificate file
on the Ansible control machine that will be used for the "direct"
connections (i.e. on port 8448). If it is not set, the default Apache
certificate will be used for both virtual hosts.
Synapse has a pretty extensive configuration schema, but most of the
options are set to their default values by the *synapse* role. Other
than substituting secret keys, the only exposed configuration option is
the LDAP authentication provider.
Since DNS only allowed to be sent over the VPN, it is not possible to
resolve the VPN server name unless the VPN is already connected. This
naturally creates a chicken-and-egg scenario, which we can resolve by
manually providing the IP address of the server we want to connect to.
This commit adds a new playbook, `protonvpn.yml`, and its supporting
roles *strongswan-swanctl* and *protonvpn*. This playbook configures
strongSwan to connect to ProtonVPN using IPsec/IKEv2.
With this playbook, we configure the name servers on the Pyrocufflink
network to route all DNS requests through the Cloudflare public DNS
recursive servers at 1.1.1.1/1.0.0.1 over ProtonVPN. Using this setup,
we have the benefit of the speed of using a public DNS server (which is
*significantly* faster than running our own recursive server, usually by
1-2 seconds per request), and the benefit of anonymity from ProtonVPN.
Using the public DNS server alone is great for performance, but allows
the server operator (in this case Cloudflare) to track and analyze usage
patterns. Using ProtonVPN gives us anonymity (assuming we trust
ProtonVPN not to do the very same tracking), but can have a negative
performance impact if its used for all Internet traffic. By combining
these solutions, we can get the benefits of both!
This commit adds two new variables to the *named* role:
`named_queries_syslog` and `named_rpz_syslog`. These variables control
whether BIND will send query and RPZ log messages to the local syslog
daemon, respectively.
BIND response policy zones (RPZ) support provides a mechanism for
overriding the responses to DNS queries based on a wide range of
criteria. In the simplest form, a response policy zone can be used to
provide different responses to different clients, or "block" some DNS
names.
For the Pyrocufflink and related networks, I plan to use an RPZ to
implement ad/tracker blocking. The goal will be to generate an RPZ
definition from a collection of host lists (e.g. those used by uBlock
Origin) periodically.
This commit introduces basic support for RPZ configuration in the
*named* role. It can be activated by providing a list of "response
policy" definitions (e.g. `zone "name"`) in the `named_response_policy`
variable, and defining the corresponding zones in `named_zones`.
The TCP ports 80 and 443 are NAT-forwarded by the gateway to 172.30.0.6.
This address was originally occupied by *rprx0.pyrocufflink.blue*, which
operated a reverse proxy for Bitwarden, Gitea, Jenkins, Nextcloud,
OpenVPN, and the public web sites. Now, *web0.pyrocufflink.blue* is
configured to proxy for those services that it does not directly host
itself, effectively making it a web server, a reverse proxy, and a
forward proxy (for OpenVPN only). Thus, it must take over this address
in order to receive forwarded traffic from the Internet.
This commit updates the configuration for *pyrocufflink.net* to use the
wildcard certificate managed by *lego* instead of an unique certificate
managed by *certbot*.
*chmod777.sh* is a simple static website, generated by Hugo. It is
built and published from a Jenkins pipeline, which runs automatically
when new commits are pushed to Gitea.
The HTTPS certificate for this site is signed by Let's Encrypt and
managed by `lego` in the `certs` submodule.
The *nextcloud* role installs Nextcloud from the specified release
archive, downloading it to the control machine first if necessary, and
configures Apache and PHP-FPM to serve it.
The `nextcloud.yml` playbook uses the *cert* role to install the X.509
certificate for the Nextcloud server, sets up Apache HTTPD with the
*apache* role, and installs Nextcloud using the *nextcloud* role.
The host *cloud0.pyrocufflink.blue* is the Nextcloud server for
Pyrocufflink.
*burp1.pyrocufflink.blue* will replace *burp0.pyrocufflink.blue* as the
BURP server for Pyrocufflink. It is a physical machine (Fitlet), making
it simpler to manage the USB drives. The old virtual machine will be
decommissioned soon.
Using the generic *burp.pyrocufflink.blue* name will allow easier
transition to a new BURP server. However, since this is not the actual
name, it cannot be used for task delegation, so a separate variable is
required to store the real name of the BURP server. This is only used
during client deployment, and not by BURP itself.
In order to allow Jenkins to connect to the Docker daemon socket, the
socket must be owned by the *docker* group, and the *jenkins* user must
be a member of it.
*serial0.pyrocufflink.blue* has a manually-configured IP address now, to
ensure it always has an addresss, even if the DHCP server is
unavailable. Recording it here to ensure the address does not
accidentally get reused.
This commit configures *bw0.pyrocufflink.blue* as a BURP client, so that
the Bitwarden data can be backed up. A pre-backup script is used to
take a consistent snapshot of the SQLite database before copying it to
the BURP server.
*dns1.pyrocufflink.blue* has been decommissioned. Having a second DNS
server never really worked correctly for some reason, and the
maintenance overhead of the Raspberry Pi is just not worth it right now.
The DHCP service has been moved to *dns0.pyrocufflink.blue*.