This alert will fire once the MD RAID resynchronization process has
completed and both disks in the array are online. It will clear when
one disk is disconnected and moved to the safe.
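A minimal sketch of the rule, assuming the array is `md0` and the metric comes from node_exporter's mdadm collector (and that a disk still resyncing is not yet counted as active):

```yaml
groups:
  - name: burp-raid
    rules:
      # Fires while both mirror members are active (resync complete);
      # clears once one disk is detached and moved to the safe.
      - alert: BurpArrayFullyPopulated
        expr: node_md_disks{device="md0",state="active"} == 2
        for: 5m
```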
When BURP fails to even *start* a backup, it does not send a
notification at all. As a result, I may not notice for days that
backups are not happening. That was the case this week, when clients'
backups were failing immediately because of a file permissions issue on
the server. To hopefully avoid missing backups for too long in the
future, I've added two new alerts, sketched below:
* The *no recent backups* alert fires if there have not been *any* BURP
backups recently. This may also fire, for example, if the BURP
exporter is not working, or if there is something wrong with the BURP
data volume.
* The *missed client backup* alert fires if an active BURP client (i.e.
one that has had at least one backup in the past 90 days) has not been
backed up in the last 24 hours.
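Roughly, the rules look like this sketch; `burp_last_backup` is a
stand-in for whatever per-client timestamp metric the BURP exporter
actually exposes:

```yaml
groups:
  - name: burp
    rules:
      # No client at all has completed a backup in the last day; also
      # fires (via absent) if the exporter stops producing the metric.
      - alert: NoRecentBackups
        expr: >
          min(time() - burp_last_backup) > 86400
          or absent(burp_last_backup)
      # An active client (backed up within 90 days) missed a day.
      - alert: MissedClientBackup
        expr: >
          (time() - burp_last_backup) > 86400
          and (time() - burp_last_backup) < (90 * 86400)
```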
Using a 30-day window for the `tlast_change_over_time` function
effectively "caps out" the value at 30 days. Thus, the alert reminding
me to swap the BURP backup volume will never fire, since the value will
never be greater than the 30-day threshold. Using a wider window
resolves that issue (though the query will still produce inaccurate
results beyond the window).
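With a wider window, the computed age can actually exceed the threshold;
something like this MetricsQL expression (90 days is an arbitrary
choice, and the metric name assumes node_exporter's mdadm collector):

```
time() - tlast_change_over_time(node_md_disks{device="md0",state="active"}[90d])
    > 30 * 86400
```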
The `tls cafile` setting in `smb.conf` is not necessary. It is used for
verifying peer certificates for mutual TLS authentication, not for
specifying the intermediate certificate authority chain, as I had
thought.
The setting cannot simply be left out, though. If it is not specified,
Samba will attempt to load a file from a built-in default path, which
will fail, causing the server to crash. This is avoided by setting the
value to the empty string.
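In other words, something like this in `smb.conf` (the certificate
paths are just examples):

```
[global]
    tls enabled  = yes
    tls certfile = tls/cert.pem
    tls keyfile  = tls/key.pem
    # empty value: do not load the built-in default CA file path
    tls cafile   =
```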
We're going to run MinIO on the BURP server to provide a backup target
for the [Postgres Operator][0]/[WAL-E][1]. Although the Postgres
Operator can also take backups via [WAL-G][2], which supports additional
targets such as SFTP, it cannot restore from those targets. As such,
the best way to get fully-featured backups for the Postgres Operator,
including environment cloning, etc., is to use S3. Since I absolutely
do not want to store my backups "in the cloud," using MinIO seems like a
decent alternative. Running it on the BURP server
allows the backups to be stored and rotated along with regular system
backups.
[0]: https://github.com/zalando/postgres-operator/
[1]: https://github.com/wal-e/wal-e
[2]: https://github.com/wal-g/wal-g
[MinIO][0] is an S3-compatible object storage server, designed to
provide storage for cloud-native applications in on-premises
deployments.
MinIO has not been packaged for Fedora (yet?). As such, the best way to
deploy it is using its official container image. Here, we are using
`podman-systemd-generator` (Quadlet) to generate a systemd service
unit to manage the container process.
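A sketch of what the Quadlet unit might look like; the image tag,
ports, and paths are assumptions:

```
# /etc/containers/systemd/minio.container
[Unit]
Description=MinIO object storage

[Container]
Image=quay.io/minio/minio:latest
Exec=server /data --console-address :9001
Volume=/srv/minio:/data:Z
PublishPort=9000:9000
EnvironmentFile=/etc/minio/minio.env

[Install]
WantedBy=multi-user.target
```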
The `tlast_change_over_time` function needs an interval wide enough to
cover the range of time we are interested in. In this case, we want
to see if the BURP volume has been swapped in the last thirty days, so
the interval needs to be `30d`.
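For example (again assuming the disk count comes from node_exporter):

```
time() - tlast_change_over_time(node_md_disks{state="active"}[30d])
```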
I don't use GnuPG for anything else anymore, so it's becoming rather
cumbersome to keep setting it up just for the Ansible Vault secret.
Since I use Bitwarden to store the passphrase for my PGP key anyway, it
makes sense to just store the Ansible Vault secret there directly.
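Ansible can read the secret from any executable passed via
`--vault-password-file` (or `vault_password_file` in `ansible.cfg`), so
the script can be a trivial wrapper around the Bitwarden CLI; a sketch,
with a hypothetical item name:

```sh
#!/bin/sh
# Print the Ansible Vault secret stored in Bitwarden.  Requires an
# unlocked `bw` session (BW_SESSION in the environment); the item
# name "ansible-vault" is hypothetical.
exec bw get password ansible-vault
```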
*dc0.p.b* has been gone for a while now. All the current domain
controllers use LDAPS certificates signed by Let's Encrypt and include
the *pyrocufflink.blue* name, so we can now use the apex domain A record
to connect to the directory.
This alert counts how long it's been since the number of "active" disks
in the RAID array on the BURP server has changed. The assumption is
that the number will typically be `1`, but it will be `2` while the
second disk synchronizes before the swap occurs.
1. Grafana 8 changed the format of the query string parameters for the
Explore page.
2. vmalert no longer needs the `-http.pathPrefix` argument when behind a
reverse proxy; instead, it uses the request path like the other
Victoria Metrics components.
The way I am handling swapping out the BURP disk now is by using the
Linux MD RAID driver to manage a RAID 1 mirror array. The array
normally operates with one disk missing, as it is in the fireproof safe.
When it is time to swap the disks, I reattach the offline disk, let the
array resync, then disconnect and store the other disk.
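The swap itself boils down to a few `mdadm` commands; the device names
here are examples:

```sh
# Reattach the disk from the safe and let the mirror resync
mdadm /dev/md0 --add /dev/sdb1
# Watch the resync progress
cat /proc/mdstat
# Once the resync finishes, detach the other disk for the safe
mdadm /dev/md0 --fail /dev/sda1 --remove /dev/sda1
```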
This works considerably better than the previous method, as it does not
require BURP or the NFS server to be offline during the synchronization.
The BURP storage volume is now backed by a Linux MD RAID array, so we
want to monitor its state. Furthermore, since this machine is a
physical device, we should monitor its thermal characteristics as well.
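node_exporter already exposes both through its mdadm and hwmon
collectors, so the new rules can be simple; a sketch (the temperature
threshold is a guess):

```yaml
# The array intentionally runs with one disk absent, so only alert on
# disks that have actually failed.
- alert: MdRaidDiskFailed
  expr: node_md_disks{state="failed"} > 0
# Thermal alert for the physical host.
- alert: HostHighTemperature
  expr: node_hwmon_temp_celsius > 80
  for: 10m
```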
The rule is: "if it is accessible on the Internet, its name ends in
*.net*."
Although Vaultwarden can be accessed by either name, the one specified
in the Domain URL setting is the only one that works for WebAuthn.
The HTTP->HTTPS redirect for chmod777.sh was only working by
coincidence. It needs its own virtual host to ensure it works
irrespective of how other websites are configured.
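Assuming Apache httpd (details of the real configuration may differ),
the dedicated virtual host is as simple as:

```
<VirtualHost *:80>
    ServerName chmod777.sh
    # Redirect all plain-HTTP requests to HTTPS, regardless of how the
    # other virtual hosts are configured
    Redirect permanent / https://chmod777.sh/
</VirtualHost>
```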
Tabitha's Hatch Learning Center site has two user submission forms: one
for signing in/out students for class, and another for parents to
register new students for the program. These are handled by
*formsubmit* and store data in CSV spreadsheets.
Domain controllers only allow users in the *Domain Admins* AD group to
use `sudo` by default. *dustin* and *jenkins* need to be able to apply
configuration policy to these machines, but they are not members of said
group.
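A drop-in along these lines grants them access; the file name is
hypothetical, and whether `NOPASSWD` is appropriate depends on how the
policy is applied:

```
# /etc/sudoers.d/config-policy
dustin   ALL=(ALL) NOPASSWD: ALL
jenkins  ALL=(ALL) NOPASSWD: ALL
```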
If the Python bindings for SELinux policy management are not installed
when Ansible gathers host facts, no SELinux-related facts will be set.
Thus, any tasks that are conditional based on these facts will not run.
Typically, such tasks are required for SELinux-enabled hosts, but must
not be performed for non-SELinux hosts. If they are not run when they
should, the deployment may fail or applications may experience issues at
runtime.
To avoid these potential issues, the *base* role now forces Ansible to
gather facts again if it installed the Python SELinux bindings.
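In the role, this amounts to something like the following; the package
and task names are illustrative:

```yaml
- name: Install Python SELinux bindings
  ansible.builtin.package:
    name: python3-libselinux
  register: selinux_bindings

- name: Re-gather facts if the bindings were just installed
  ansible.builtin.setup:
  when: selinux_bindings is changed
```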
Note: one might suggest using `meta: clear_facts` instead of `setup` and
letting Ansible decide if and when to gather facts again. Unfortunately,
for some reason that doesn't work; the `clear_facts` meta task just
causes Ansible to crash with a "shared connection to {host} closed"
error.
Some playbooks/roles require facts from machines other than the target.
The `facts.yml` playbook can be used to gather facts from machines
without running any other tasks.
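The playbook itself is trivial, roughly:

```yaml
# facts.yml (sketch): gather facts from the targeted hosts, run nothing
- hosts: all
  gather_facts: true
  tasks: []
```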
The *dch-selinux* package contains a SELinux policy module for Samba AD
DC. This policy defines a `samba_t` domain for the `samba` process.
While the domain is (currently) unconfined, it is necessary in order to
provide a domain transition rule for `winbindd`. Without this rule,
`winbindd` would run in `unconfined_service_t`, which causes its IPC
pipe files to be incorrectly labelled, preventing other confined
services like `sshd` from accessing them.
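The essence of the transition, in raw TE syntax (a sketch of the idea,
not the actual *dch-selinux* source):

```
# When a samba_t process executes a file labelled winbind_exec_t, the
# child runs in winbind_t instead of unconfined_service_t
type_transition samba_t winbind_exec_t:process winbind_t;
allow samba_t winbind_t:process transition;
allow winbind_t winbind_exec_t:file entrypoint;
```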
The *dch-selinux* package contains customized SELinux policy modules.
I haven't worked out exactly how to build and publish it through a
continuous integration pipeline yet, so for now it's just hosted in my
user `public_html` folder on the main file server.
Samba AD DC does not implement [DFS-R for replication of the SYSVOL][0]
contents. This does not make much of a difference to me, since
the SYSVOL is really only used for Group Policy. Windows machines may
log an error if they cannot access the (basically empty) GPO files, but
that's pretty much the only consequence of the SYSVOL being out of sync
between domain controllers.
Unfortunately, there is one side-effect of the missing DFS-R
functionality that does matter. On domain controllers, all user,
computer, and group accounts need to have Unix UID/GID numbers mapped.
This is different from regular member machines, which only need UID/GID
numbers for the users who are (or will be) allowed to log into them.
LDAP entries only have ID numbers mapped for the latter class of users,
which does not include machine accounts. As a result, Samba falls back to
generating local ID numbers for the rest of the accounts. Those ID
numbers are stored in a local database file,
`/var/lib/samba/private/idmap.ldb`. It would seem that it wouldn't
actually matter if accounts have different ID numbers on different
domain controllers, but there are evidently [situations][1] where DCs
refuse to allocate ID numbers at all, which can cause authentication to
fail. As such, the `idmap.ldb` file needs to be kept in sync.
If we're going to go through the effort of synchronizing `idmap.ldb`, we
might as well keep the SYSVOL in sync as well. To that end, I've
written a script to synchronize both the SYSVOL contents and the
`idmap.ldb` file. It performs a simple one-way synchronization using
`rsync` from the DC with the PDC emulator role, as discovered using DNS
SRV records. To ensure the `idmap.ldb` file is in a consistent state,
it only copies the most recent backup file. If the copied file differs
from the local one, the script stops Samba and restores the local
database from the backup. It then flushes Samba's caches and restarts
the service. Finally, it fixes the NT ACLs on the contents of the
SYSVOL.
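In outline, the script does something like this (simplified; the SRV
record follows the standard AD convention for locating the PDC
emulator):

```sh
#!/bin/sh
set -e

# Find the DC holding the PDC emulator role via DNS
pdc=$(dig +short SRV _ldap._tcp.pdc._msdcs.pyrocufflink.blue \
    | awk '{sub(/\.$/, "", $4); print $4}')

# One-way sync of the SYSVOL and the idmap.ldb backup from the PDC
rsync -a "${pdc}:/var/lib/samba/sysvol/" /var/lib/samba/sysvol/
rsync -a "${pdc}:/var/lib/samba/private/idmap.ldb.bak" /tmp/idmap.ldb.bak

# Restore the database only if it actually changed
if ! cmp -s /tmp/idmap.ldb.bak /var/lib/samba/private/idmap.ldb.bak; then
    systemctl stop samba
    cp /tmp/idmap.ldb.bak /var/lib/samba/private/idmap.ldb
    net cache flush
    systemctl start samba
    samba-tool ntacl sysvolreset
fi
```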
Since the contents of the SYSVOL are owned by root, naturally the
synchronization process has to run as root as well. To limit the scope
of control this gives the process, we use as many of systemd's
sandboxing features as possible. Further, the SSH key pairs
the DCs use to authenticate to one another are restricted to only
running rsync. As such, the `sysvolsync` script itself cannot run
`tdbbackup` to back up `idmap.ldb`. To handle that, I've created a
systemd service and corresponding timer unit to run `tdbbackup`
periodically.
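The units are simple; a sketch (the unit names are assumed):

```
# tdbbackup-idmap.service
[Service]
Type=oneshot
# writes /var/lib/samba/private/idmap.ldb.bak
ExecStart=/usr/bin/tdbbackup /var/lib/samba/private/idmap.ldb

# tdbbackup-idmap.timer
[Timer]
OnCalendar=hourly
Persistent=true

[Install]
WantedBy=timers.target
```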
I considered for a long time how to best implement this process, and
although I chose this naïve implementation, I am not exactly happy with
it. Since I do not fully understand *why* keeping
the `idmap.ldb` file in sync is necessary, there are undoubtedly cases
where blindly copying it from the PDC emulator is not correct. There
are definitely cases where the contents of the SYSVOL can be updated on
a DC besides the PDC emulator, but again, we should not run into them
because we don't really use the SYSVOL at all. In the end, I think this
solution is good enough for our needs without being overly complicated.
[0]: https://wiki.samba.org/index.php?title=SysVol_replication_(DFS-R)&oldid=18120
[1]: https://lists.samba.org/archive/samba/2021-November/238370.html
We need to import the `dyngroups.yml` playbook so that the dynamic host
groups are populated. Without this, the *RedHat* group is empty, so the
*collectd-version* role is never applied.
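That is, something like this at the top of the playbook:

```yaml
- import_playbook: dyngroups.yml

- hosts: RedHat
  roles:
    - collectd-version
```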
I changed the naming convention for domain controller machines. They
are no longer "numbered," since the plan is to rotate through them
quickly. For each release of Fedora, we'll create two new domain
controllers, replacing the existing ones. Their names are now randomly
generated and contain letters and numbers, so the Blackbox Exporter
check for DNS records needs to account for this.
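A sketch of the relevant Blackbox Exporter DNS module, with the answer
validation loosened to accept any alphanumeric host name (the module
and record names are examples):

```yaml
modules:
  dns_dc_srv:
    prober: dns
    dns:
      query_name: _ldap._tcp.pyrocufflink.blue
      query_type: SRV
      validate_answer_rrs:
        fail_if_none_matches_regexp:
          - '\sIN\s+SRV\s.*\s[a-z0-9]+\.pyrocufflink\.blue\.$'
```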