Nextcloud writes JSON-structured logs to
`/var/lib/nextcloud/data/nextcloud.log`. These logs contain errors,
etc. from the Nextcloud server, which are useful for troubleshooting.
Having them in Loki will allow us to view them in Grafan as well as
generate alerts for certain events.
_WAL-G_ and _restic_ both generate a lot of HTTP traffic, which fills up
the log volume pretty quickly. Let's reduce the number of days logs are
kept on the file system. Logs are shipped to Loki anyway, so there's
not much need to have them local very long.
Invoice Ninja needs to be accessible from the Internet in order to
receive webhooks from Stripe. Additionally, Apple Pay requires
contacting Invoice Ninja for domain verification.
Gitea and Vaultwarden both have SQLite databases. We'll need to add
some logic to ensure these are in a consistent state before beginning
the backup. Fortunately, neither of them are very busy databases, so
the likelihood of an issue is pretty low. It's definitely more
important to get backups going again sooner, and we can deal with that
later.
The `restic.yml` playbook applies the _restic_ role to hosts in the
_restic_ group. The _restic_ role installs `restic` and creates a
systemd timer and service unit to run `restic backup` every day.
Restic doesn't really have a configuration file; all its settings are
controlled either by environment variables or command-line options. Some
options, such as the list of files to include in or exclude from
backups, take paths to files containing the values. We can make use of
these to provide some configurability via Ansible variables. The
`restic_env` variable is a map of environment variables and values to
set for `restic`. The `restic_include` and `restic_exclude` variables
are lists of paths/patterns to include and exclude, respectively.
Finally, the `restic_password` variable contains the password to decrypt
the repository contents. The password is written to a file and exposed
to the _restic-backup.service_ unit using [systemd credentials][0].
When using S3 or a compatible service for respository storage, Restic of
course needs authentication credentials. These can be set using the
`restic_aws_credentials` variable. If this variable is defined, it
should be a map containing the`aws_access_key_id` and
`aws_secret_access_key` keys, which will be written to an AWS shared
credentials file. This file is then exposed to the
_restic-backup.service_ unit using [systemd credentials][0].
[0]: https://systemd.io/CREDENTIALS/
Since LAN clients have IPv6 addresses now, some may try to connect to
the database over IPv6, so we need to allow this in the host-based
authentication rules.
Moving the Nextcloud database to the central PostgreSQL server will
allow it to take advantage of the monitoring and backups in place there.
For backups specifically, this will make it easier to switch from BURP
to Restic, since now only the contents of the filesystem need backed up.
The PostgreSQL server on _db0_ requires certificate authentication for
all clients. The certificate for Nextcloud is stored in a Secret in
Kubernetes, so we need to use the _nextcloud-db-cert_ role to install
the script to fetch it. Nextcloud configuration doesn't expose the
parameters for selecting the certificate and private key files, but
fortunately, they can be encoded in the value provided to the `host`
parameter, though it makes for a rather cumbersome value.
*chromie.pyrocufflink.blue* will replace *burp1.pyrocufflink.blue* as
the backup server. It is running on the hardware that was originally
*nvr1.pyrocufflink.blue*: a 1U Jetway server with an Intel Celeron N3160
CPU and 4 GB of RAM.
This playbook uses the *minio-nginx* and *minio-backups-cert* role to
deploy MinIO with nginx.
The S3 API server is *s3.backups.pyrocufflink.blue*, and buckets can be
accessed as subdomains of this name.
The Admin Console is *minio.backups.pyrocufflink.blue*.
Certificates are issued by DCH CA via ACME using `certbot`.
Gitea needs SMTP configuration in order to send e-mail notifications
about e.g. pull requests. The `gitea_smtp` variable can be defined to
enable this feature.
I've already made a couple of mistakes keeping the HTTP and HTTPS rules
in sync. Let's define the sites declaratively and derive the HAProxy
rules from the data, rather then manually type the rules.
The *dch-proxy* role has not been used for quite some time. The web
server has been handling the reerse proxy functionality, in addition to
hosting websites. The drawback to using Apache as the reverse proxy,
though, is that it operates in TLS-terminating mode, so it needs to have
the correct certificate for every site and application it proxies for.
This is becoming cumbersome, especially now that there are several sites
that do not use the _pyrocufflink.net_ wildcard certificate. Notably,
Tabitha's _hatchlearningcenter.org_ is problematic because although the
main site are hosted by the web server, the Invoice Ninja client portal
is hosted in Kubernetes.
Switching back to HAProxy to provide the reverse proxy functionality
will eliminate the need to have the server certificate both on the
backend and on the reverse proxy, as it can operate in TLS-passthrough
mode. The main reason I stopped using HAProxy in the first place was
because when using TLS-passthrough mode, the original source IP address
is lost. Fortunately, HAProxy and Apache can both be configured to use
the PROXY protocol, which provides a mechanism for communicating the
original IP address while still passing through the TLS connection
unmodified. This is particularly important for Nextcloud because of its
built-in intrusion prevention; without knowing the actual source IP
address, it blocks _everyone_, since all connections appear to come from
the reverse proxy's IP address.
Combining TLS-passthrough mode with the PROXY protocol resolves both the
certificate management issue and the source IP address issue.
I've cleaned up the _dch-proxy_ role quite a bit in this commit.
Notably, I consolidated all the backend and frontend definitions into a
single file; it didn't really make sense to have them all separate,
since they were managed by the same role and referred to each other. Of
course, I had to update the backends to match the currently-deployed
applications as well.
Sometimes, the mail server for *hatch.name* is extremely slow. While
there isn't much I can do about it for external senders, I can at least
ensure that email messages sent by internal services like Authelia are
always delivered quickly by rewriting the recipient address to my
actualy email address, bypassing the *hatch.name* exchange entirely.
*logs0.pyrocufflink.blue* has been replaced by *loki0.pyrocufflink.blue*
since ages, so I'm not sure how I hadn't updated the autostart list with
it yet.
*unifi3.pyrocufflink.blue* replaced *unifi2.p.b* recently, when I was
testing *Luci*/etcd.
*nvr2.pyrocufflink.blue* originally ran Fedora CoreOS. Since I'm tired
of the tedium and difficulty involved in making configuration changes to
FCOS machines, I am migrating it to Fedora Linux, managed by Ansible.
Deploying Caddy as a reverse proxy for Frigate enables HTTPS with a
certificate issued by the internal CA (via ACME) and authentication via
Authelia.
Separating the installation and base configuratieon of Caddy into its
own role will allow us to reuse that part for other sapplications that
use Caddy for similar reasons.
The *useproxy* role configures the `http_proxy` et al. environmet
variables for systemd services and interactive shells. Additionally, it
configures Yum repositories to use a single mirror via the `baseurl`
setting, rather than a list of mirrors via `metalink`, since the proxy
a) the proxy only allows access to _dl.fedoraproject.org_ and b) the
proxy caches RPM files, but this is only effective if all clients use
the same mirror all the time.
The `useproxy.yml` playbook applies this role to servers in the
*needproxy* group.
*k8s-amd64-n0*, *k8s-amd64-n1*, and *k8s-amd64-n2* have been replaced by
*k8s-amd64-n4*, *k8s-amd64-n5*, *k8s-amd64-n6*, respectively. *db0* is
the new database server, which needs to be up before anything in
Kubernetes starts, since a lot of applications running there depend on
it.
The [postgres-exporter][0] exposes PostgreSQL server statistics to
Prometheus. It connects to a specified PostgreSQL server (in this
case, a server on the local machine via UNIX socket) and collects data
from the `pg_stat_activity`, et al. views. It needs the `pg_monitor`
role in order to be allowed to read the relevant metrics.
Since we're setting up the exporter to connect via UNIX socket, it needs
a dedicated OS user to match the PostgreSQL user in order to
authenticate via the _peer_ method.
[0]: https://github.com/prometheus-community/postgres_exporter/
By default, WAL-G tries to connect to the PostgreSQL server via TCP
socket on the loopback interface. Our HBA configuration requires
certificate authentication for TCP sockets, so we need to configure
WAL-G to use the UNIX socket.
All data have been migrated from the PostgreSQL server in Kubernetes and
the three applications that used it (Firefly-III, Authelia, and Home
Assistant) have been updated to point to the new server.
To avoid comingling the backups from the old server with those from the
new server, we're reconfiguring WAL-G to push and pull from a new S3
prefix.
I am going to use the *postgresql* group for the dedicated database
servers. The configuration for those machines will be quite a bit
different than for the one existing machine that is a member of that
group already: the Nextcloud server. Rather than undefine/override all
the group-level settings at the host level, I have removed the Nextcloud
server from the *postgresql* group, and updated the `nextcloud.yml`
playbook to apply the *postgresql-server* role itself.
Eventually, I want to move the Nextcloud database to the central
database servers. At that point, I will remove the *postgresql-server*
role from the `nextcloud.yml` playbook.
To improve the performance of persistent volumes accessed directly from
the Synology by Kubernetes pods, I've decided to expose the storage
network to the Kubernetes worker node VMs. This way, iSCSI traffic does
not have to go through the firewall.
I chose not to use the physical interfaces that are already directly
connected to the storage network for this for two reasons: 1) I like
the physical separation of concerns and 2) it would add complexity to
the setup by introducing a bridge on top of the existing bond.
Nextcloud usually (always?) wants the `occ upgrade` command to be run
after an update. If the *nextcloud* package gets updated along with
the rest of the OS, Nextcloud will be down until I manually run that
command hours/days later.
The *samba-cert* role configures `lego` and HAProxy to obtain an X.509
certificate via the ACME HTTP-01 challenge. HAProxy is necessary
because LDAP server certificates need to have the apex domain in their
SAN field, and the ACME server may contact *any* domain controller
server with an A record for that name. HAProxy will forward the
challenge request on to the first available host on port 5000, where
`lego` is listening to provide validation.
Issuing certificates this way has a couple of advantages:
1. No need for the wildcard certificate for the *pyrocufflink.blue*
domain any more
2. Renewals are automatic and handled by the server itself rather than
Ansible via scheduled Jenkins job
Item (2) is particularly interesting because it avoids the bi-monthly
issue where replacing the LDAP server certificate and restarting Samba
causes the Jenkins job to fail.
Naturally, for this to work correctly, all LDAP client applications
need to trust the certificates issued by the ACME server, in this case
*DCH Root CA R2*.
*dnf-automatic* is an add-on for `dnf` that performs scheduled,
automatic updates. It works pretty much how I would want it to:
triggered by a systemd timer, sends email reports upon completion, and
only reboots for kernel et al. updates.
In its default configuration, `dnf-automatic.timer` fires every day. I
want machines to update weekly, but I want them to update on different
days (so as to avoid issues if all the machines reboot at once). Thus,
the _dnf-automatic_ role uses a systemd unit extension to change the
schedule. The day-of-the-week is chosen pseudo-randomly based on the
host name of the managed system.
The UniFi controller can act as a syslog server, receiving log messages
from managed devices and writing them to files in the `logs/remote`
directory under the application data directory. We can scrape these
logs, in addition to the logs created by the UniFi server itself, with
Promtail to get more information about what's happening on the network.
Promtail is the log sending client for Grafana Loki. For traditional
Linux systems, an RPM package is available from upstream, making
installation fairly simple. Configuration is stored in a YAML file, so
again, it's straightforward to configure via Ansible variables. Really,
the only interesting step is adding the _promtail_ user, which is
created by the RPM package, to the _systemd-journal_ group, so that
Promtail can read the systemd journal files.
The internal DNS server for the *pyrocufflink.blue* et al. domains runs
on the firewall now, and is thus no longer managed by Ansible. Dropping
the group variables so the file encrypted with Ansible Vault can go
away.
The *dustinandtabitha.com* website no longer uses *formsubmit* (the time
for RSVP has **long** passed). Removing the configuration so the
file encrypted with Ansible Vault can go away.
The `TrustedUserCAKeys` setting for *sshd(8)* tells the server to accept
any certificates signed by keys listed in the specified file.
The authenticating username has to match one of the principals listed in
the certificate, of course.
This role is applied to all machines, via the `base.yml` playbook.
Certificates issued by the user CA managed by SSHCA will therefore be
trusted everywhere. This brings us one step closer to eliminating the
dependency on Active Directory/Samba.
Normal users do not need shell access to the file server, and certainly
should not be allowed to e.g. forward ports through it. Using a `Match`
block, we can apply restrictions to users who do not need administrative
functionality. In this case, we restrict everyone who is not a member
of the *Server Admins* group in the PYROCUFFLINK AD domain.
By default, `sudo` requires users to authenticate with their passwords
before granting them elevated privileges. It can be configured to
allow (some) users access to (some) privileged commands without
prompting for a password (i.e. `NOPASSWD`), however this has a real
security implication. Disabling the password requirement would
effectively grant *any* program root privileges. Prompting for a
password prevents malicious software from running privileged commands
without the user knowing.
Unfortunately, handling `sudo` authentication for Ansible is quite
cumbersome. For interactive use, the `--ask-become-pass`/`-K` argument
is useful, though entering the password for each invocation of
`ansible-playbook` while iterating on configuration policy development
is a bit tedious. For non-interactive use, though, the password of
course needs to be stored somewhere. Encrypting it with Ansible Vault
is one way to protect it, but it still ends up stored on disk somewhere
and needs to be handled carefully.
*pam_ssh_agent_auth* provides an acceptable solution to both issues. It
is better than disabling `sudo` authentication entirely, but a lot more
convenient than dealing with passwords. It uses the calling user's SSH
agent to assert that the user has access to a private key corresponding
to one of the authorized public keys. Using SSH agent forwarding, that
private key can even exist on a remote machine. If the user does not
have a corresponding private key, `sudo` will fall back to normal
password-based authentication.
The security of this solution is highly dependent on the client to store
keys appropriately. FIDO2 keys are supported, though when used with
Ansible, it is quite annoying to have to touch the token for _every
task_ on _every machine_. Thus, I have created new FIDO2 keys for both
my laptop and my desktop that have the `no-touch-required` option
enabled. This means that in order to use `sudo` remotely, I still need
to have my token plugged in to my computer, but I do not have to tap it
every time it's used.
For Jenkins, a hardware token is obviously impossible, but using a
dedicated key stored as a Jenkins credential is probably sufficient.