Commit Graph

276 Commits (2d5f9e66c1c1c2b1db970b564ff0176273a1a727)

Author SHA1 Message Date
Dustin c1dc52ac29 Merge branch 'loki' 2024-11-05 07:01:13 -06:00
Dustin 39d9985fbd r/loki-caddy: Caddy reverse proxy for Loki
Caddy handles TLS termination for Loki, automatically requesting and
renewing its certificate via ACME.
2024-11-05 06:54:27 -06:00
Dustin 010f652060 hosts: Add loki1.p.b
_loki1.pyrocufflink.blue_ replaces _loki0.pyrocufflink.blue_.  The
former runs Fedora Linux and is managed by Ansible, while the latter ran
Fedora CoreOS and was managed by Ignition and _cfg_.
2024-11-05 06:54:27 -06:00
Dustin 168bfee911 r/webites: Add apps.du5t1n.xyz F-Droid repo
I want to publish the _20125_ Status application to an F-Droid
repository to make it easy for Tabitha to install and update.  F-Droid
repositories are similar to other package repositories: a collection of
packages and some metadata files.  Although there is a fully-fledged
server-side software package that can manage F-Droid repositories, it's
not required: the metadata files can be pre-generated and then hosted by
a static web server just fine.

This commit adds configuration for the web server and reverse proxy to
host the F-Droid repository at _apps.du5t1n.xyz_.
2024-11-05 06:47:02 -06:00
Dustin 7e8aee072e r/bitwarden_rs: Redirect to canonical host name
Bitwarden has not worked correctly for clients using the non-canonical
domain name (i.e. _bitwarden.pyrocufflink.blue_) for quite some time.
This still trips me up occasionally, though, so hopefully adding a
server-side redirect will help.  Eventually, I'll probably remove the
non-canonical name entirely.
2024-11-05 06:37:03 -06:00
Dustin 370a1df7ac dch-proxy: Proxy for dynk8s-provisioner
The reverse proxy needs to handle traffic for the _dynk8s-provisioner_
in order for the ephemeral Jenkins worker nodes in the cloud to work
properly.
2024-11-05 06:30:02 -06:00
Dustin 4cd983d5f4 loki: Add role+playbook for Grafana Loki
The current Grafana Loki server, *loki0.pyrocufflink.blue*, runs Fedora
CoreOS and is managed by Ignition and *cfg*.  Since I have declared
*cfg* a failed experiment, I'm going to re-deploy Loki on a new VM
running Fedora Linux and managed by Ansible.

The *loki* role installs Podman and defines a systemd-managed container
to run Grafana Loki.
2024-10-20 12:10:55 -05:00
Dustin 4ac79ba18d minio-backups: No syslog for nginx access logs
MinIO/S3 clients generate a _lot_ of requests.  It's also not
particularly useful to have these stored in Loki anyway.  As such, we'll
stop routing them to syslog/journal.

Having access logs is somewhat useful for troubleshooting, but really
for only live requests (i.e. what's happening right now).  We therefore
keep the access logs around in a file, but only for one day, so as not
to fill up the filesystem with logs we'll never see.
2024-10-20 12:10:17 -05:00
Dustin 4ae25192d0 vm-hosts: Fix domain label
The `__path__` label is automatically changed to `filename` before the
processing pipeline begins.
2024-10-14 12:32:25 -05:00
Dustin 36145cb2ee minio-backups: Disable nginx log files
We don't need local log files when messages are already stored locally
in the journal and remotely in Loki.
2024-10-14 12:00:19 -05:00
Dustin a0c5ffc869 postgresql: Collect Wal-G metrics with statsd_exporter
_wal-g_ can send StatsD metrics when it completes an upload/backup/etc.
task.  Using the `statsd_exporter`, we can capture these metrics and
make them available to Victoria Metrics.
2024-10-13 20:01:19 -05:00
Dustin 221d3a2be9 vm-hosts: Scrape libvirt logs with Promtail
Collecting logs from VM serial consoles and QEMU monitor.
2024-10-13 18:33:25 -05:00
Dustin 9bea8e1ce7 nextcloud: Scrape logs with Promtail
Nextcloud writes JSON-structured logs to
`/var/lib/nextcloud/data/nextcloud.log`.  These logs contain errors,
etc. from the Nextcloud server, which are useful for troubleshooting.
Having them in Loki will allow us to view them in Grafan as well as
generate alerts for certain events.
2024-10-13 18:05:50 -05:00
Dustin 5ced24f2be hosts: Decommission matrix0.p.b
The Synapse server hasn't been working for a while, but we don't use it
for anything any more anyway.
2024-10-13 12:53:49 -05:00
Dustin dfdddd551f minio-backups: Keep nginx logs for 3 days
_WAL-G_ and _restic_ both generate a lot of HTTP traffic, which fills up
the log volume pretty quickly.  Let's reduce the number of days logs are
kept on the file system.  Logs are shipped to Loki anyway, so there's
not much need to have them local very long.
2024-09-29 11:21:24 -05:00
Dustin 0353360360 dch-proxy: Allow Internet access to IN
Invoice Ninja needs to be accessible from the Internet in order to
receive webhooks from Stripe.  Additionally, Apple Pay requires
contacting Invoice Ninja for domain verification.
2024-09-10 12:01:00 -05:00
Dustin 621f82c88d hosts: Migrate remaining hosts to Restic
Gitea and Vaultwarden both have SQLite databases.  We'll need to add
some logic to ensure these are in a consistent state before beginning
the backup.  Fortunately, neither of them are very busy databases, so
the likelihood of an issue is pretty low.  It's definitely more
important to get backups going again sooner, and we can deal with that
later.
2024-09-07 20:45:24 -05:00
Dustin c2c283c431 nextcloud: Back up Nextcloud with Restic
Now that the database is hosted externally, we don't have to worry about
backing it up specifically.  Restic only backs up the data on the
filesystem.
2024-09-04 17:41:42 -05:00
Dustin 0f4dea9007 restic: Add role+playbook for Restic backups
The `restic.yml` playbook applies the _restic_ role to hosts in the
_restic_ group.  The _restic_ role installs `restic` and creates a
systemd timer and service unit to run `restic backup` every day.

Restic doesn't really have a configuration file; all its settings are
controlled either by environment variables or command-line options. Some
options, such as the list of files to include in or exclude from
backups, take paths to files containing the values.  We can make use of
these to provide some configurability via Ansible variables.  The
`restic_env` variable is a map of environment variables and values to
set for `restic`.  The `restic_include` and `restic_exclude` variables
are lists of paths/patterns to include and exclude, respectively.
Finally, the `restic_password` variable contains the password to decrypt
the repository contents.  The password is written to a file and exposed
to the _restic-backup.service_ unit using [systemd credentials][0].

When using S3 or a compatible service for respository storage, Restic of
course needs authentication credentials.  These can be set using the
`restic_aws_credentials` variable.  If this variable is defined, it
should be a map containing the`aws_access_key_id` and
`aws_secret_access_key` keys, which will be written to an AWS shared
credentials file.  This file is then exposed to the
_restic-backup.service_ unit using [systemd credentials][0].

[0]: https://systemd.io/CREDENTIALS/
2024-09-04 09:40:29 -05:00
Dustin 72936b3868 postgresql: Allow access by IPv6
Since LAN clients have IPv6 addresses now, some may try to connect to
the database over IPv6, so we need to allow this in the host-based
authentication rules.
2024-09-02 21:20:26 -05:00
Dustin a0378feda8 nextcloud: Move database to db0
Moving the Nextcloud database to the central PostgreSQL server will
allow it to take advantage of the monitoring and backups in place there.
For backups specifically, this will make it easier to switch from BURP
to Restic, since now only the contents of the filesystem need backed up.

The PostgreSQL server on _db0_ requires certificate authentication for
all clients.  The certificate for Nextcloud is stored in a Secret in
Kubernetes, so we need to use the _nextcloud-db-cert_ role to install
the script to fetch it.  Nextcloud configuration doesn't expose the
parameters for selecting the certificate and private key files, but
fortunately, they can be encoded in the value provided to the `host`
parameter, though it makes for a rather cumbersome value.
2024-09-02 21:03:33 -05:00
Dustin 7f599e9058 dch-proxy: Proxy Jellyfin
Allow access to Jellyfin from the Internet via the reverse proxy.  The
Jellyfin backend server has a separate port that supports the PROXY
protocol.
2024-09-01 12:42:07 -05:00
Dustin e323324c54 postgresql: Switch wal-g to use new MinIO server
Switching to the MinIO server on _chromie.pyrocufflink.blue_ as
_burp1.pyrocufflink.blue_ is being decommissioned.
2024-09-01 09:01:04 -05:00
Dustin fbf587414a hosts: Add chromie.p.b
*chromie.pyrocufflink.blue* will replace *burp1.pyrocufflink.blue* as
the backup server.  It is running on the hardware that was originally
*nvr1.pyrocufflink.blue*: a 1U Jetway server with an Intel Celeron N3160
CPU and 4 GB of RAM.
2024-09-01 09:01:04 -05:00
Dustin 9d60ae1a61 minio-backups: Deploy MinIO for backups
This playbook uses the *minio-nginx* and *minio-backups-cert* role to
deploy MinIO with nginx.

The S3 API server is *s3.backups.pyrocufflink.blue*, and buckets can be
accessed as subdomains of this name.

The Admin Console is *minio.backups.pyrocufflink.blue*.

Certificates are issued by DCH CA via ACME using `certbot`.
2024-09-01 08:59:28 -05:00
Dustin 3511176c31 r/gitea: Configure SMTP mailer
Gitea needs SMTP configuration in order to send e-mail notifications
about e.g. pull requests.  The `gitea_smtp` variable can be defined to
enable this feature.
2024-08-25 08:46:37 -05:00
Dustin 85da487cb8 r/dch-proxy: Define sites declaratively
I've already made a couple of mistakes keeping the HTTP and HTTPS rules
in sync.  Let's define the sites declaratively and derive the HAProxy
rules from the data, rather then manually type the rules.
2024-08-24 11:48:45 -05:00
Dustin 2fa28dfa5f r/dch-proxy: Update and clean up
The *dch-proxy* role has not been used for quite some time.  The web
server has been handling the reerse proxy functionality, in addition to
hosting websites.  The drawback to using Apache as the reverse proxy,
though, is that it operates in TLS-terminating mode, so it needs to have
the correct certificate for every site and application it proxies for.
This is becoming cumbersome, especially now that there are several sites
that do not use the _pyrocufflink.net_ wildcard certificate.  Notably,
Tabitha's _hatchlearningcenter.org_ is problematic because although the
main site are hosted by the web server, the Invoice Ninja client portal
is hosted in Kubernetes.

Switching back to HAProxy to provide the reverse proxy functionality
will eliminate the need to have the server certificate both on the
backend and on the reverse proxy, as it can operate in TLS-passthrough
mode.  The main reason I stopped using HAProxy in the first place was
because when using TLS-passthrough mode, the original source IP address
is lost.  Fortunately, HAProxy and Apache can both be configured to use
the PROXY protocol, which provides a mechanism for communicating the
original IP address while still passing through the TLS connection
unmodified.  This is particularly important for Nextcloud because of its
built-in intrusion prevention; without knowing the actual source IP
address, it blocks _everyone_, since all connections appear to come from
the reverse proxy's IP address.

Combining TLS-passthrough mode with the PROXY protocol resolves both the
certificate management issue and the source IP address issue.

I've cleaned up the _dch-proxy_ role quite a bit in this commit.
Notably, I consolidated all the backend and frontend definitions into a
single file; it didn't really make sense to have them all separate,
since they were managed by the same role and referred to each other.  Of
course, I had to update the backends to match the currently-deployed
applications as well.
2024-08-24 11:46:28 -05:00
Dustin 153b210a73 vm-hosts: Do not reboot after auto updates
For obvious reasons, the VM hosts cannot automatically reboot
themselves.
2024-08-23 09:33:29 -05:00
Dustin c546f09335 smtp-relay: Rewrite dustin@hatch.name
Sometimes, the mail server for *hatch.name* is extremely slow.  While
there isn't much I can do about it for external senders, I can at least
ensure that email messages sent by internal services like Authelia are
always delivered quickly by rewriting the recipient address to my
actualy email address, bypassing the *hatch.name* exchange entirely.
2024-08-22 16:17:00 -05:00
Dustin a2cf78f3f5 vm-hosts: Update vm-autostart
*logs0.pyrocufflink.blue* has been replaced by *loki0.pyrocufflink.blue*
since ages, so I'm not sure how I hadn't updated the autostart list with
it yet.

*unifi3.pyrocufflink.blue* replaced *unifi2.p.b* recently, when I was
testing *Luci*/etcd.
2024-08-14 20:26:11 -05:00
Dustin 6d65e0594f frigate: Configure HTTPS proxy with creds
Only the _frigate_ user is allowed to access the Github API via the
proxy.
2024-08-14 20:26:11 -05:00
Dustin d2b3b1f7b3 hosts: Deploy production Frigate on nvr2.p.b
*nvr2.pyrocufflink.blue* originally ran Fedora CoreOS.  Since I'm tired
of the tedium and difficulty involved in making configuration changes to
FCOS machines, I am migrating it to Fedora Linux, managed by Ansible.
2024-08-12 22:22:50 -05:00
Dustin 6c71d96f81 r/frigate-caddy: Deploy Caddy in front of Frigate
Deploying Caddy as a reverse proxy for Frigate enables HTTPS with a
certificate issued by the internal CA (via ACME) and authentication via
Authelia.

Separating the installation and base configuratieon of Caddy into its
own role will allow us to reuse that part for other sapplications that
use Caddy for similar reasons.
2024-08-12 18:47:04 -05:00
Dustin 7b61a7da7e r/useproxy: Configure system-wide proxy
The *useproxy* role configures the `http_proxy` et al. environmet
variables for systemd services and interactive shells.  Additionally, it
configures Yum repositories to use a single mirror via the `baseurl`
setting, rather than a list of mirrors via `metalink`, since the proxy
a) the proxy only allows access to _dl.fedoraproject.org_ and b) the
proxy caches RPM files, but this is only effective if all clients use
the same mirror all the time.

The `useproxy.yml` playbook applies this role to servers in the
*needproxy* group.
2024-08-12 18:47:04 -05:00
Dustin 96bc8c2c09 vm-hosts: Update autostart list
*k8s-amd64-n0*, *k8s-amd64-n1*, and *k8s-amd64-n2* have been replaced by
*k8s-amd64-n4*, *k8s-amd64-n5*, *k8s-amd64-n6*, respectively.  *db0* is
the new database server, which needs to be up before anything in
Kubernetes starts, since a lot of applications running there depend on
it.
2024-07-03 08:52:15 -05:00
Dustin 4f202c55e4 r/postgres-exporter: Deploy postgres-exporter
The [postgres-exporter][0] exposes PostgreSQL server statistics to
Prometheus.  It connects to a specified PostgreSQL server (in this
case, a server on the local machine via UNIX socket) and collects data
from the `pg_stat_activity`, et al. views.  It needs the `pg_monitor`
role in order to be allowed to read the relevant metrics.

Since we're setting up the exporter to connect via UNIX socket, it needs
a dedicated OS user to match the PostgreSQL user in order to
authenticate via the _peer_ method.

[0]: https://github.com/prometheus-community/postgres_exporter/
2024-07-02 20:44:29 -05:00
Dustin 3f5550ee6c postgresql: wal-g: Set PGHOST
By default, WAL-G tries to connect to the PostgreSQL server via TCP
socket on the loopback interface.  Our HBA configuration requires
certificate authentication for TCP sockets, so we need to configure
WAL-G to use the UNIX socket.
2024-07-02 20:44:29 -05:00
Dustin 6caf28259e hosts: db0: Promote to primary
All data have been migrated from the PostgreSQL server in Kubernetes and
the three applications that used it (Firefly-III, Authelia, and Home
Assistant) have been updated to point to the new server.

To avoid comingling the backups from the old server with those from the
new server, we're reconfiguring WAL-G to push and pull from a new S3
prefix.
2024-07-02 20:44:29 -05:00
Dustin 208fadd2ba postgresql: Configure for dedicated DB servers
I am going to use the *postgresql* group for the dedicated database
servers.  The configuration for those machines will be quite a bit
different than for the one existing machine that is a member of that
group already: the Nextcloud server.  Rather than undefine/override all
the group-level settings at the host level, I have removed the Nextcloud
server from the *postgresql* group, and updated the `nextcloud.yml`
playbook to apply the *postgresql-server* role itself.

Eventually, I want to move the Nextcloud database to the central
database servers.  At that point, I will remove the *postgresql-server*
role from the `nextcloud.yml` playbook.
2024-07-02 20:44:29 -05:00
Dustin 7201f7ed5c vm-hosts: Expose storage VLAN to VMs
To improve the performance of persistent volumes accessed directly from
the Synology by Kubernetes pods, I've decided to expose the storage
network to the Kubernetes worker node VMs.  This way, iSCSI traffic does
not have to go through the firewall.

I chose not to use the physical interfaces that are already directly
connected to the storage network for this for two reasons: 1) I like
the physical separation of concerns and 2) it would add complexity to
the setup by introducing a bridge on top of the existing bond.
2024-06-23 10:43:15 -05:00
Dustin 6520b86958 k8s-controller: Do not reboot after auto-updates
I don't want the Kubernetes control plane servers rebooting themselves
randomly; I need to coordinate that with other goings-on on the network.
2024-06-23 10:43:15 -05:00
Dustin f0445ebe53 nextcloud: Do not auto-update Nextcloud
Nextcloud usually (always?) wants the `occ upgrade` command to be run
after an update.  If the *nextcloud* package gets updated along with
the rest of the OS, Nextcloud will be down until I manually run that
command hours/days later.
2024-06-23 10:43:15 -05:00
Dustin 24bf145a34 all: Do not auto-update on weekends
I don't want machines updating themselves, rebooting, and potentially
breaking stuff over the weekend.
2024-06-21 22:08:03 -05:00
Dustin 88c45e22b6 vm-hosts: Update VM autostart for new DCs 2024-06-20 18:49:04 -05:00
Dustin 292ab4585c all: promtail: Update trusted CA certificate
Loki uses a certificate signed by *dch-ca r2* now (actually has for
quite some time...)
2024-06-12 18:57:01 -05:00
Dustin ffe972d79b r/samba-cert: Obtain LDAP/TLS cert via ACME
The *samba-cert* role configures `lego` and HAProxy to obtain an X.509
certificate via the ACME HTTP-01 challenge.  HAProxy is necessary
because LDAP server certificates need to have the apex domain in their
SAN field, and the ACME server may contact *any* domain controller
server with an A record for that name.  HAProxy will forward the
challenge request on to the first available host on port 5000, where
`lego` is listening to provide validation.

Issuing certificates this way has a couple of advantages:

1. No need for the wildcard certificate for the *pyrocufflink.blue*
   domain any more
2. Renewals are automatic and handled by the server itself rather than
   Ansible via scheduled Jenkins job

Item (2) is particularly interesting because it avoids the bi-monthly
issue where replacing the LDAP server certificate and restarting Samba
causes the Jenkins job to fail.

Naturally, for this to work correctly, all LDAP client applications
need to trust the certificates issued by the ACME server, in this case
*DCH Root CA R2*.
2024-06-12 18:33:24 -05:00
Dustin 58972cf188 auto-updates: Install and configure dnf-automatic
*dnf-automatic* is an add-on for `dnf` that performs scheduled,
automatic updates.  It works pretty much how I would want it to:
triggered by a systemd timer, sends email reports upon completion, and
only reboots for kernel et al. updates.

In its default configuration, `dnf-automatic.timer` fires every day.  I
want machines to update weekly, but I want them to update on different
days (so as to avoid issues if all the machines reboot at once).  Thus,
the _dnf-automatic_ role uses a systemd unit extension to change the
schedule.  The day-of-the-week is chosen pseudo-randomly based on the
host name of the managed system.
2024-06-12 06:25:17 -05:00
Dustin 1f86fa27b6 vm-hosts: Auto-start unifi2 2024-05-26 10:51:16 -05:00
Dustin 5a9b8b178a hosts: Decommission unifi1
*unifi1.pyrocufflink.blue* is being replaced with
*unifi2.pyrocufflink.blue*.  The new server runs Fedora CoreOS.
2024-05-26 10:50:32 -05:00