1
0
Fork 0
Commit Graph

50 Commits (94b7168b1eb4f6304c85305d899790a3162a5dec)

Author SHA1 Message Date
Dustin 7dffb5195a v-m: alertmanager: Group disk usage alerts
Some machines have the same volume mounted multiple times (e.g.
container hosts, BURP).  Alerts will fire for all of these
simultaneously when the filesystem usage passes the threshold.  To avoid
getting spammed with a bunch of messages about the same filesystem,
we'll group alerts from the same machine.
2024-08-17 10:59:05 -05:00
Dustin 02001f61db v-m/scrape: webistes: Stop scraping Matrix
I'm not using Matrix for anything anymore, and it seems to have gone
offline.  I haven't fully decommissioned it yet, but the Blackbox scrape
is failing, so I'll just disable that bit for now.
2024-08-17 10:57:22 -05:00
Dustin c7e4baa466 v-m: scrape: Remove nvr2.p.b Zincati scrape target
I've redeployed *nvr2.pyrocufflink.blue* as Fedora Linux, so it does not
run Zincati anymore.
2024-08-17 10:56:06 -05:00
Dustin 1a631bf366 v-m: scrape: Remove serial1.p.b
This machine never worked correctly; the USB-RS232 adapters would stop
working randomly (and of course it would be whenever I needed to
actually use them).  I thought it was something wrong with the server
itself (a Raspberry Pi 3), but the same thing happened when I tried
using a Pi 4.

The new backup server has a plethora of on-board RS-232 ports, so I'm
going to use it as the serial console server, too.
2024-08-17 10:54:21 -05:00
Dustin 6f7f09de85 v-m: scrape: Update Unifi server target
I've rebuilt the Unifi Network controller machine (again);
*unifi3.pyrocufflink.blue* has replaced *unifi2.p.b*.  The
`unifi_exporter` no longer works with the latest version of Unifi
Network, so it's not deployed on the new machine.
2024-08-17 10:52:51 -05:00
Dustin 809676f691 v-m: alerts: Add Longhorn alerts 2024-08-17 10:51:13 -05:00
Dustin 78cd26c827 v-m: Scrape metrics from RabbitMQ 2024-07-26 20:59:00 -05:00
Dustin 8cb292a4b2 v-m: alerts: Add alert for temperatures
After the incident this week with the CPU overheating on _vmhost1_, I
want to make sure I know as soon as possible when anything is starting
to get too hot.
2024-07-11 22:07:27 -05:00
Dustin 8113e5a47f v-m: Fix syntax in AlertManager config
The `group_by` field takes a list of label names, rather than a single
string.
2024-07-06 07:13:27 -05:00
Dustin 952ab9f264 v-m: alertmanager: Group camera notifications
When Frigate is down, multiple alerts are generated for each camera, as
Home Assistant creates camera entities for each tracked object.  This is
extremely annoying, not to mention unnecessary.  To address this, we'll
configure AlertManager to send a single notification for alerts in the
group.
2024-07-05 07:30:30 -05:00
Dustin 9b26753e73 v-m: alerts: Add durations to spammy alerts
Let's avoid sending alerts immediately when something is unavailable,
because the issue might be transient and will resolve itself shortly.
2024-07-05 07:23:38 -05:00
Dustin 248a9a5ae9 v-m: Scrape PostgreSQL exporter
The [postgres exporter][0] exposes metrics about the operation and
performance of a PostgreSQL server.  It's currently deployed on
_db0.pyrocufflink.blue_, the primary server of the main PostgreSQL
cluster.

[0]: https://github.com/prometheus-community/postgres_exporter
2024-07-02 18:16:05 -05:00
Dustin a8ef4c7a80 v-m: Add component labels to configmaps
Adding a `component` label to each ConfigMap will make it possible to
target them specifically, e.g. with `kubectl apply -l`.
2024-07-02 18:16:05 -05:00
Dustin 65e53ad16d v-m: Scrape Zinciti metrics from K8s nodes
All the Kubernetes nodes (except *k8s-ctrl0*) are now running Fedora
CoreOS.  We can therefore use the Kubernetes API to discover scrape
targets for the Zincati job.
2024-07-02 18:16:05 -05:00
Dustin 2d7fec1cdf v-m: vmstorage: Add pod anti-affinity
One of the reasons for moving to 4 `vmstorage` replicas was to ensure
that the load was spread evenly between the physical VM host machines.
To ensure that is the case as much as possible, we need to keep one
pod per Kubernetes node.
2024-06-26 18:29:49 -05:00
Dustin f7f408ca8c v-m: Redo vmstorage persistent volumes
Longhorn does not work well for very large volumes.  It takes ages to
synchronize/rebuild them when migrating between nodes, which happens
all too frequently.  This consumes a lot of resources, which impacts
the operation of the rest of the cluster, and can cause a cascading
failure in some circumstances.

Now that the cluster is set up to be able to mount storage directly from
the Synology, it makes sense to move the Victoria Metrics data there as
well.  Similar to how I did this with Jenkins, I created
PersistentVolume resources that map to iSCSI volumes, and patched the
PersistentVolumeClaims (or rather the template for them defined by the
StatefulSet) to use these.  Each `vmstorage` pod then gets an iSCSI
LUN, bypassing both Longhorn and QEMU to write directly to the NAS.

The migration process was relatively straightforwrad.  I started by
scaling down the `vminsert` Deployment so the `vmagent` pods would
queue the metrics they had collected while the storage layer was down.
Next, I created a [native][0] export of all the time series in the
database.  Then, I deleted the `vmstorage` StatefulSet and its
associated PVCs.  Finally, I applied the updated configuration,
including the new PVs and patched PVCs, and brought the `vminsert`
pods back online.  Once everything was up and running, I re-imported
the exported data.

[0]: https://docs.victoriametrics.com/Single-server-VictoriaMetrics.html#how-to-export-data-in-native-format
2024-06-26 18:29:49 -05:00
Dustin ab458df415 v-m/vmstorage: Start pods in parallel
By default, Kubernetes waits for each pod in a StatefulSet to become
"ready" before starting the next one.  If there is a problem starting
that pod, e.g. data corruption, then the others will never start.  This
sort of defeats the purpose of having multiple replicas.  Fortunately,
we can configure the pod management policy to start all the pods at
once, regardless of the status of any individual pod.  This way, if
there is a problem with the first pod, the others will still come up
and serve whatever data they have.
2024-06-26 18:29:49 -05:00
Dustin 14be633843 v-m: Scrape Restic exporter 2024-06-26 18:29:49 -05:00
Dustin 1c4b32925e v-m: Use dynamic discovery for some collectd nodes
We don't need to explicitly specify every single host individually.
Domain controllers, for example, are registered in DNS with SRV records.
Kubernetes nodes, of course, can be discovered using the Kubernetes API.
Both of these classes of nodes change frequently, so discovering them
dynamically is convenient.
2024-06-26 18:29:49 -05:00
Dustin b8015c0bed v-m: blackbox: Force TCP probe to IPv4
Since I added an IPv6 ULA prefix to the "main" VLAN (to allow
communicating with the Synology directly), the domain controllers now
have AAAA records.  This causes the `sambadc` screpe job to fail because
Blackbox Exporter prefers IPv6 by default, but Kubernetes pods do not
have IPv6 addreses.
2024-06-26 18:29:49 -05:00
Dustin 48f20eac07 v-m: Scrape metrics from fleetlock 2024-05-31 15:18:55 -05:00
Dustin 8939c1d02c v-m/scrape: Scrape unifi2.p.b
*unifi2.pyrocufflink.blue* is a Fedora CoreOS host, so it runs
*collectd*, *Promtail*, and *Zincati*.
2024-05-26 11:48:59 -05:00
Dustin 3b74c3d508 v-m: Scrape metrics from Paperless-ngx Flower 2024-05-22 15:51:07 -05:00
Dustin d5bfdaca25 v-m/alertmanager-ntfy: Add labels to notifications
Just having the alert name and group name in the ntfy notification is
not enough to really indicate what the problem is, as some alerts can
generate notifications for many reasons.  In the email notifications
AlertManager sends by default, the values (but not the keys) of all
labels are included in the subject, so we will reproduce that here.
2024-05-22 15:20:27 -05:00
Dustin d74e26d527 victoria-metrics: Send alerts via ntfy
I don't like having alerts sent by e-mail.  Since I don't get e-mail
notifications on my watch, I often do not see alerts for quite some
time.  They are also much harder to read in an e-mail client (Fastmail
web an K-9 Mail both display them poorly).  I would much rather have
them delivered via _ntfy_, just like all the rest of the ephemeral
notifications I receive.

Fortunately, it is easy enough to integrate Alertmanager and _ntfy_
using the webhook notifier in Alertmanager.  Since _ntfy_ does not
natively support the Alertmanager webhook API, though, a bridge is
necessary to translate from one data format to the other.  There are a
few options for this bridge, but I chose
[alexbakker/alertmanager-ntfy][0] because it looked the most complete
while also having the simplest configuration format.  Sadly, it does not
expose any Prometheus metrics itself, and since it's deployed in the
_victoria-metrics_ namespace, it needs to be explicitly excluded from
the VMAgent scrape configuration.

[0]: https://github.com/alexbakker/alertmanager-ntfy
2024-05-10 10:32:52 -05:00
Dustin ebea31fe55 v-m: alerts: Add alert for camera offline 2024-04-23 09:42:04 -05:00
Dustin 1581a620ef v-m/scrape: Scrape nvr2.p.b
*nvr2.pyrocufflink.blue* has replaced *nvr1.pyrocufflink.blue* as the
Frigate/recording server.
2024-04-10 21:25:26 -05:00
Dustin de72776e73 v-m: Scrape metrics from Authelia
Authelia exposes Prometheus metrics from a different server socket,
which is not enabled by default.
2024-02-27 06:41:52 -06:00
Dustin e0b2b3f5ae v-m: Scrape metrics from Patroni
Patroni, a component of the *postgres poerator*, exports metrics about
the PostgreSQL database servers it manages.  Notably, it provides
information about the current transaction log location for each server.
This allows us to monitor and alert on the health of database replicas.
2024-02-24 08:33:52 -06:00
Dustin 83eeb46c93 v-m: Scrape Argo CD
*Argo CD* exposes metrics about itself and the applications it manages.
Notibly, this can be useful for monitoring application health.
2024-02-22 07:10:01 -06:00
Dustin 465f121e61 v-m: Scrape Promtail
The *promtail* job scrapes metrics from all the hosts running Promtail.
The static targets are Fedora CoreOS nodes that are not part of the
Kubernetes cluster.

The relabeling rules ensure that both the static targets and the
targets discovered via the Kubernetes Node API use the FQDN of the host
as the value of the *instance* label.
2024-02-22 07:10:01 -06:00
Dustin 5e4ab1d988 v-m: Update Loki scrape target
Now that Loki uses Caddy as a reverse proxy, we need to update the
scrape target to point to the correct port (443).
2024-02-22 07:10:01 -06:00
Dustin 4c238a69aa v-m: Scrape Grafana Loki
Grafana Loki is hosted on a VM named *loki0.pyrocufflink.blue*.  It runs
Fedora CoreOS, so in addition to scraping Loki itself, we need to scrape
_collectd_ and _Zincati_ as well.
2024-02-21 09:16:26 -06:00
Dustin 2acefd9a72 v-m: Add alert for sensor battery levels
I did not realize the batteries on the garage door tilt sensors had
died.  Adding alerts for various sensor batteries should help keep me
better informed.
2024-02-16 20:56:38 -06:00
Dustin 1f28a623ae v-m: Do not scrape/alert on Graylog
Graylog is down because Elasticsearch corrupted itself again, and this
time, I'm just not going to bother fixing it.  I practically never use
it anymore anyway, and I want to migrate to Grafana Loki, so now seems
like a good time to just get rid of it.
2024-02-01 21:45:43 -06:00
Dustin 834d0f804f v-m: Scrape Grafana
Grafana exports Prometheus metrics about its own performance.
2024-02-01 09:02:01 -06:00
Dustin 8ae8bad112 v-m: Scrape serial1.p.b 2024-01-25 20:42:07 -06:00
Dustin ad37948fe2 v-m: Scrape all metrics components
We are now getting metrics from *vmstorage*, *vminsert*, *vmselect*,
*vmalert*, *alertmanaer*, and *blackbox-exporter*, in addition to
*vmagent*.
2024-01-23 11:51:50 -06:00
Dustin bcb588407d v-m: Correct vmalert remote read/write URLs
*vmalert* has been generating alerts and triggering notifications, but
not writing any `ALERTS`/`ALERTS_FOR_STATE` metrics.  It turns out this
is because I had not correctly configured the remote read/write
URLs.
2024-01-23 10:45:40 -06:00
Dustin 119a8a74ae v-m: alerts: Enhance Frigate unavailable alert
If Frigate is running but not connected to the MQTT broker, the
`sensor.frigate_status` entity will be available, but the
`update.frigate_server` entity will not.
2024-01-22 18:27:30 -06:00
Dustin 54e7a25f93 v-m: vmstorage: Remove startup/ready probes
Kubernetes will not start additional Pods in a StatefulSet until the
existing ones are Ready.  This means that if there is a problem bringing
up, e.g. `vmstorage-0`, it will never start `vmstorage-1` or
`vmstorage-2`.  Since this pretty much defeats the purpose of having a
multi-node `vmstorage` cluster, we have to remove the readiness probe,
so the Pods will be Ready as soon as they start.  If there is a problem
with one of them, it will matter less, as the others can still run.
2024-01-22 16:43:46 -06:00
Dustin ca02dfec62 v-m: Add host labels to collectd-virt metrics
The *virt* plugin for *collectd* sets `instance` to the name of the
libvirt domain the metric refers to.  This makes it so there is no label
identifying which host the VM is running on.  Thus, if we want to
classify metrics by VM host, we need to add that label explicitly.

Since the `__address__` label is not available during metric relabeling,
we need to store it in a temporary label, which gets dropped at the end
of the relabeling phase.  We copy the value of that label into a new
label, but only for metrics that match the desired metric name.
2024-01-22 11:12:19 -06:00
Dustin 51775ede81 v-m/vmagent: Scrape nut0
*nut0.pyrocufflink.blue* is the new UPS monitor server.  It runs Fedora
CoreOS, with NUT in a container.
2024-01-15 18:46:46 -06:00
Dustin 90b293d5c8 v-m/vmagent: Scrape k8s-amd64-n3 2024-01-15 18:45:52 -06:00
Dustin 278be05121 v-m/blackbox: Switch to upstream container image
I found the official container image for Prometheus Blackbox exporter.
It is hosted on Quay, which is why I didn't see it on Docker Hub when I
looked initially.
2024-01-15 18:45:25 -06:00
Dustin 539e25d9bd v-m/vmagent: Scrape public clouds to test Internet
Scraping the public DNS servers doesn't work anymore since the firewall
routes traffic through Mullvad.  Pinging public cloud providers should
give a pretty decent indication of Internet connectivity.  It will also
serve as a benchmark for the local DNS performance, since the names will
have to be resolved.
2024-01-15 18:44:46 -06:00
Dustin 98cdcdfe30 v-m/scrape: Stable instance label for Longhorn
By default, the `instance` label for discovered metrics targets is set
to the scrape address.  For Kubernetes pods, that is the IP address and
port of the pod, which naturally changes every time the pod is recreated
or moved.  This will cause a high churn rate for Longhorn manager pods.
To avoid this, we set the `instance` label to the name of the node the
pod is running on, which will not change because the Longhorn manager
pods are managed by a DaemonSet.
2024-01-04 09:16:20 -06:00
Dustin bac7de72f2 v-m: Scrape Longhorn manager metrics
Each Longhorn manager pod exports metrics about the node on which it is
running.  Thus, we have to scrape every pod to get the metrics about the
whole ecosystem.
2024-01-02 11:27:31 -06:00
Dustin 225fd8469c v-m/vmagent: Allow listing all pods in cluster
The original RBAC configuration allowed `vmagent` only to list the pods
in the `victoria-metrics` namespace.  In order to allow it to monitor
other applications' pods, it needs to be assigned permission to list
pods in all namespaces.
2024-01-02 11:25:54 -06:00
Dustin 8f088fb6ae v-m: Deploy (clustered) Victoria Metrics
Since *mtrcs0.pyrocufflink.blue* (the Metrics Pi) seems to be dying,
I decided to move monitoring and alerting into Kubernetes.

I was originally planning to have a single, dedicated virtual machine
for Victoria Metrics and Grafana, similar to how the Metrics Pi was set
up, but running Fedora CoreOS instead of a custom Buildroot-based OS.
While I was working on the Ignition configuration for the VM, it
occurred to me that monitoring would be interrupted frequently, since
FCOS updates weekly and all updates require a reboot.  I would rather
not have that many gaps in the data.  Ultimately I decided that
deploying a cluster with Kubernetes would probably be more robust and
reliable, as updates can be performed without any downtime at all.

I chose not to use the Victoria Metrics Operator, but rather handle
the resource definitions myself.  Victoria Metrics components are not
particularly difficult to deploy, so the overhead of running the
operator and using its custom resources would not be worth the minor
convenience it provides.
2024-01-01 17:48:10 -06:00