_bw0.pyrocufflink.blue_ has been decommissioned for some time, so it
doesn't get backed up any more. We want to keep its previous backups
around, though, in case we ever need to restore something. This
triggers the "no recent backups" alert, since the last snapshot is over
a week old. Let's ignore that hostname when generating this alert.
_Firefly III_ and _phpipam_ don't export any Prometheus metrics, so we
have to scrape them via the Blackbox Exporter.
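Scraping through the Blackbox Exporter follows the usual probe-and-relabel pattern; the job name, target URLs, and exporter address below are placeholders:

```yaml
- job_name: blackbox-http
  metrics_path: /probe
  params:
    module: [http_2xx]
  static_configs:
    - targets:
        - https://firefly.example.org
        - https://phpipam.example.org
  relabel_configs:
    # The real target becomes the ?target= parameter and the instance label...
    - source_labels: [__address__]
      target_label: __param_target__
    - source_labels: [__param_target__]
      target_label: instance
    # ...and the scrape itself goes to the Blackbox Exporter.
    - target_label: __address__
      replacement: blackbox-exporter:9115
```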
Paperless-ngx only exposes metrics via Flower, but since it runs in the
same container as the main application, we can assume that if the former
is unavailable, the latter is as well.
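In practice, a single availability alert on the Flower scrape target covers both; `flower` is an assumed job name:

```yaml
- alert: PaperlessDown
  # Flower shares a container with Paperless-ngx, so if its metrics endpoint
  # is unreachable, treat the whole application as down.
  expr: up{job="flower"} == 0
  for: 5m
```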
The `pg_stat_archiver_failed_count` metric is a counter, so once a WAL
archival has failed, it will increase and never return to `0`. To
ensure the alert is resolved once the WAL archival process recovers, we
need to use the `increase` function to turn it into a gauge. Finally,
we aggregate that gauge with `max_over_time` to keep the alert from
flapping if the WAL archive occurs less frequently than the scrape
interval.
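The resulting expression looks roughly like this; the window sizes are illustrative:

```yaml
- alert: WALArchivalFailing
  # increase() converts the monotonic counter into "failures in the last 15m";
  # max_over_time() holds a nonzero result long enough to bridge the gaps
  # between archive attempts so the alert doesn't flap.
  expr: max_over_time(increase(pg_stat_archiver_failed_count[15m])[1h:]) > 0
```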
At some point this week, the front porch camera stopped sending video.
I'm not sure exactly what happened to it, but Frigate kept logging
"Unable to read frames from ffmpeg process." I power-cycled the camera,
which resolved the issue.
Unfortunately, no alerts were generated about this situation. Home
Assistant did not consider the camera entity unavailable, presumably
because Frigate was still reporting stats about it. Thus, I missed
several important notifications. To avoid this in the future, I have
enabled the "Camera FPS" sensors for all of the cameras in Home
Assistant, and added this alert to trigger when the reported framerate
is 0.
I really also need to get alerts for log events configured, as that
would also have indicated there was an issue.
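The FPS alert is roughly the following; how the Home Assistant Prometheus integration names these sensors is an assumption on my part, so treat the metric name and entity pattern as placeholders:

```yaml
- alert: CameraNoVideo
  expr: homeassistant_sensor_state{entity=~"sensor\\..*_camera_fps"} == 0
  for: 5m
```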
We don't need a notification about paperless not scheduling email tasks
every time there is a gap in the metric. This can happen in some
innocuous situations like when the pod restarts or if there is a brief
disruption of service. Using the `absent_over_time` function with a
range vector, we can have the alert fire only if there have been no
email tasks scheduled within the last 12 hours.
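With Flower's event counter, that looks something like this; the task label and its value are assumptions about how Paperless-ngx registers its mail task:

```yaml
- alert: PaperlessMailTasksNotScheduled
  expr: >-
    absent_over_time(
      flower_events_total{task="paperless_mail.tasks.process_mail_accounts"}[12h]
    )
```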
It turns out this alert is not very useful, and indeed quite annoying.
Many servers can go for days or even weeks with no changes, which is
completely normal.
Just like I did with the RAID-1 array in the old BURP server, I will
keep one member active and one in the fireproof safe, swapping them each
month. We can use the same metric queries we used with the BURP server to
alert when it's time to swap them.
Paperless-ngx uses a Celery task to process uploaded files, converting
them to PDF, running OCR, etc. This task can be marked as "failed" for
various reasons, most of which are more about the document itself than
the health of the application. The GUI displays the results of failed
tasks when they occur. It doesn't really make sense to have an alert
about this scenario, especially since there's nothing to do to directly
clear the alert anyway.
Fedora CoreOS fills `/boot` beyond the 75% alert threshold under normal
circumstances on aarch64 machines. This is not a problem, because it
cleans up old files on its own, so we do not need to alert on it.
Unfortunately, the _DiskUsage_ alert is already quite complex, and
adding in exclusions for these devices would make it even worse.
To simplify the logic, we can use a recording rule to precompute the
used/free space ratio. By using `sum(...) without (type)` instead of
`sum(...) by (df, instance)`, we keep the other labels, which we can
then use to identify the metrics coming from machines we don't care to
monitor.
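A sketch of the recording rule, assuming collectd-style filesystem metrics with `df` and `type` labels:

```yaml
- record: df:used_ratio
  # "without (type)" sums used+free+reserved into a total while keeping every
  # other label (instance, df, ...), so alert rules can still filter out
  # filesystems like /boot on the aarch64 machines.
  expr: >-
    collectd_df_df_complex{type="used"}
      / ignoring (type)
    sum without (type) (collectd_df_df_complex)
```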
Instead of having different thresholds for different volumes
encoded in the same expression, we can use multiple alerts to alert on
"low" vs "very low" thresholds. Since this will of course cause
duplicate alerts for most volumes, we can use AlertManager inhibition
rules to disable the "low" alert once the metric crosses the "very low"
threshold.
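The corresponding inhibition rule is short; the alert names and the labels to match on are assumptions about how the two alerts end up defined:

```yaml
inhibit_rules:
  - source_matchers:
      - alertname = "DiskSpaceVeryLow"
    target_matchers:
      - alertname = "DiskSpaceLow"
    # Only suppress the "low" alert for the same filesystem on the same host.
    equal: [instance, df]
```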
The `flower_events_total` metric is a counter, so its value only ever
increases (discounting restarts of the server process). As such,
nonzero values do not necessarily indicate a _current_ problem, but
rather that there was one at some point in the past. To identify
current issues, we need to use the `increase` function, and then apply
the `max_over_time` function so that the alert doesn't immediately reset
itself.
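So the rule ends up with the same shape as the WAL-archiver one above; the event type and window sizes here are illustrative:

```yaml
- alert: CeleryTaskEvents
  expr: >-
    max_over_time(
      increase(flower_events_total{type="task-failed"}[15m])[1h:]
    ) > 0
```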
After the incident this week with the CPU overheating on _vmhost1_, I
want to make sure I know as soon as possible when anything is starting
to get too hot.
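A minimal version of such an alert, using node_exporter's hwmon collector; the threshold is just an example:

```yaml
- alert: TemperatureHigh
  expr: node_hwmon_temp_celsius > 80
  for: 5m
```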
Patroni, a component of the *postgres operator*, exports metrics about
the PostgreSQL database servers it manages. Notably, it provides
information about the current transaction log location for each server.
This allows us to monitor and alert on the health of database replicas.
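A rough sketch of a replica-lag alert built on those metrics; the metric and label names reflect my reading of Patroni's exporter output, and the byte threshold is arbitrary:

```yaml
- alert: PostgresReplicaLagging
  # Compare each replica's replayed WAL position against the primary's
  # current position within the same Patroni scope (cluster).
  expr: >-
    max by (scope) (patroni_xlog_location)
      - on (scope) group_right()
    patroni_xlog_replayed_location
      > 64 * 1024 * 1024
  for: 15m
```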
I did not realize the batteries on the garage door tilt sensors had
died. Adding alerts for various sensor batteries should help keep me
better informed.
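Something along these lines, assuming the Home Assistant Prometheus integration exports battery levels as a percentage metric; the name and threshold are placeholders:

```yaml
- alert: SensorBatteryLow
  expr: homeassistant_sensor_battery_percent < 20
  for: 1h
```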
Graylog is down because Elasticsearch corrupted itself again, and this
time, I'm just not going to bother fixing it. I practically never use
it anymore anyway, and I want to migrate to Grafana Loki, so now seems
like a good time to just get rid of it.
If Frigate is running but not connected to the MQTT broker, the
`sensor.frigate_status` entity will be available, but the
`update.frigate_server` entity will not.
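That difference can be turned into an alert; the expression below assumes the `homeassistant_entity_available` metric from the Prometheus integration:

```yaml
- alert: FrigateMqttDisconnected
  # Frigate itself is up (its status sensor is available) but the MQTT-backed
  # update entity is not, which points at a broken broker connection.
  expr: >-
    homeassistant_entity_available{entity="sensor.frigate_status"} == 1
      and on()
    homeassistant_entity_available{entity="update.frigate_server"} == 0
```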
Since *mtrcs0.pyrocufflink.blue* (the Metrics Pi) seems to be dying,
I decided to move monitoring and alerting into Kubernetes.
I was originally planning to have a single, dedicated virtual machine
for Victoria Metrics and Grafana, similar to how the Metrics Pi was set
up, but running Fedora CoreOS instead of a custom Buildroot-based OS.
While I was working on the Ignition configuration for the VM, it
occurred to me that monitoring would be interrupted frequently, since
FCOS updates weekly and all updates require a reboot. I would rather
not have that many gaps in the data. Ultimately I decided that
deploying a cluster with Kubernetes would probably be more robust and
reliable, as updates can be performed without any downtime at all.
I chose not to use the Victoria Metrics Operator, but rather handle
the resource definitions myself. Victoria Metrics components are not
particularly difficult to deploy, so the overhead of running the
operator and using its custom resources would not be worth the minor
convenience it provides.
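For reference, a hand-maintained manifest for a single-node instance is not much more than this; the names, image tag, and sizing are placeholders, and the real deployment may use the cluster components instead:

```yaml
apiVersion: apps/v1
kind: StatefulSet
metadata:
  name: victoria-metrics
spec:
  serviceName: victoria-metrics
  replicas: 1
  selector:
    matchLabels:
      app: victoria-metrics
  template:
    metadata:
      labels:
        app: victoria-metrics
    spec:
      containers:
        - name: victoria-metrics
          image: victoriametrics/victoria-metrics:v1.93.0
          args:
            - -storageDataPath=/storage
            - -retentionPeriod=12   # months
          ports:
            - name: http
              containerPort: 8428
          volumeMounts:
            - name: storage
              mountPath: /storage
  volumeClaimTemplates:
    - metadata:
        name: storage
      spec:
        accessModes: [ReadWriteOnce]
        resources:
          requests:
            storage: 50Gi
```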