Using domain names in the "blackbox" probe makes it difficult to tell
the difference between a complete Internet outage and DNS issues. I
switched to using these names when I changed how the firewall routed
traffic to the public DNS servers, since those were the IP addresses
I was using to determine if the Internet was "up." I think it makes
sense, though, to just ping the upstream gateway for that check. If
EverFast changes their routing or numbering, we'll just have to update
our checks to match.
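For reference, the gateway check could look roughly like this as a Blackbox ICMP probe; the gateway and exporter addresses below are placeholders, not the real values, and the `icmp` module name assumes one is defined in the exporter's configuration.

```yaml
# Hypothetical sketch: ping the upstream gateway through the Blackbox
# exporter's ICMP module. Addresses are placeholders.
- job_name: blackbox-icmp
  metrics_path: /probe
  params:
    module: [icmp]
  static_configs:
    - targets:
        - 192.0.2.1                       # placeholder for the upstream gateway
  relabel_configs:
    # The probe target becomes the "target" URL parameter...
    - source_labels: [__address__]
      target_label: __param_target
    # ...and the instance label, so results are attributed to the gateway.
    - source_labels: [__param_target]
      target_label: instance
    # The actual scrape goes to the Blackbox exporter itself.
    - target_label: __address__
      replacement: blackbox-exporter:9115 # placeholder exporter address
```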
The VM hosts are now managed by the "main" Ansible inventory and thus
appear in the host list ConfigMap. As such, they do not need to be
listed explicitly in the static targets list.
I'm not using Matrix for anything anymore, and it seems to have gone
offline. I haven't fully decommissioned it yet, but the Blackbox scrape
is failing, so I'll just disable that bit for now.
This machine never worked correctly; the USB-RS232 adapters would stop
working randomly (and of course it would be whenever I needed to
actually use them). I thought it was something wrong with the server
itself (a Raspberry Pi 3), but the same thing happened when I tried
using a Pi 4.
The new backup server has a plethora of on-board RS-232 ports, so I'm
going to use it as the serial console server, too.
I've rebuilt the Unifi Network controller machine (again);
*unifi3.pyrocufflink.blue* has replaced *unifi2.p.b*. The
`unifi_exporter` no longer works with the latest version of Unifi
Network, so it's not deployed on the new machine.
The [postgres exporter][0] exposes metrics about the operation and
performance of a PostgreSQL server. It's currently deployed on
_db0.pyrocufflink.blue_, the primary server of the main PostgreSQL
cluster.
[0]: https://github.com/prometheus-community/postgres_exporter
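For reference, a minimal scrape job for it could look something like this; 9187 is the exporter's default listen port, and the job name is arbitrary.

```yaml
# Sketch: scrape postgres_exporter on the primary database server.
- job_name: postgres
  static_configs:
    - targets:
        - db0.pyrocufflink.blue:9187   # postgres_exporter default port
```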
All the Kubernetes nodes (except *k8s-ctrl0*) are now running Fedora
CoreOS. We can therefore use the Kubernetes API to discover scrape
targets for the Zincati job.
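Roughly, the job could discover the nodes like this; the port is a placeholder for however Zincati's metrics end up exposed over the network on each node, which is site-specific.

```yaml
# Sketch: let the Kubernetes API enumerate the nodes for the Zincati job.
- job_name: zincati
  kubernetes_sd_configs:
    - role: node
  relabel_configs:
    # Scrape each node by name on the (assumed) Zincati metrics port.
    - source_labels: [__meta_kubernetes_node_name]
      target_label: __address__
      replacement: '${1}:9099'   # placeholder port
```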
We don't need to list every host individually.
Domain controllers, for example, are registered in DNS with SRV records.
Kubernetes nodes, of course, can be discovered using the Kubernetes API.
Both of these classes of nodes change frequently, so discovering them
dynamically is convenient.
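As an illustration, the domain controllers could be discovered from their SRV record like this; the record name and the exporter port (collectd's *write_prometheus* default, 9103) are assumptions.

```yaml
# Sketch: discover domain controllers via DNS SRV instead of static targets.
- job_name: collectd-dc
  dns_sd_configs:
    - names:
        - _ldap._tcp.pyrocufflink.blue   # assumed SRV record for the DCs
      type: SRV
  relabel_configs:
    # The SRV record points at the LDAP port, so swap in the exporter's port.
    - source_labels: [__address__]
      regex: '([^:]+):\d+'
      target_label: __address__
      replacement: '${1}:9103'
```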
I don't like having alerts sent by e-mail. Since I don't get e-mail
notifications on my watch, I often do not see alerts for quite some
time. They are also much harder to read in an e-mail client (the Fastmail
web interface and K-9 Mail both display them poorly). I would much rather have
them delivered via _ntfy_, just like all the rest of the ephemeral
notifications I receive.
Fortunately, it is easy enough to integrate Alertmanager and _ntfy_
using the webhook notifier in Alertmanager. Since _ntfy_ does not
natively support the Alertmanager webhook API, though, a bridge is
necessary to translate from one data format to the other. There are a
few options for this bridge, but I chose
[alexbakker/alertmanager-ntfy][0] because it looked the most complete
while also having the simplest configuration format. Sadly, it does not
expose any Prometheus metrics itself, and since it's deployed in the
_victoria-metrics_ namespace, it needs to be explicitly excluded from
the VMAgent scrape configuration.
[0]: https://github.com/alexbakker/alertmanager-ntfy
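On the Alertmanager side, the wiring is just a webhook receiver pointed at the bridge; a rough sketch (the Service name, port, and path here are placeholders for however the bridge is exposed in the cluster):

```yaml
# Sketch: route all notifications to the alertmanager-ntfy bridge.
route:
  receiver: ntfy
receivers:
  - name: ntfy
    webhook_configs:
      - url: http://alertmanager-ntfy.victoria-metrics.svc:8000/hook  # placeholder URL
        send_resolved: true
```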
Patroni, a component of the *postgres operator*, exports metrics about
the PostgreSQL database servers it manages. Notably, it provides
information about the current transaction log location for each server.
This allows us to monitor and alert on the health of database replicas.
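A minimal scrape job for it might look like this; 8008 is Patroni's default REST API port, and the target list is just illustrative (the real configuration may discover the database servers differently).

```yaml
# Sketch: scrape Patroni's REST API metrics endpoint on each database server.
- job_name: patroni
  static_configs:
    - targets:
        - db0.pyrocufflink.blue:8008   # Patroni REST API default port
```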
The *promtail* job scrapes metrics from all the hosts running Promtail.
The static targets are Fedora CoreOS nodes that are not part of the
Kubernetes cluster.
The relabeling rules ensure that both the static targets and the
targets discovered via the Kubernetes Node API use the FQDN of the host
as the value of the *instance* label.
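A fragment showing the idea, assuming the Kubernetes node names are already FQDNs:

```yaml
relabel_configs:
  # Static targets: the address is "fqdn:port", so strip the port off to
  # get the instance label.
  - source_labels: [__address__]
    regex: '([^:]+):\d+'
    target_label: instance
  # Targets discovered via the Kubernetes node API: overwrite that with
  # the node name, which is already the FQDN.
  - source_labels: [__meta_kubernetes_node_name]
    regex: '(.+)'
    target_label: instance
```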
Grafana Loki is hosted on a VM named *loki0.pyrocufflink.blue*. It runs
Fedora CoreOS, so in addition to scraping Loki itself, we need to scrape
_collectd_ and _Zincati_ as well.
Graylog is down because Elasticsearch corrupted itself again, and this
time, I'm just not going to bother fixing it. I practically never use
it anymore anyway, and I want to migrate to Grafana Loki, so now seems
like a good time to just get rid of it.
The *virt* plugin for *collectd* sets `instance` to the name of the
libvirt domain the metric refers to. As a result, there is no label
identifying which host the VM is running on. Thus, if we want to
classify metrics by VM host, we need to add that label explicitly.
Since the `__address__` label is not available during metric relabeling,
we need to store it in a temporary label, which gets dropped at the end
of the relabeling phase. We copy the value of that label into a new
label, but only for metrics that match the desired metric name.
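The relevant pieces of the scrape job could look roughly like this; the label names, metric prefix, and layout are illustrative, not the exact values in use.

```yaml
relabel_configs:
  # Stash the scrape address (the VM host) in an ordinary label so it
  # survives into the metric relabeling phase.
  - source_labels: [__address__]
    regex: '([^:]+)(?::\d+)?'
    target_label: tmp_vmhost
metric_relabel_configs:
  # Copy the stashed value into a permanent label, but only for the
  # libvirt metrics (assumed prefix), which carry the domain name in
  # their instance label.
  - source_labels: [__name__, tmp_vmhost]
    regex: 'collectd_virt_.*;(.+)'
    target_label: vmhost
  # Drop the temporary label from every scraped series.
  - action: labeldrop
    regex: tmp_vmhost
```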
Scraping the public DNS servers doesn't work anymore since the firewall
routes traffic through Mullvad. Pinging public cloud providers should
give a pretty decent indication of Internet connectivity. It will also
serve as a benchmark for the local DNS performance, since the names will
have to be resolved.
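A rule built on these probes might look something like this; the job name and threshold are assumptions, and only an all-targets failure is treated as an outage so a single provider being down doesn't count.

```yaml
# Sketch: treat "every public probe failing" as an Internet outage.
groups:
  - name: internet
    rules:
      - alert: InternetDown
        # max() == 0 means every probe_success series is 0, i.e. no public
        # target is reachable.
        expr: max(probe_success{job="blackbox-icmp"}) == 0
        for: 5m
        labels:
          severity: critical
        annotations:
          summary: No public probe targets are reachable
```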
By default, the `instance` label for discovered metrics targets is set
to the scrape address. For Kubernetes pods, that is the IP address and
port of the pod, which naturally changes every time the pod is recreated
or moved. This would cause a lot of time series churn for the Longhorn manager metrics.
To avoid this, we set the `instance` label to the name of the node the
pod is running on, which will not change because the Longhorn manager
pods are managed by a DaemonSet.
Each Longhorn manager pod exports metrics about the node on which it is
running. Thus, we have to scrape every pod to get the metrics about the
whole ecosystem.
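Putting the two together, the job could look roughly like this; the namespace, pod label, and port assume Longhorn's defaults and may need adjusting.

```yaml
# Sketch: scrape every longhorn-manager pod, but label the series by node.
- job_name: longhorn
  kubernetes_sd_configs:
    - role: pod
      namespaces:
        names: [longhorn-system]
  relabel_configs:
    # Keep only the manager pods.
    - source_labels: [__meta_kubernetes_pod_label_app]
      regex: longhorn-manager
      action: keep
    # Use the node name instead of the pod IP:port so the series don't
    # churn every time a pod is recreated.
    - source_labels: [__meta_kubernetes_pod_node_name]
      target_label: instance
```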
Since *mtrcs0.pyrocufflink.blue* (the Metrics Pi) seems to be dying,
I decided to move monitoring and alerting into Kubernetes.
I was originally planning to have a single, dedicated virtual machine
for Victoria Metrics and Grafana, similar to how the Metrics Pi was set
up, but running Fedora CoreOS instead of a custom Buildroot-based OS.
While I was working on the Ignition configuration for the VM, it
occurred to me that monitoring would be interrupted frequently, since
FCOS updates weekly and all updates require a reboot. I would rather
not have that many gaps in the data. Ultimately I decided that
deploying a cluster with Kubernetes would probably be more robust and
reliable, as updates can be performed without any downtime at all.
I chose not to use the Victoria Metrics Operator, but rather handle
the resource definitions myself. Victoria Metrics components are not
particularly difficult to deploy, so the overhead of running the
operator and using its custom resources would not be worth the minor
convenience it provides.