54 Commits (28d6bdc3a9e7aabb811c2019fdca79560a7cd6c3)

Author SHA1 Message Date
Dustin 38ee60e099 v-m: Add alerts for Firefly, Paperless, phpipam
_Firefly III_ and _phpipam_ don't export any Prometheus metrics, so we
have to scrape them via the Blackbox Exporter.

Paperless-ngx only exposes metrics via Flower, but since it runs in the
same container as the main application, we can assume that if the former
is unavailable, the latter is as well.
2025-07-27 17:39:28 -05:00
Dustin 093e909475 v-m/scrape: Scrape Victoria Logs 2025-07-06 15:20:16 -05:00
Dustin cc83a5115a v-m/scrape: Scrape MinIO metrics 2025-07-02 10:29:53 -05:00
Dustin fdb4bdb23d Merge branch 'unifi' 2025-06-21 14:00:38 -05:00
Dustin 75edfb74cb v-m/scrape: Increase timeout for k8s job
Scraping metrics from the Kubernetes API server has started taking 20+
seconds recently.  Until I figure out the underlying cause, I'm
increasing the scrape timeout so that the _vmagent_ doesn't give up and
report the API server as "down."
2025-06-21 13:55:23 -05:00
Dustin 52094da8fd v-m/scrape: Remove unifi3, Zincati
*unifi3.pyrocufflink.blue* has been replaced by
*unifi-nuptials.host.pyrocufflink.black*.  The former was the last
Fedora CoreOS machine in use, so the entire Zincati scrape job is no
longer needed.
2025-03-29 08:10:50 -05:00
Dustin 6da330f2be v-m/scrape: Remove k8s SD config for Zincati
There are no more Kubernetes nodes running Fedora CoreOS.
2025-02-01 18:16:10 -06:00
Dustin 11a0f84db7 v-m/scrape: Remove websites job
Websites are being scraped by the `vmagent` on the OVH VPS.
2025-02-01 18:16:10 -06:00
Dustin 6e15b11f73 Merge branch 'fix-nextcloud-alert' 2024-12-21 11:58:41 -06:00
Dustin e0c633c21e v-m: scrape: Fix Nextcloud URL
Nextcloud uses a _client-side_ (JavaScript) redirect to navigate the
browser to its `index.php`.  The page it serves with this redirect is
static and will often load successfully, even if there is a problem with
the application.  This causes the Blackbox exporter to record the site
as "up," even when it definitely is not.  To avoid this, we can
scrape the `index.php` page explicitly, ensuring that the application is
loaded.
2024-11-17 18:43:00 +00:00
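As a rough illustration of the Nextcloud fix above (e0c633c21e), the sketch below builds a Blackbox Exporter probe URL that targets `index.php` explicitly rather than the site root. The exporter address, the `http_2xx` module name, and the Nextcloud host name are all assumptions for the example, not values taken from this repository.

```python
from urllib.parse import urlencode, urlparse, parse_qs

# Hypothetical addresses; the real scrape configuration is not shown here.
BLACKBOX = "http://blackbox-exporter:9115/probe"
NEXTCLOUD = "https://nextcloud.example.org"


def probe_url(site: str, path: str = "/index.php", module: str = "http_2xx") -> str:
    """Build a probe URL that checks the application page, not the static redirect page."""
    query = urlencode({"module": module, "target": site.rstrip("/") + path})
    return f"{BLACKBOX}?{query}"


url = probe_url(NEXTCLOUD)
```

Probing the root URL instead would only exercise the static page that performs the client-side redirect, which is the failure mode the commit describes.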
Dustin 0209f921c3 v-m: Remove nut0 from scrape targets
_nut0.pyrocufflink.blue_ is decommissioned.
2024-11-12 08:02:00 -06:00
Dustin 2380468658 v-m/scrape: Collect Jellyfin metrics 2024-11-04 20:38:25 -06:00
Dustin db7c07ee55 v-m/scrape: Ignore cloud Kubernetes nodes
The ephemeral Jenkins worker nodes that run in AWS don't have collectd,
promtail, or Zincati.  We don't need to get three alerts every time a
worker starts up to handle an ARM build job, so we drop these discovered
targets for these scrape jobs.
2024-11-04 20:35:17 -06:00
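The target filtering in the commit above (db7c07ee55) would normally be expressed as a relabeling rule in the scrape configuration. The Python sketch below only models the effect: discovered targets carrying a "cloud" marker are dropped before scraping. The `node_kind` label and the worker's name are hypothetical; the commit does not say which label identifies the AWS workers.

```python
def keep_target(labels: dict[str, str]) -> bool:
    """Return False for discovered targets that should not be scraped."""
    # Hypothetical label; the real configuration may key on something else.
    return labels.get("node_kind") != "cloud"


discovered = [
    {"instance": "k8s-amd64-n3", "node_kind": "onprem"},
    {"instance": "jenkins-worker-a", "node_kind": "cloud"},  # made-up name
]

# Only targets that pass the filter are scraped, so no "down" alerts
# fire for short-lived workers that never ran the exporters.
targets = [t for t in discovered if keep_target(t)]
```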
Dustin 6cf11f9f61 v-m: Scrape HAProxy 2024-11-01 18:14:37 -05:00
Dustin 7a768cbb76 v-m: Update jobs for new Loki server
*loki1.pyrocufflink.blue* is a regular Fedora machine, a member of the
AD domain, and managed by Ansible.  Thus, it does not need to be
explicitly listed as a scrape target.

For scraping metrics from Loki itself, I've changed the job to use
DNS-SD because it seems like `vmagent` does _not_ re-resolve host names
from static configuration.
2024-11-01 18:07:34 -05:00
Dustin d12e66f58a v-m: Scrape Frigate exporter 2024-11-01 17:47:51 -05:00
Dustin ea89e0cde4 v-m/scrape: Remove synapse job
The Synapse server is now completely decommissioned.
2024-10-17 06:50:27 -05:00
Dustin ffa47b9fba v-m: Scrape ntfy
_ntfy_ has supported Prometheus metrics for a while now, so let's
collect them.
2024-09-22 12:13:01 -05:00
Dustin 9ec6b651c1 v-m: Scrape wal-g via statsd_exporter
The database server now runs _statsd_exporter_, which receives metrics
from WAL-G whenever it saves WAL segments or creates backups.
2024-09-22 12:11:59 -05:00
Dustin c83ceee994 v-m: Quit scraping Jenkins with blackbox_exporter
I was doing this to monitor Jenkins's certificate, but since that's
managed by _cert-manager_, there's practically no risk of it
expiring without warning anymore.  Since Jenkins is already being
scraped directly, having this extra check just generates extra
notifications when there is an issue without adding any real value.
2024-09-22 12:10:03 -05:00
Dustin 3f39747557 v-m: Redo Internet/DNS connectivity checks (again)
Using domain names in the "blackbox" probe makes it difficult to tell
the difference between a complete Internet outage and DNS issues.  I
switched to using these names when I changed how the firewall routed
traffic to the public DNS servers, since those were the IP addresses
I was using to determine if the Internet was "up."  I think it makes
sense, though, to just ping the upstream gateway for that check.  If
EverFast changes their routing or numbering, we'll just have to update
our checks to match.
2024-09-22 12:06:03 -05:00
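The reasoning in the commit above (3f39747557) can be summarized as a small decision function: a check against the upstream gateway answers "is the Internet reachable?", and a separate name-resolution check answers "is DNS working?". This is only a sketch of that logic with made-up state names, not the alert rules from this repository.

```python
def classify(gateway_reachable: bool, dns_resolves: bool) -> str:
    """Tell a full Internet outage apart from a DNS-only problem."""
    if not gateway_reachable:
        # Without the upstream gateway nothing else can work, so a
        # failing DNS check carries no extra information here.
        return "internet-down"
    if not dns_resolves:
        return "dns-problem"
    return "ok"
```

With a single probe that targets a domain name, the first two cases are indistinguishable, which is the problem the commit sets out to fix.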
Dustin f182479d34 v-m: Remove BURP metrics, alerts
BURP is officially decommissioned, replaced by Restic.
2024-09-05 20:16:01 -05:00
Dustin 78afee9abc v-m/scrape: Remove static VM hosts from collectd
The VM hosts are now managed by the "main" Ansible inventory and thus
appear in the host list ConfigMap.  As such, they do not need to be
listed explicitly in the static targets list.
2024-08-23 09:28:05 -05:00
Dustin 02001f61db v-m/scrape: websites: Stop scraping Matrix
I'm not using Matrix for anything anymore, and it seems to have gone
offline.  I haven't fully decommissioned it yet, but the Blackbox scrape
is failing, so I'll just disable that bit for now.
2024-08-17 10:57:22 -05:00
Dustin c7e4baa466 v-m: scrape: Remove nvr2.p.b Zincati scrape target
I've redeployed *nvr2.pyrocufflink.blue* as Fedora Linux, so it does not
run Zincati anymore.
2024-08-17 10:56:06 -05:00
Dustin 1a631bf366 v-m: scrape: Remove serial1.p.b
This machine never worked correctly; the USB-RS232 adapters would stop
working randomly (and of course it would be whenever I needed to
actually use them).  I thought it was something wrong with the server
itself (a Raspberry Pi 3), but the same thing happened when I tried
using a Pi 4.

The new backup server has a plethora of on-board RS-232 ports, so I'm
going to use it as the serial console server, too.
2024-08-17 10:54:21 -05:00
Dustin 6f7f09de85 v-m: scrape: Update Unifi server target
I've rebuilt the Unifi Network controller machine (again);
*unifi3.pyrocufflink.blue* has replaced *unifi2.p.b*.  The
`unifi_exporter` no longer works with the latest version of Unifi
Network, so it's not deployed on the new machine.
2024-08-17 10:52:51 -05:00
Dustin 78cd26c827 v-m: Scrape metrics from RabbitMQ 2024-07-26 20:59:00 -05:00
Dustin 248a9a5ae9 v-m: Scrape PostgreSQL exporter
The [postgres exporter][0] exposes metrics about the operation and
performance of a PostgreSQL server.  It's currently deployed on
_db0.pyrocufflink.blue_, the primary server of the main PostgreSQL
cluster.

[0]: https://github.com/prometheus-community/postgres_exporter
2024-07-02 18:16:05 -05:00
Dustin 65e53ad16d v-m: Scrape Zincati metrics from K8s nodes
All the Kubernetes nodes (except *k8s-ctrl0*) are now running Fedora
CoreOS.  We can therefore use the Kubernetes API to discover scrape
targets for the Zincati job.
2024-07-02 18:16:05 -05:00
Dustin 14be633843 v-m: Scrape Restic exporter 2024-06-26 18:29:49 -05:00
Dustin 1c4b32925e v-m: Use dynamic discovery for some collectd nodes
We don't need to explicitly specify every single host individually.
Domain controllers, for example, are registered in DNS with SRV records.
Kubernetes nodes, of course, can be discovered using the Kubernetes API.
Both of these classes of nodes change frequently, so discovering them
dynamically is convenient.
2024-06-26 18:29:49 -05:00
Dustin 48f20eac07 v-m: Scrape metrics from fleetlock 2024-05-31 15:18:55 -05:00
Dustin 8939c1d02c v-m/scrape: Scrape unifi2.p.b
*unifi2.pyrocufflink.blue* is a Fedora CoreOS host, so it runs
*collectd*, *Promtail*, and *Zincati*.
2024-05-26 11:48:59 -05:00
Dustin 3b74c3d508 v-m: Scrape metrics from Paperless-ngx Flower 2024-05-22 15:51:07 -05:00
Dustin d74e26d527 victoria-metrics: Send alerts via ntfy
I don't like having alerts sent by e-mail.  Since I don't get e-mail
notifications on my watch, I often do not see alerts for quite some
time.  They are also much harder to read in an e-mail client (Fastmail
web and K-9 Mail both display them poorly).  I would much rather have
them delivered via _ntfy_, just like all the rest of the ephemeral
notifications I receive.

Fortunately, it is easy enough to integrate Alertmanager and _ntfy_
using the webhook notifier in Alertmanager.  Since _ntfy_ does not
natively support the Alertmanager webhook API, though, a bridge is
necessary to translate from one data format to the other.  There are a
few options for this bridge, but I chose
[alexbakker/alertmanager-ntfy][0] because it looked the most complete
while also having the simplest configuration format.  Sadly, it does not
expose any Prometheus metrics itself, and since it's deployed in the
_victoria-metrics_ namespace, it needs to be explicitly excluded from
the VMAgent scrape configuration.

[0]: https://github.com/alexbakker/alertmanager-ntfy
2024-05-10 10:32:52 -05:00
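To make the "translate from one data format to the other" step in the commit above (d74e26d527) concrete, here is a minimal sketch of such a translation. It is not how alexbakker/alertmanager-ntfy is implemented. It assumes the standard Alertmanager webhook payload (an `alerts` list whose items have `status`, `labels`, and `annotations`) and ntfy's JSON publishing format (`topic`, `title`, `message`, `priority`, `tags`). The topic name and the priority and tag mapping are arbitrary choices for the example.

```python
def to_ntfy(payload: dict, topic: str) -> list[dict]:
    """Convert one Alertmanager webhook payload into ntfy JSON messages."""
    messages = []
    for alert in payload.get("alerts", []):
        labels = alert.get("labels", {})
        annotations = alert.get("annotations", {})
        firing = alert.get("status") == "firing"
        messages.append({
            "topic": topic,
            "title": labels.get("alertname", "alert"),
            "message": annotations.get("summary", ""),
            # Arbitrary mapping: firing alerts are "high", resolved are "low".
            "priority": 4 if firing else 2,
            "tags": ["warning" if firing else "white_check_mark"],
        })
    return messages


example = {
    "status": "firing",
    "alerts": [
        {
            "status": "firing",
            "labels": {"alertname": "InstanceDown"},
            "annotations": {"summary": "target is unreachable"},
        }
    ],
}
out = to_ntfy(example, topic="alerts")  # "alerts" is a hypothetical topic
```

Each resulting dictionary could then be sent to the ntfy server as a JSON body. Authentication and delivery are left out of the sketch.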
Dustin 1581a620ef v-m/scrape: Scrape nvr2.p.b
*nvr2.pyrocufflink.blue* has replaced *nvr1.pyrocufflink.blue* as the
Frigate/recording server.
2024-04-10 21:25:26 -05:00
Dustin de72776e73 v-m: Scrape metrics from Authelia
Authelia exposes Prometheus metrics from a different server socket,
which is not enabled by default.
2024-02-27 06:41:52 -06:00
Dustin e0b2b3f5ae v-m: Scrape metrics from Patroni
Patroni, a component of the *postgres operator*, exports metrics about
the PostgreSQL database servers it manages.  Notably, it provides
information about the current transaction log location for each server.
This allows us to monitor and alert on the health of database replicas.
2024-02-24 08:33:52 -06:00
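The Patroni commit above (e0b2b3f5ae) mentions using transaction log locations to judge replica health. A sketch of the arithmetic follows, assuming locations in PostgreSQL's textual `pg_lsn` form (two hexadecimal numbers separated by a slash, together forming a 64-bit byte position). Patroni's metrics may already report the position as a plain number, in which case only the subtraction applies. The function names are illustrative.

```python
def lsn_to_int(lsn: str) -> int:
    """Convert a textual LSN such as '0/3000060' to a byte position."""
    high, low = lsn.split("/")
    return (int(high, 16) << 32) | int(low, 16)


def replication_lag_bytes(primary_lsn: str, replica_lsn: str) -> int:
    """Bytes of WAL the replica still has to receive or replay."""
    # Clamp at zero in case the two samples were taken at slightly
    # different times and the replica appears to be "ahead".
    return max(0, lsn_to_int(primary_lsn) - lsn_to_int(replica_lsn))
```

An alert rule could then fire when this difference stays above some threshold for a sustained period. The threshold is a policy choice that the commit does not specify.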
Dustin 83eeb46c93 v-m: Scrape Argo CD
*Argo CD* exposes metrics about itself and the applications it manages.
Notably, this can be useful for monitoring application health.
2024-02-22 07:10:01 -06:00
Dustin 465f121e61 v-m: Scrape Promtail
The *promtail* job scrapes metrics from all the hosts running Promtail.
The static targets are Fedora CoreOS nodes that are not part of the
Kubernetes cluster.

The relabeling rules ensure that both the static targets and the
targets discovered via the Kubernetes Node API use the FQDN of the host
as the value of the *instance* label.
2024-02-22 07:10:01 -06:00
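The effect of the relabeling rules in the Promtail commit above (465f121e61) is that every target ends up with its FQDN as the *instance* label, whether it was listed statically or discovered. The sketch below models that normalization for targets given as `host:port`. The port number is an assumption, and IPv6 literals are not handled.

```python
def instance_label(address: str) -> str:
    """Derive the instance label (the bare FQDN) from a scrape address."""
    host, _, _port = address.rpartition(":")
    # rpartition() leaves host empty when there is no ':' at all, so
    # fall back to the address itself in that case.
    return host or address


# Port 9080 is assumed here for illustration only.
static = instance_label("loki0.pyrocufflink.blue:9080")
bare = instance_label("loki0.pyrocufflink.blue")
```

Both calls return the same value, so series from either discovery path share one `instance` value per host.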
Dustin 5e4ab1d988 v-m: Update Loki scrape target
Now that Loki uses Caddy as a reverse proxy, we need to update the
scrape target to point to the correct port (443).
2024-02-22 07:10:01 -06:00
Dustin 4c238a69aa v-m: Scrape Grafana Loki
Grafana Loki is hosted on a VM named *loki0.pyrocufflink.blue*.  It runs
Fedora CoreOS, so in addition to scraping Loki itself, we need to scrape
_collectd_ and _Zincati_ as well.
2024-02-21 09:16:26 -06:00
Dustin 1f28a623ae v-m: Do not scrape/alert on Graylog
Graylog is down because Elasticsearch corrupted itself again, and this
time, I'm just not going to bother fixing it.  I practically never use
it anymore anyway, and I want to migrate to Grafana Loki, so now seems
like a good time to just get rid of it.
2024-02-01 21:45:43 -06:00
Dustin 834d0f804f v-m: Scrape Grafana
Grafana exports Prometheus metrics about its own performance.
2024-02-01 09:02:01 -06:00
Dustin 8ae8bad112 v-m: Scrape serial1.p.b 2024-01-25 20:42:07 -06:00
Dustin ad37948fe2 v-m: Scrape all metrics components
We are now getting metrics from *vmstorage*, *vminsert*, *vmselect*,
*vmalert*, *alertmanager*, and *blackbox-exporter*, in addition to
*vmagent*.
2024-01-23 11:51:50 -06:00
Dustin ca02dfec62 v-m: Add host labels to collectd-virt metrics
The *virt* plugin for *collectd* sets `instance` to the name of the
libvirt domain the metric refers to.  As a result, there is no label
identifying which host the VM is running on.  Thus, if we want to
classify metrics by VM host, we need to add that label explicitly.

Since the `__address__` label is not available during metric relabeling,
we need to store it in a temporary label, which gets dropped at the end
of the relabeling phase.  We copy the value of that label into a new
label, but only for metrics that match the desired metric name.
2024-01-22 11:12:19 -06:00
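The two-phase trick described in the collectd-virt commit above (ca02dfec62) can be modeled in a few lines of Python. This is a simulation of the idea, not the actual relabeling configuration. The temporary label name `tmp_address`, the resulting `host` label, the metric name pattern, and the example addresses are all hypothetical.

```python
import re

TEMP_LABEL = "tmp_address"                          # hypothetical name
VIRT_METRICS = re.compile(r"collectd_virt_.*")      # hypothetical pattern


def target_relabel(target: dict[str, str]) -> dict[str, str]:
    """Target phase: __address__ is still available, so stash a copy."""
    labels = dict(target)
    labels[TEMP_LABEL] = labels["__address__"]
    return labels


def metric_relabel(name: str, labels: dict[str, str]) -> dict[str, str]:
    """Metric phase: copy the stash into 'host' for matching metrics only."""
    out = dict(labels)
    if VIRT_METRICS.fullmatch(name):
        out["host"] = out[TEMP_LABEL].rsplit(":", 1)[0]  # strip the port
    out.pop(TEMP_LABEL, None)  # the temporary label never reaches storage
    return out


carried = {
    "instance": "some-vm",  # set by the virt plugin to the libvirt domain
    **{k: v for k, v in target_relabel({"__address__": "vmhost0.example:9103"}).items()
       if not k.startswith("__")},
}
virt = metric_relabel("collectd_virt_memory", carried)
other = metric_relabel("collectd_load", carried)
```

Here `virt` gains a `host` label while `other` does not, and neither keeps the temporary label.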
Dustin 51775ede81 v-m/vmagent: Scrape nut0
*nut0.pyrocufflink.blue* is the new UPS monitor server.  It runs Fedora
CoreOS, with NUT in a container.
2024-01-15 18:46:46 -06:00
Dustin 90b293d5c8 v-m/vmagent: Scrape k8s-amd64-n3 2024-01-15 18:45:52 -06:00