From dc835ddc9df807a66571b7cf7b270a34f903bba4 Mon Sep 17 00:00:00 2001
From: "Dustin C. Hatch"
Date: Wed, 5 Feb 2025 10:42:35 -0600
Subject: [PATCH] v-m/alerts: Fix PostgreSQL WAL archive failed alert

The `pg_stat_archiver_failed_count` metric is a counter, so once a WAL
archival has failed, it will increase and never return to `0`.  To
ensure the alert is resolved once the WAL archival process recovers, we
need to use the `increase` function to turn it into a gauge.  Finally,
we aggregate that gauge with `max_over_time` to keep the alert from
flapping if the WAL archive occurs less frequently than the scrape
interval.
---
 victoria-metrics/alerts.yml | 4 +++-
 1 file changed, 3 insertions(+), 1 deletion(-)

diff --git a/victoria-metrics/alerts.yml b/victoria-metrics/alerts.yml
index 3d3909c..ebc8a28 100644
--- a/victoria-metrics/alerts.yml
+++ b/victoria-metrics/alerts.yml
@@ -185,7 +185,9 @@ groups:
     for: 10m
   - alert: WAL archive process failed
     expr: >-
-      pg_stat_archiver_failed_count > 0
+      max_over_time(
+        increase(pg_stat_archiver_failed_count)[20m]
+      )> 0
     annotations:
       summary: The archiver process failed for one or more WAL segments
       description: >-