1
0
Fork 0

v-m/alerts: Fix PostgreSQL WAL archive failed alert

The `pg_stat_archiver_failed_count` metric is a counter, so once a WAL
archival has failed, it will increase and never return to `0`.  To
ensure the alert is resolved once the WAL archival process recovers, we
need to use the `increase` function to turn it into a gauge.  Finally,
we aggregate that gauge with `max_over_time` to keep the alert from
flapping if the WAL archive occurs less frequently than the scrape
interval.
pull/50/head
Dustin 2025-02-05 10:42:35 -06:00
parent f637feba16
commit dc835ddc9d
1 changed files with 3 additions and 1 deletions

View File

@ -185,7 +185,9 @@ groups:
for: 10m for: 10m
- alert: WAL archive process failed - alert: WAL archive process failed
expr: >- expr: >-
pg_stat_archiver_failed_count > 0 max_over_time(
increase(pg_stat_archiver_failed_count)[20m]
)> 0
annotations: annotations:
summary: The archiver process failed for one or more WAL segments summary: The archiver process failed for one or more WAL segments
description: >- description: >-