Compare commits

11 Commits

Author SHA1 Message Date
bot b956e9ac05 authelia: Update to 4.38.17 2024-11-09 12:32:16 +00:00
Dustin 3d40424cf7 fleetlock: Use patched server from GitHub PR
The _fleetlock_ server drains all pods from a node before allocating the
reboot lock to that node.  Unfortunately, it doesn't actually wait for
those pods to be completely evicted.  If some pods take too long to shut
down, they may get stuck in `Terminating` state once the machine starts
rebooting.  Those pods then cannot be replaced on another node while
the original node is offline, which pretty much defeats the purpose of
using Fleetlock in the first place.

It seems upstream has abandoned this project, as there is an open [Pull
Request][0] to fix this issue that has so far been ignored.
Fortunately, building a new container image containing the patch is easy
enough, so we can run our own patched build.

[0]: https://github.com/poseidon/fleetlock/pull/271
2024-11-05 07:05:55 -06:00
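
Running the patched build only requires rewriting the image reference; the
fleetlock kustomization diff further down does this with a kustomize image
override, roughly:

    # Fragment of the fleetlock kustomization.yaml: swap the upstream image for
    # a locally built one containing the wait-for-evictions patch from the PR.
    images:
    - name: quay.io/poseidon/fleetlock
      newName: git.pyrocufflink.net/containerimages/fleetlock
      newTag: vadimberezniker-wait_evictions
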
Dustin ac62a77c96 Merge branch '20125' 2024-11-05 07:05:19 -06:00
Dustin e1d9833e83 cert-manager: Add cert for apps.du5t1n.xyz 2024-11-05 07:04:27 -06:00
Dustin 4ad5518f18 cert-manager: Migrate config to configMapGenerator 2024-11-05 07:04:09 -06:00
Dustin 9f287d0f71 v-m/alerts: Add alerts for backup RAID array
Just like I did with the RAID-1 array in the old BURP server, I will
keep one member active and one in the fireproof safe, swapping them each
month.  We can use the same metric queries we used with the BURP server
to alert when it is time to swap them.
2024-11-04 20:46:03 -06:00
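
The underlying pattern is "alert when a metric has not changed in roughly a
month."  A minimal sketch of such a rule, using the MetricsQL
`tlast_change_over_time` extension (the full expressions added in this change
are in the alerts.yml diff below and also handle gaps in the series):

    groups:
    - name: Backups
      rules:
      - alert: disks need swapped
        # Timestamp of the last change in the number of active array members;
        # fire once that change is more than ~30 days old.
        expr: >-
          time() - tlast_change_over_time(
            collectd_md_md_disks{instance="chromie.pyrocufflink.blue", type="active"}[90d]
          ) > 86400 * 30
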
Dustin 2380468658 v-m/scrape: Collect Jellyfin metrics 2024-11-04 20:38:25 -06:00
Dustin db7c07ee55 v-m/scrape: Ignore cloud Kubernetes nodes
The ephemeral Jenkins worker nodes that run in AWS don't have collectd,
promtail, or Zincati.  We don't need to get three alerts every time a
worker starts up to handle an ARM build job, so we drop the discovered
targets for those scrape jobs.
2024-11-04 20:35:17 -06:00
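
Concretely, this is a relabel rule that drops discovered node targets whose
names match the EC2-internal naming pattern; the same drop rule is added to
each affected scrape job in the scrape configuration diff below.  A minimal
sketch (the job name here is only an example):

    scrape_configs:
    - job_name: collectd            # example job; the real config has several
      kubernetes_sd_configs:
      - role: node
      relabel_configs:
      # AWS worker nodes are named like ip-10-0-0-1.us-east-2.compute.internal,
      # so dropping on this suffix removes only the ephemeral cloud workers.
      - source_labels: [__meta_kubernetes_node_name]
        regex: .*\.compute\.internal$
        action: drop
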
Dustin d76a1360c8 v-m/alerts: Ignore Paperless consume_file task
Paperless-ngx uses a Celery task to process uploaded files, converting
them to PDF, running OCR, etc.  This task can be marked as "failed" for
various reasons, most of which are more about the document itself than
the health of the application.  The GUI displays the results of failed
tasks when they occur.  It doesn't really make sense to have an alert
for this scenario, especially since there is no direct action that would
clear the alert anyway.
2024-11-04 20:28:11 -06:00
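
The change is just a label filter that excludes this task from the
failure-count expression; in outline (the group and alert names here are
placeholders, and the `increase(...)[24h]` form relies on MetricsQL's implicit
lookbehind window):

    groups:
    - name: Paperless-ngx               # placeholder group name
      rules:
      - alert: Paperless task failed    # placeholder alert name
        expr: >-
          max_over_time(
            increase(
              flower_events_total{
                job="paperless-ngx",
                type="task-failed",
                task!="documents.tasks.consume_file"
              }
            )[24h]
          ) > 0
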
Dustin 71b52e4c6f 20125: Deploy Status server
https://20125.home/ is the URL the Status Android application loads in
its main WebView.  This site is powered by a server that generates a
custom page showing the status of our self-hosted applications, based on
alerts retrieved from the AlertManager API.

Android WebView does not allow cleartext HTTP connections.  It does,
however, allow connecting to an HTTPS server and ignoring the certificate
it presents, which is effectively the same thing.  Thus, we generate a
self-signed certificate for the Ingress for this site.
2024-11-02 19:51:53 -05:00
Dustin 8ecee4133f v-m/alerts: Rework free disk space alert
Fedora CoreOS fills `/boot` beyond the 75% alert threshold under normal
circumstances on aarch64 machines.  This is not a problem, because it
cleans up old files on its own, so we do not need to alert on it.
Unfortunately, the _DiskUsage_ alert is already quite complex, and
adding in exclusions for these devices would make it even worse.

To simplify the logic, we can use a recording rule to precompute the
filesystem usage ratio.  By using `sum(...) without (type)` instead of
`sum(...) by (instance, df)`, we keep the other labels, which we can
then use to identify the metrics coming from machines we don't care to
monitor.

Instead of encoding different thresholds for different volumes in the
same expression, we can use separate alerts for the "low" and "very low"
thresholds.  Since this will of course cause
duplicate alerts for most volumes, we can use AlertManager inhibition
rules to disable the "low" alert once the metric crosses the "very low"
threshold.
2024-11-02 09:38:02 -05:00
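
The inhibition half of this works as follows: while a "very low" alert is
firing, AlertManager suppresses the corresponding "low" alert, but only where
the `equal` labels match, i.e. for the same host and filesystem.  The rule
added to the AlertManager configuration below has this shape:

    inhibit_rules:
    # While the source alert fires, matching target alerts are silenced,
    # but only for series with the same instance and df label values.
    - source_matchers:
      - alertname=Free disk space is very low
      target_matchers:
      - alertname=Free disk space is low
      equal:
      - instance
      - df
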
17 changed files with 409 additions and 51 deletions

20125/config.yml Normal file (+79)

@ -0,0 +1,79 @@
alertmanager:
url: http://alertmanager.victoria-metrics:9093
system_wide:
alerts:
- alertgroup: Active Directory
- alertgroup: Longhorn
- alertgroup: PostgreSQL
- alertgroup: Restic
- alertgroup: Temperature
- job: authelia
- job: blackbox
- job: dns_pyrocufflink
- job: dns_recursive
- job: kubelet
- job: kubernetes
- instance: db0.pyrocufflink.blue
- instance: gw1.pyrocufflink.blue
- instance: vmhost0.pyrocufflink.blue
- instance: vmhost1.pyrocufflink.blue
applications:
- name: Home Assistant
url: https://homeassistant.pyrocufflink.blue/
icon:
url: icons/home-assistant.svg
alerts:
- alertgroup: Home Assistant
- alertgroup: Frigate
- job: homeassistant
- instance: homeassistant.pyrocufflink.blue
- name: Nextcloud
url: &url https://nextcloud.pyrocufflink.net/
icon:
url: icons/nextcloud.png
alerts:
- instance: *url
- instance: cloud0.pyrocufflink.blue
- name: Invoice Ninja
url: &url https://invoiceninja.pyrocufflink.net/
icon:
url: icons/invoiceninja.svg
class: light-bg
alerts:
- instance: *url
- name: Jellyfin
url: &url https://jellyfin.pyrocufflink.net/
icon:
url: icons/jellyfin.svg
alerts:
- instance: *url
- name: Vaultwarden
url: &url https://bitwarden.pyrocufflink.net/
icon:
url: icons/vaultwarden.svg
class: light-bg
alerts:
- instance: *url
- alertgroup: Bitwarden
- name: Paperless-ngx
url: &url https://paperless.pyrocufflink.blue/
icon:
url: icons/paperless-ngx.svg
alerts:
- instance: *url
- alertgroup: Paperless-ngx
- job: paperless-ngx
- name: Firefly III
url: &url https://firefly.pyrocufflink.blue/
icon:
url: icons/firefly-iii.svg
alerts:
- instance: *url

20125/ingress.yaml Normal file (+25)

@ -0,0 +1,25 @@
apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
annotations:
nginx.ingress.kubernetes.io/force-ssl-redirect: "true"
cert-manager.io/issuer: status-server-ca
labels: &labels
app.kubernetes.io/name: status-server
name: status-server
spec:
tls:
- hosts:
- 20125.home
secretName: status-server-cert
rules:
- host: 20125.home
http:
paths:
- backend:
service:
name: status-server
port:
number: 80
path: /
pathType: Prefix

20125/kustomization.yaml Normal file (+26)

@ -0,0 +1,26 @@
apiVersion: kustomize.config.k8s.io/v1beta1
kind: Kustomization
namespace: '20125'
labels:
- pairs:
app.kubernetes.io/instance: '20125'
app.kubernetes.io/part-of: '20125'
includeSelectors: true
resources:
- namespace.yaml
- secrets.yaml
- status-server-ca.yaml
- status-server.yaml
- ingress.yaml
configMapGenerator:
- name: 20125-config
files:
- config.yml
images:
- name: git.pyrocufflink.net/packages/20125.home
newTag: dev

20125/namespace.yaml Normal file (+6)

@ -0,0 +1,6 @@
apiVersion: v1
kind: Namespace
metadata:
name: "20125"
labels:
app.kubernetes.io/name: '20125'

20125/secrets.yaml Normal file (+13)

@ -0,0 +1,13 @@
apiVersion: bitnami.com/v1alpha1
kind: SealedSecret
metadata:
name: imagepull-gitea
namespace: "20125"
spec:
encryptedData:
.dockerconfigjson: AgAXY5XsnGF9E3RU9dnKXK9fWjqK0khCDf7n0vuJCLACrcM01aoWVSjl26+j7oSTTyc7t5C+EKPJnuKdlkNfh2Omw9Lh3dn8rPRYcBRmUEAyt0TvBVkxBIiP+49y39QEV1opYY+b1gLVJ5ZEC92u5uI9y8xovwx9wqtKLfQ+KCfc5m93AYaQJ9EcnV1DSEkv/HdtWNikQes2hO6pTLF/GHrh/s79eIeXMTm5oG/OyJWTOGQdy1SdoGWLLf31dsjhsJyGMYOtWx42Nou20lWqmdoy4Dd8OXuuhcfeDNzkH187mI4XpVjbS0M+P5teJsGiTwx+VyJlGQnEaquIiHy3KLt3YH/ltGeNeCNbFmSDa70A3IdP/t0cAXN20rlIFGVzqNrAOhMYtiTDEgaKOrL7mwM4i4NTCnLTA2nXU7gLEcLGPRqO7LKIhc1/6d1xWMT58SFjHAVklFt/lq1udY6zE8gXHp+RQ+7hIIEu500YiaKubvh2MsOKIqYOaX99Q4BW7PQhwjjwtFHFuwNjZn8wAbDq+3gsDSqgeFPgHAs7nPIImcBne/fTobsHhUVvxEnBNLCRtSqpkvOLzpgC+dRNsD6ZTcXPhFWTEOvjBMcUWqOTcRmd8DCsdxalM42x/ZQjlNlubZeuaNki+4pA80bYlsLWt3A2nWtcVbO/aAYrT1qiK/d8NZsPNidD0HE1rkUkCNv5KgXVWUfVU7ptX1YFpYXXuEIFeWzulH3gWmdW4q+t2nGHAqwkszZfijtpsexBttff1ym3rgBTGHFmQRkmSMbNHIAq29ehuVrxkH7uM8Q1cXXmMnGgre0ijtUfW9zMlx92jR86187xOLM3/hxANhfyt4eZVMwx8D42facMxxAngCi01vYTwqihA9mtBFkKlkQdKCH1NxgWQqwAJi87utgHoFivxeM+Pck7Zeottr0yzUEisdoBAdQR99hijR2C5SnC4iURnqfi9sloj0Uuo74SxiTGapA7pg77LmvpV9Wzu6QiEm944tftcZHwaMg=
template:
metadata:
name: imagepull-gitea
namespace: "20125"
type: kubernetes.io/dockerconfigjson

20125/status-server-ca.yaml Normal file (+32)

@ -0,0 +1,32 @@
apiVersion: cert-manager.io/v1
kind: Issuer
metadata:
name: selfsigned-ca
spec:
selfSigned: {}
---
apiVersion: cert-manager.io/v1
kind: Certificate
metadata:
name: status-server-ca
spec:
isCA: true
commonName: 20125 CA
secretName: status-server-ca-secret
privateKey:
algorithm: ECDSA
size: 256
issuerRef:
name: selfsigned-ca
kind: Issuer
group: cert-manager.io
---
apiVersion: cert-manager.io/v1
kind: Issuer
metadata:
name: status-server-ca
spec:
ca:
secretName: status-server-ca-secret

20125/status-server.yaml Normal file (+46)

@ -0,0 +1,46 @@
apiVersion: v1
kind: Service
metadata:
labels: &labels
app.kubernetes.io/name: status-server
app.kubernetes.io/component: status-server
name: status-server
spec:
ports:
- port: 80
protocol: TCP
targetPort: 20125
selector: *labels
type: ClusterIP
---
apiVersion: apps/v1
kind: Deployment
metadata:
labels: &labels
app.kubernetes.io/name: status-server
app.kubernetes.io/component: status-server
name: status-server
spec:
replicas: 1
selector:
matchLabels: *labels
template:
metadata:
labels: *labels
spec:
containers:
- name: status-server
image: git.pyrocufflink.net/packages/20125.home
imagePullPolicy: Always
volumeMounts:
- mountPath: /usr/local/share/20125.home/config.yml
name: config
subPath: config.yml
readOnly: True
imagePullSecrets:
- name: imagepull-gitea
volumes:
- name: config
configMap:
name: 20125-config

cert-manager/cert-exporter.config.yml Normal file (+41)

@ -0,0 +1,41 @@
git_repo: gitea@git.pyrocufflink.blue:dustin/certs.git
certs:
- name: pyrocufflink-cert
namespace: default
key: certificates/_.pyrocufflink.net.key
cert: certificates/_.pyrocufflink.net.crt
bundle: certificates/_.pyrocufflink.net.pem
- name: dustinhatchname-cert
namespace: default
key: acme.sh/dustin.hatch.name/dustin.hatch.name.key
cert: acme.sh/dustin.hatch.name/fullchain.cer
- name: hatchchat-cert
namespace: default
key: certificates/hatch.chat.key
cert: certificates/hatch.chat.crt
bundle: certificates/hatch.chat.pem
- name: tabitha-cert
namespace: default
key: certificates/tabitha.biz.key
cert: certificates/tabitha.biz.crt
bundle: certificates/tabitha.biz.pem
- name: chmod777-cert
namespace: default
key: certificates/chmod777.sh.key
cert: certificates/chmod777.sh.crt
bundle: certificates/chmod777.sh.pem
- name: dustinandtabitha-cert
namespace: default
key: certificates/dustinandtabitha.com.key
cert: certificates/dustinandtabitha.com.crt
bundle: certificates/dustinandtabitha.com.pem
- name: hlc-cert
namespace: default
key: certificates/hatchlearningcenter.org.key
cert: certificates/hatchlearningcenter.org.crt
bundle: certificates/hatchlearningcenter.org.pem
- name: appsxyz-cert
namespace: default
key: certificates/apps.du5t1n.xyz.key
cert: certificates/apps.du5t1n.xyz.crt
bundle: certificates/apps.du5t1n.xyz.pem

cert-manager/cert-exporter.yaml

@ -4,51 +4,6 @@ metadata:
name: cert-exporter
namespace: cert-manager
---
apiVersion: v1
kind: ConfigMap
metadata:
name: cert-exporter
namespace: cert-manager
data:
config.yml: |
git_repo: gitea@git.pyrocufflink.blue:dustin/certs.git
certs:
- name: pyrocufflink-cert
namespace: default
key: certificates/_.pyrocufflink.net.key
cert: certificates/_.pyrocufflink.net.crt
bundle: certificates/_.pyrocufflink.net.pem
- name: dustinhatchname-cert
namespace: default
key: acme.sh/dustin.hatch.name/dustin.hatch.name.key
cert: acme.sh/dustin.hatch.name/fullchain.cer
- name: hatchchat-cert
namespace: default
key: certificates/hatch.chat.key
cert: certificates/hatch.chat.crt
bundle: certificates/hatch.chat.pem
- name: tabitha-cert
namespace: default
key: certificates/tabitha.biz.key
cert: certificates/tabitha.biz.crt
bundle: certificates/tabitha.biz.pem
- name: chmod777-cert
namespace: default
key: certificates/chmod777.sh.key
cert: certificates/chmod777.sh.crt
bundle: certificates/chmod777.sh.pem
- name: dustinandtabitha-cert
namespace: default
key: certificates/dustinandtabitha.com.key
cert: certificates/dustinandtabitha.com.crt
bundle: certificates/dustinandtabitha.com.pem
- name: hlc-cert
namespace: default
key: certificates/hatchlearningcenter.org.key
cert: certificates/hatchlearningcenter.org.crt
bundle: certificates/hatchlearningcenter.org.pem
---
apiVersion: rbac.authorization.k8s.io/v1
kind: Role
@ -69,6 +24,7 @@ rules:
- chmod777-cert
- dustinandtabitha-cert
- hlc-cert
- appsxyz-cert
---
apiVersion: rbac.authorization.k8s.io/v1

@ -136,3 +136,20 @@ spec:
privateKey:
algorithm: ECDSA
rotationPolicy: Always
---
apiVersion: cert-manager.io/v1
kind: Certificate
metadata:
name: appsxyz-cert
spec:
secretName: appsxyz-cert
dnsNames:
- apps.du5t1n.xyz
issuerRef:
group: cert-manager.io
kind: ClusterIssuer
name: zerossl
privateKey:
algorithm: ECDSA
rotationPolicy: Always

cert-manager/kustomization.yaml

@ -8,6 +8,14 @@ resources:
- cert-exporter.yaml
- dch-ca-issuer.yaml
configMapGenerator:
- name: cert-exporter
namespace: cert-manager
files:
- config.yml=cert-exporter.config.yml
options:
disableNameSuffixHash: True
secretGenerator:
- name: zerossl-eab
namespace: cert-manager

@ -19,3 +19,8 @@ patches:
name: fleetlock
spec:
clusterIP: 10.96.1.15
images:
- name: quay.io/poseidon/fleetlock
newName: git.pyrocufflink.net/containerimages/fleetlock
newTag: vadimberezniker-wait_evictions

@ -31,3 +31,12 @@ route:
- alertgroup=Frigate
group_by:
- alertname
inhibit_rules:
- source_matchers:
- alertname=Free disk space is very low
target_matchers:
- alertname=Free disk space is low
equal:
- instance
- df

alerts.yml

@ -1,12 +1,35 @@
groups:
- name: default alert
rules:
- alert: DiskUsage
- alert: Free disk space is low
expr: >-
sum(collectd_df_df_complex{type!="free"}) by (instance, df) / sum(collectd_df_df_complex{df!="var-log", df!="var-lib-frigate"}) by (instance, df) > .75
or sum(collectd_df_df_complex{type!="free"}) by (instance, df) / sum(collectd_df_df_complex{df="var-log"}) by (instance, df) > .95
or sum(collectd_df_df_complex{type!="free"}) by (instance, df) / sum(collectd_df_df_complex{df="var-lib-frigate"}) by (instance, df) > .95
(
filesystem:usage:percent{
kubernetes_io_arch!="arm64",
df!="mmcblk0p3",
df!="var-lib-frigate",
df!="var-log",
}
or
filesystem:usage:percent{
kubernetes_io_arch="arm64",
df!="boot",
}
or
filesystem:usage:percent{
df="mmcblk0p3",
instance!="nut0.pyrocufflink.blue",
}
) > .75
for: 2h
annotations:
severity: minor
- alert: Free disk space is very low
expr: >-
filesystem:usage:percent > 0.9
for: 2h
annotations:
severity: minor
- alert: TheWebsiteIsDown
expr: >-
probe_success{job="websites"} == 0
@ -37,10 +60,43 @@ groups:
- name: mdraid
rules:
- alert: mdraid missing disk
expr: collectd_md_md_disks{type="missing", instance!~"burp.*"} != 0
expr: collectd_md_md_disks{type="missing", instance!="chromie.pyrocufflink.blue"} != 0
- alert: mdraid failed disk
expr: collectd_md_md_disks{type="failed"} != 0
- name: Backups
rules:
- alert: disks need swapped
expr:
time() - tlast_change_over_time(
(
collectd_md_md_disks{instance="chromie.pyrocufflink.blue", type="active"}
or last_over_time(collectd_md_md_disks{instance="chromie.pyrocufflink.blue", type="active"})[1d]
)[90d]
) > 86400 * 30
annotations:
summary: The disks in the backup array need swapped
description: >-
The disks in the backup RAID-1 (mirror) array should be swapped
periodically. One disk should be online and mounted while the other
is stored in the fireproof safe. Switching them ensures that even if
something happens to the active disk, such as hardware failure, power
surge, fire, or accidental `rm -rf`, the offline disk is only out of
date by a few weeks.
- alert: disk needs archived
expr:
sum(
collectd_md_md_disks{instance="chromie.pyrocufflink.blue", type=~"missing|spare"}
) < 1
annotations:
summary: One of the disks in the backup array should be archived
description: >-
The disks in the backup RAID-1 (mirror) array should be swapped
periodically. One disk should be online and mounted while the other
is stored in the fireproof safe. All of the disks are currently
online; one needs to be disconnected and moved to the safe as soon as
possible.
- name: certificates
rules:
- alert: certificate will expire soon
@ -202,7 +258,11 @@ groups:
expr: >-
max_over_time(
increase(
flower_events_total{job="paperless-ngx", type="task-failed"}
flower_events_total{
job="paperless-ngx",
type="task-failed",
task!="documents.tasks.consume_file",
}
)[24h]
) > 0
annotations:

@ -38,6 +38,7 @@ configMapGenerator:
- name: vmalert-rules
files:
- alerts.yml
- recording.yml
options:
disableNameSuffixHash: true
labels:

recording.yml Normal file (+8)

@ -0,0 +1,8 @@
groups:
- name: collectd
rules:
- record: filesystem:usage:percent
expr: >-
sum without (type) (collectd_df_df_complex{type!="free"})
/ sum without (type) (collectd_df_df_complex)

@ -88,6 +88,9 @@ scrape_configs:
kubernetes_sd_configs:
- role: node
relabel_configs:
- source_labels: [__meta_kubernetes_node_name]
regex: .*\.compute\.internal$
action: drop
- action: labelmap
regex: __meta_kubernetes_node_label_(.+)
- source_labels:
@ -258,6 +261,9 @@ scrape_configs:
- source_labels: [__meta_kubernetes_node_name]
regex: k8s-ctrl0.pyrocufflink.blue
action: drop
- source_labels: [__meta_kubernetes_node_name]
regex: .*\.compute\.internal$
action: drop
- source_labels: [__meta_kubernetes_node_name]
regex: '(.+)'
target_label: __address__
@ -304,6 +310,9 @@ scrape_configs:
- role: pod
label: app.kubernetes.io/name=promtail
relabel_configs:
- source_labels: [__meta_kubernetes_node_name]
regex: .*\.compute\.internal$
action: drop
- source_labels: [__address__]
target_label: instance
- source_labels: [__meta_kubernetes_pod_node_name]
@ -485,3 +494,20 @@ scrape_configs:
- source_labels: [__address__]
target_label: __address__
replacement: '$1:8118'
- job_name: jellyfin
scheme: https
dns_sd_configs:
- names:
- jellyfin.pyrocufflink.blue
type: A
port: 443
relabel_configs:
- source_labels:
- __meta_dns_name
- __meta_dns_srv_record_port
separator: ':'
target_label: __address__
- source_labels:
- __meta_dns_name
target_label: instance