# Prometheus Alerting: Writing Rules That Actually Work

*How to write effective Prometheus alerting rules — avoiding alert fatigue, using recording rules, and building runbooks that help on-call engineers.*
Bad alerting is worse than no alerting. Alert fatigue — where engineers learn to ignore pages because most of them are noise — is one of the most common failure modes in on-call culture. The solution isn’t fewer alerts; it’s better ones.
Here’s how to write Prometheus alerting rules that signal real problems without burning out your team.
## Alert Rule Anatomy
```yaml
groups:
  - name: application.rules
    rules:
      - alert: HighErrorRate
        expr: |
          (
            sum by (job) (rate(http_requests_total{status=~"5.."}[5m]))
            /
            sum by (job) (rate(http_requests_total[5m]))
          ) > 0.05
        for: 5m
        labels:
          severity: critical
          team: backend
        annotations:
          summary: "High error rate on {{ $labels.job }}"
          description: "Error rate is {{ $value | humanizePercentage }} over the last 5 minutes."
          runbook_url: "https://wiki.internal/runbooks/high-error-rate"
```
Key fields:

- `expr`: The PromQL expression. The alert fires for every series the expression returns; comparisons like `> 0.05` filter out series where the condition is false.
- `for`: How long the condition must hold before alerting. Prevents flapping on transient spikes.
- `labels`: Added to the alert — used by Alertmanager for routing.
- `annotations`: Human-readable context, not used for routing.
## The Four Golden Signals
Focus alerting on what users experience:
**Latency** — How slow are requests?

```yaml
- alert: HighLatency
  expr: histogram_quantile(0.99, rate(http_request_duration_seconds_bucket[5m])) > 1.0
  for: 10m
  labels:
    severity: warning
```
**Traffic** — Are we getting requests? Sudden drops can mean incidents.

```yaml
- alert: TrafficDrop
  expr: sum(rate(http_requests_total[5m])) < 10
  for: 5m
  labels:
    severity: warning
```
**Errors** — What fraction of requests fail?

```yaml
- alert: ErrorRateCritical
  expr: |
    sum(rate(http_requests_total{status=~"5.."}[5m]))
      / sum(rate(http_requests_total[5m])) > 0.01
  for: 5m
  labels:
    severity: critical
```
**Saturation** — How close to full are our resources?

```yaml
- alert: DiskSpaceLow
  expr: (node_filesystem_avail_bytes / node_filesystem_size_bytes) < 0.10
  for: 15m
  labels:
    severity: warning
```
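For saturation, alerting on the trend often pages earlier and with less noise than a static threshold. A sketch using `predict_linear` — the lookback window and four-hour horizon are illustrative values, not recommendations from this article:

```yaml
# Fires if the linear trend over the last 6h predicts the filesystem
# will hit zero free bytes within 4 hours (14400 seconds).
- alert: DiskWillFillSoon
  expr: predict_linear(node_filesystem_avail_bytes[6h], 4 * 3600) < 0
  for: 30m
  labels:
    severity: warning
```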
## Recording Rules: Performance and Reuse
Complex PromQL expressions are expensive to evaluate repeatedly — especially at high cardinality. Recording rules pre-compute expressions and store results as new time series.
```yaml
groups:
  - name: request_metrics
    interval: 30s
    rules:
      # Pre-compute error rate
      - record: job:http_errors:rate5m
        expr: sum(rate(http_requests_total{status=~"5.."}[5m])) by (job)
      # Pre-compute total rate
      - record: job:http_requests:rate5m
        expr: sum(rate(http_requests_total[5m])) by (job)
  - name: application.alerts
    rules:
      # Now the alert is cheap — it uses the pre-computed series
      - alert: HighErrorRate
        expr: job:http_errors:rate5m / job:http_requests:rate5m > 0.05
```
Naming convention for recording rules: `level:metric:operations`.
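Reading one of the names above against that convention (this annotates the series already defined in the example, it introduces nothing new):

```yaml
# level:      the aggregation level — the labels that remain ("job")
# metric:     the source metric name ("http_requests")
# operations: what was applied, most specific first ("rate5m")
- record: job:http_requests:rate5m
  expr: sum(rate(http_requests_total[5m])) by (job)
```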
## The `for` Clause: Avoiding Flapping
Without `for`, any single evaluation where the expression returns a result fires the alert immediately. With noisy metrics, this causes constant firing and resolving.
```yaml
for: 5m  # Condition must hold for 5 consecutive minutes
```
Choosing the right duration:

- Critical, fast-moving (disk full, crash loop): `for: 1m` or `for: 2m`
- Performance degradation: `for: 5m` to `for: 10m`
- Capacity planning warnings: `for: 30m` or longer
Too long: real problems are missed. Too short: alert fatigue.
## Alert Severity Levels
Define clear severity levels and stick to them:
| Severity | Meaning | Notification |
|---|---|---|
| `critical` | User impact now, immediate action required | PagerDuty, wake someone up |
| `warning` | Degraded but not down, investigate soon | Slack, business hours |
| `info` | Noteworthy but not actionable | Ticket, no alert |
```yaml
labels:
  severity: critical  # critical | warning | info
```
Alertmanager routes on labels:
```yaml
# alertmanager.yml
route:
  group_by: [alertname, severity]
  receiver: slack-warnings
  routes:
    - match:
        severity: critical
      receiver: pagerduty-critical
    - match:
        severity: warning
      receiver: slack-warnings
```
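The route tree only names its receivers; they are defined separately in the same file. A minimal sketch — the integration key and Slack channel are placeholders, not values from this document:

```yaml
receivers:
  - name: pagerduty-critical
    pagerduty_configs:
      - service_key: "<your-pagerduty-integration-key>"  # placeholder
  - name: slack-warnings
    slack_configs:
      - channel: "#alerts-warnings"  # placeholder channel
        send_resolved: true          # also notify when the alert clears
```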
## Inhibition: Suppress Downstream Noise
When a node is down, you’ll get 20 alerts about all the pods on that node. Inhibition suppresses child alerts when a parent alert fires:
```yaml
inhibit_rules:
  - source_match:
      alertname: NodeDown
    target_match_re:
      alertname: (PodCrashLooping|HighLatency|HighErrorRate)
    equal: [node]
```

The `equal: [node]` clause restricts inhibition to alerts that share the same `node` label value, so a down node only silences alerts for its own pods.
## Runbooks: Making Alerts Actionable
An alert without a runbook is just noise with a name. Every alert should have a `runbook_url` annotation linking to documentation that answers:
- What does this alert mean? Plain-language explanation.
- What are the common causes? List them.
- How do I investigate? Step-by-step queries and commands.
- How do I resolve it? Mitigation steps.
- Who else should I contact? Escalation path.
Template:
```yaml
annotations:
  summary: "{{ $labels.job }} error rate above 5%"
  description: |
    Service {{ $labels.job }} has an error rate of {{ $value | humanizePercentage }}.
    Threshold: 5%. Duration: 5m.
  runbook_url: "https://runbooks.internal/high-error-rate"
```
## Testing Alert Rules
Use `promtool` to unit-test your alerting rules:
```yaml
# tests/alert_tests.yml
rule_files:
  - ../rules/application.rules.yml

tests:
  - interval: 1m
    input_series:
      - series: 'http_requests_total{status="500", job="api"}'
        values: "0+20x30"   # counter grows by 20/min for 30 minutes
      - series: 'http_requests_total{status="200", job="api"}'
        values: "0+180x30"  # counter grows by 180/min for 30 minutes
    # 20 / (20 + 180) = 10% error rate, comfortably above the 5% threshold
    alert_rule_test:
      - eval_time: 15m
        alertname: HighErrorRate
        exp_alerts:
          - exp_labels:
              severity: critical
              team: backend
              job: api
```
```shell
promtool test rules tests/alert_tests.yml
```
Testing prevents broken alerting from making it to production. Your on-call rotation will thank you.
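Beyond unit tests, `promtool` can also validate rule-file syntax, which is cheap enough to run in CI on every commit (the file path below is illustrative):

```shell
# Exits nonzero on syntax or schema errors, so it works as a CI gate.
promtool check rules rules/application.rules.yml
```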
## Anti-Patterns to Avoid
- **Alerting on causes, not symptoms** — alert on user-visible error rate, not on CPU usage (which may or may not matter).
- **Thresholds without context** — "CPU > 80%" means nothing without knowing what's normal for that service.
- **Missing `for` clauses** — transient metric spikes generate alert noise.
- **Alerts without runbooks** — if the engineer can't do anything with the alert at 3 AM, it shouldn't page them.
- **Too many critical alerts** — when everything is critical, nothing is.
Good alerting is opinionated and minimal. You want your on-call engineers to trust every page they receive. That trust is earned one well-tuned rule at a time.