Prometheus Alerting: Writing Rules That Actually Work

Bad alerting is worse than no alerting. Alert fatigue — where engineers learn to ignore pages because most of them are noise — is one of the most common failure modes in on-call culture. The solution isn’t fewer alerts; it’s better ones.

Here’s how to write Prometheus alerting rules that signal real problems without burning out your team.

Alert Rule Anatomy

groups:
  - name: application.rules
    rules:
      - alert: HighErrorRate
        expr: |
          (
            sum by (service) (rate(http_requests_total{status=~"5.."}[5m]))
            /
            sum by (service) (rate(http_requests_total[5m]))
          ) > 0.05
        for: 5m
        labels:
          severity: critical
          team: backend
        annotations:
          summary: "High error rate on {{ $labels.service }}"
          description: "Error rate is {{ $value | humanizePercentage }} over the last 5 minutes."
          runbook_url: "https://wiki.internal/runbooks/high-error-rate"

Key fields:

  • expr: The PromQL expression. Every series the expression returns becomes a pending (then firing) alert; the comparison filters out healthy series.
  • for: How long the condition must hold before alerting. Prevents flapping on transient spikes.
  • labels: Added to the alert — used by Alertmanager for routing.
  • annotations: Human-readable context, not used for routing.

The Four Golden Signals

Focus alerting on what users experience:

Latency — How slow are requests?

- alert: HighLatency
  expr: histogram_quantile(0.99, sum by (le) (rate(http_request_duration_seconds_bucket[5m]))) > 1.0
  for: 10m
  labels:
    severity: warning

Traffic — Are we getting requests? Sudden drops can mean incidents.

- alert: TrafficDrop
  expr: sum(rate(http_requests_total[5m])) < 10
  for: 5m
  labels:
    severity: warning
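A fixed floor like 10 req/s only makes sense if you know the service's normal traffic. One hedged alternative (a sketch, not part of the rule set above) is to compare current traffic to the same window a week earlier using offset:

```yaml
# Sketch: fire when traffic falls below half of last week's
# rate at the same point in the weekly cycle.
- alert: TrafficDropRelative
  expr: |
    sum(rate(http_requests_total[5m]))
    < 0.5 * sum(rate(http_requests_total[5m] offset 1w))
  for: 10m
  labels:
    severity: warning
```

This adapts to daily and weekly traffic patterns, at the cost of being blind to problems that already existed a week ago.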

Errors — What fraction of requests fail?

- alert: ErrorRateCritical
  expr: |
    sum(rate(http_requests_total{status=~"5.."}[5m]))
    / sum(rate(http_requests_total[5m])) > 0.01
  for: 5m
  labels:
    severity: critical

Saturation — How close to full are our resources?

- alert: DiskSpaceLow
  expr: (node_filesystem_avail_bytes{fstype!~"tmpfs"} / node_filesystem_size_bytes{fstype!~"tmpfs"}) < 0.10
  for: 15m
  labels:
    severity: warning

Recording Rules: Performance and Reuse

Complex PromQL expressions are expensive to evaluate repeatedly — especially at high cardinality. Recording rules pre-compute expressions and store results as new time series.

groups:
  - name: request_metrics
    interval: 30s
    rules:
      # Pre-compute error rate
      - record: job:http_errors:rate5m
        expr: sum(rate(http_requests_total{status=~"5.."}[5m])) by (job)

      # Pre-compute total rate
      - record: job:http_requests:rate5m
        expr: sum(rate(http_requests_total[5m])) by (job)

  - name: application.alerts
    rules:
      # Now the alert is cheap — uses pre-computed series
      - alert: HighErrorRate
        expr: job:http_errors:rate5m / job:http_requests:rate5m > 0.05

Naming convention for recording rules: level:metric:operations — the aggregation level, the metric name, then the operations applied (e.g. job:http_requests:rate5m).

The for Clause: Avoiding Flapping

Without for, any single evaluation in which the expression returns results fires the alert immediately. With noisy metrics, this causes constant firing and resolving.

for: 5m   # Condition must be true for 5 consecutive minutes

Choosing the right duration:

  • Critical, fast-moving (disk full, crash loop): for: 1m or for: 2m
  • Performance degradation: for: 5m to for: 10m
  • Capacity planning warnings: for: 30m or longer

Too long and you delay detection of real problems; too short and transient blips page people, which feeds alert fatigue.
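A sketch of that guidance side by side (metric names assume kube-state-metrics and node_exporter; thresholds are illustrative):

```yaml
# Fast-moving, critical failure: short for, page quickly.
- alert: PodCrashLooping
  expr: increase(kube_pod_container_status_restarts_total[10m]) > 3
  for: 2m
  labels:
    severity: critical

# Slow-moving capacity trend: predict_linear estimates whether the
# disk fills within 24h; a long for rides out noisy estimates.
- alert: DiskWillFillIn24h
  expr: predict_linear(node_filesystem_avail_bytes[6h], 24 * 3600) < 0
  for: 30m
  labels:
    severity: warning
```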

Alert Severity Levels

Define clear severity levels and stick to them:

Severity   Meaning                                       Notification
critical   User impact now, immediate action required    PagerDuty, wake someone up
warning    Degraded but not down, investigate soon       Slack, business hours
info       Noteworthy but not actionable                 Ticket, no alert

labels:
  severity: critical  # critical | warning | info

Alertmanager routes on labels:

# alertmanager.yml
route:
  group_by: [alertname, severity]
  receiver: slack-warnings
  routes:
    - match:
        severity: critical
      receiver: pagerduty-critical
    - match:
        severity: warning
      receiver: slack-warnings

Inhibition: Suppress Downstream Noise

When a node is down, you’ll get 20 alerts about all the pods on that node. Inhibition suppresses child alerts when a parent alert fires:

inhibit_rules:
  - source_match:
      alertname: NodeDown
    target_match_re:
      alertname: (PodCrashLooping|HighLatency|HighErrorRate)
    equal: [node]
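Note that equal: [node] only suppresses target alerts that carry an identical node label to the source. A sketch of a matching source alert (this assumes a node label is attached to up, e.g. via relabeling — adapt to your label scheme):

```yaml
- alert: NodeDown
  expr: up{job="node-exporter"} == 0  # node label assumed via relabeling
  for: 2m
  labels:
    severity: critical
```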

Runbooks: Making Alerts Actionable

An alert without a runbook is just noise with a name. Every alert should have a runbook_url annotation linking to documentation that answers:

  1. What does this alert mean? Plain-language explanation.
  2. What are the common causes? List them.
  3. How do I investigate? Step-by-step queries and commands.
  4. How do I resolve it? Mitigation steps.
  5. Who else should I contact? Escalation path.

Template:

annotations:
  summary: "{{ $labels.job }} error rate above 5%"
  description: |
    Service {{ $labels.job }} has an error rate of {{ $value | humanizePercentage }}.
    Threshold: 5%. Duration: 5m.
  runbook_url: "https://runbooks.internal/high-error-rate"

Testing Alert Rules

Use promtool to unit-test your alerting rules:

# tests/alert_tests.yml
rule_files:
  - ../rules/application.rules.yml

tests:
  - interval: 1m
    input_series:
      - series: 'http_requests_total{status="500", job="api"}'
        values: "0+20x30"  # counter grows by 20/min for 30 minutes (~10% of traffic)
      - series: 'http_requests_total{status="200", job="api"}'
        values: "0+180x30" # counter grows by 180/min for 30 minutes

    alert_rule_test:
      - eval_time: 15m
        alertname: HighErrorRate
        exp_alerts:
          - exp_labels:
              severity: critical
              job: api
Run the tests with:

promtool test rules tests/alert_tests.yml

Testing prevents broken alerting from making it to production. Your on-call rotation will thank you.

Anti-Patterns to Avoid

Alerting on symptoms, not causes — alert on user-visible error rate, not on CPU usage (which may or may not matter).

Thresholds without context — “CPU > 80%” means nothing without knowing what’s normal for that service.

Missing for clauses — transient metric spikes generate alert noise.

Alerts without runbooks — if the engineer can’t do anything with the alert at 3 AM, it shouldn’t page them.

Too many critical alerts — when everything is critical, nothing is.
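To make the first two anti-patterns concrete, a hedged before/after sketch (node_cpu_utilization is illustrative, not a real node_exporter metric):

```yaml
# Anti-pattern: cause-based, context-free threshold. High CPU may be
# harmless (batch jobs, headroom before autoscaling kicks in).
- alert: HighCPU
  expr: node_cpu_utilization > 0.8

# Better: symptom-based, tied to what users experience, with a for
# clause to ride out transient spikes.
- alert: HighP99Latency
  expr: histogram_quantile(0.99, sum by (le) (rate(http_request_duration_seconds_bucket[5m]))) > 0.5
  for: 10m
  labels:
    severity: critical
```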

Good alerting is opinionated and minimal. You want your on-call engineers to trust every page they receive. That trust is earned one well-tuned rule at a time.