Prometheus & Grafana: Monitoring Your Infrastructure
Build comprehensive monitoring with Prometheus for metrics collection and Grafana for visualization. Learn alerting, PromQL, and production monitoring patterns.
Monitoring is not optional — it’s how you know your systems are working. Prometheus collects metrics, Grafana visualizes them, and Alertmanager tells you when things break. This guide covers the patterns that work in production.
Architecture Overview
┌─────────────┐     ┌─────────────┐     ┌─────────────┐
│  Your Apps  │────▶│ Prometheus  │────▶│   Grafana   │
│  (metrics)  │     │  (scrape)   │     │ (visualize) │
└─────────────┘     └─────────────┘     └─────────────┘
                           │
                           ▼
                    ┌─────────────┐
                    │Alertmanager │
                    │  (notify)   │
                    └─────────────┘
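To experiment locally before touching production, the three components can be wired together with Docker Compose. This is a minimal sketch; the image tags, ports, and mounted file paths are assumptions, not requirements:

# docker-compose.yml (local sandbox only)
version: '3.8'
services:
  prometheus:
    image: prom/prometheus:v2.47.0
    ports:
      - "9090:9090"
    volumes:
      - ./prometheus.yml:/etc/prometheus/prometheus.yml   # config shown below
  alertmanager:
    image: prom/alertmanager:v0.26.0
    ports:
      - "9093:9093"
    volumes:
      - ./alertmanager.yml:/etc/alertmanager/alertmanager.yml
  grafana:
    image: grafana/grafana:10.1.0
    ports:
      - "3000:3000"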
Prometheus Configuration
# prometheus.yml
global:
  scrape_interval: 15s       # How often to scrape
  evaluation_interval: 15s   # How often to evaluate rules
  external_labels:
    cluster: production
    region: us-east-1

# Alertmanager configuration
alerting:
  alertmanagers:
    - static_configs:
        - targets:
            - alertmanager:9093

# Rule files
rule_files:
  - /etc/prometheus/rules/*.yml

# Scrape configurations
scrape_configs:
  # Prometheus itself
  - job_name: 'prometheus'
    static_configs:
      - targets: ['localhost:9090']

  # Node Exporter (system metrics)
  - job_name: 'node'
    static_configs:
      - targets:
          - 'node-exporter-1:9100'
          - 'node-exporter-2:9100'

  # Kubernetes service discovery
  - job_name: 'kubernetes-pods'
    kubernetes_sd_configs:
      - role: pod
    relabel_configs:
      # Only scrape pods with the prometheus.io/scrape annotation
      - source_labels: [__meta_kubernetes_pod_annotation_prometheus_io_scrape]
        action: keep
        regex: true
      # Use custom port from annotation (keep the pod IP, swap in the annotated port)
      - source_labels: [__address__, __meta_kubernetes_pod_annotation_prometheus_io_port]
        action: replace
        regex: ([^:]+)(?::\d+)?;(\d+)
        replacement: $1:$2
        target_label: __address__
      # Use custom path from annotation
      - source_labels: [__meta_kubernetes_pod_annotation_prometheus_io_path]
        action: replace
        target_label: __metrics_path__
        regex: (.+)
Instrumenting Applications
Node.js with prom-client
// metrics.js
const express = require('express');
const client = require('prom-client');

const app = express();

// Collect default metrics (CPU, memory, event loop lag, etc.)
client.collectDefaultMetrics({ prefix: 'myapp_' });

// Custom metrics
const httpRequestDuration = new client.Histogram({
  name: 'http_request_duration_seconds',
  help: 'Duration of HTTP requests in seconds',
  labelNames: ['method', 'route', 'status_code'],
  buckets: [0.01, 0.05, 0.1, 0.5, 1, 5]
});

const httpRequestsTotal = new client.Counter({
  name: 'http_requests_total',
  help: 'Total number of HTTP requests',
  labelNames: ['method', 'route', 'status_code']
});

const activeConnections = new client.Gauge({
  name: 'http_active_connections',
  help: 'Number of active HTTP connections'
});

// Middleware to record metrics
function metricsMiddleware(req, res, next) {
  const start = Date.now();
  activeConnections.inc();

  res.on('finish', () => {
    const duration = (Date.now() - start) / 1000;
    const route = req.route?.path || req.path;
    const labels = {
      method: req.method,
      route: route,
      status_code: res.statusCode
    };

    httpRequestDuration.observe(labels, duration);
    httpRequestsTotal.inc(labels);
    activeConnections.dec();
  });

  next();
}

app.use(metricsMiddleware);

// Metrics endpoint
app.get('/metrics', async (req, res) => {
  res.set('Content-Type', client.register.contentType);
  res.end(await client.register.metrics());
});
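When Prometheus scrapes /metrics it receives the plain-text exposition format. The output looks roughly like this (the label values and numbers are illustrative):

# HELP http_requests_total Total number of HTTP requests
# TYPE http_requests_total counter
http_requests_total{method="GET",route="/api/users",status_code="200"} 42
http_requests_total{method="GET",route="/api/users",status_code="500"} 3
# HELP http_active_connections Number of active HTTP connections
# TYPE http_active_connections gauge
http_active_connections 7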
Python with prometheus_client
from flask import Flask, request
from prometheus_client import (
    Counter, Histogram, Gauge, generate_latest, CONTENT_TYPE_LATEST
)
from functools import wraps
import time

app = Flask(__name__)

# Define metrics
REQUEST_COUNT = Counter(
    'http_requests_total',
    'Total HTTP requests',
    ['method', 'endpoint', 'status']
)

REQUEST_LATENCY = Histogram(
    'http_request_duration_seconds',
    'HTTP request latency',
    ['method', 'endpoint'],
    buckets=[.01, .05, .1, .5, 1, 5]
)

IN_PROGRESS = Gauge(
    'http_requests_in_progress',
    'HTTP requests in progress',
    ['method', 'endpoint']
)

def track_requests(func):
    @wraps(func)
    def wrapper(*args, **kwargs):
        method = request.method
        endpoint = request.endpoint or 'unknown'
        IN_PROGRESS.labels(method=method, endpoint=endpoint).inc()
        start = time.time()
        try:
            response = func(*args, **kwargs)
            status = response.status_code if hasattr(response, 'status_code') else 200
            return response
        except Exception:
            status = 500
            raise
        finally:
            duration = time.time() - start
            REQUEST_COUNT.labels(method=method, endpoint=endpoint, status=status).inc()
            REQUEST_LATENCY.labels(method=method, endpoint=endpoint).observe(duration)
            IN_PROGRESS.labels(method=method, endpoint=endpoint).dec()
    return wrapper

@app.route('/metrics')
def metrics():
    return generate_latest(), 200, {'Content-Type': CONTENT_TYPE_LATEST}
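To put the decorator to work, apply it to each Flask view. The route and handler below are illustrative, not part of the original code:

@app.route('/api/users')
@track_requests
def list_users():
    # Any Flask view works the same way; the decorator records its metrics
    return {'users': []}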
PromQL Queries
Rate and Increase
# Request rate per second (last 5 minutes)
rate(http_requests_total[5m])
# Request rate by status code
sum(rate(http_requests_total[5m])) by (status_code)
# Error rate percentage
sum(rate(http_requests_total{status_code=~"5.."}[5m])) /
sum(rate(http_requests_total[5m])) * 100
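rate() gives a per-second value; increase() gives the total growth of a counter over the window:

# Total requests served in the last hour
increase(http_requests_total[1h])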
Latency Percentiles
# P50 latency
histogram_quantile(0.50, rate(http_request_duration_seconds_bucket[5m]))
# P95 latency
histogram_quantile(0.95, rate(http_request_duration_seconds_bucket[5m]))
# P99 latency by route
histogram_quantile(0.99,
  sum(rate(http_request_duration_seconds_bucket[5m])) by (route, le)
)
Resource Usage
# CPU usage percentage
100 - (avg(rate(node_cpu_seconds_total{mode="idle"}[5m])) * 100)
# Memory usage percentage
(1 - node_memory_MemAvailable_bytes / node_memory_MemTotal_bytes) * 100
# Disk usage percentage
100 - (node_filesystem_avail_bytes / node_filesystem_size_bytes * 100)
Aggregations
# Sum across all instances
sum(http_requests_total)
# Average latency by job (histograms expose _sum and _count, not a plain value)
sum(rate(http_request_duration_seconds_sum[5m])) by (job) /
sum(rate(http_request_duration_seconds_count[5m])) by (job)
# Max by instance
max(node_memory_MemTotal_bytes) by (instance)
# Top 5 endpoints by request count
topk(5, sum(rate(http_requests_total[1h])) by (endpoint))
Alert Rules
# /etc/prometheus/rules/alerts.yml
groups:
  - name: application
    rules:
      # High error rate
      - alert: HighErrorRate
        expr: |
          sum(rate(http_requests_total{status_code=~"5.."}[5m])) /
          sum(rate(http_requests_total[5m])) > 0.05
        for: 5m
        labels:
          severity: critical
        annotations:
          summary: "High error rate detected"
          description: "Error rate is {{ $value | humanizePercentage }} (> 5%)"

      # High latency
      - alert: HighLatency
        expr: |
          histogram_quantile(0.95,
            sum(rate(http_request_duration_seconds_bucket[5m])) by (le)
          ) > 1
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "High latency detected"
          description: "P95 latency is {{ $value | humanizeDuration }}"

  - name: infrastructure
    rules:
      # Instance down
      - alert: InstanceDown
        expr: up == 0
        for: 1m
        labels:
          severity: critical
        annotations:
          summary: "Instance {{ $labels.instance }} is down"
          description: "{{ $labels.job }}/{{ $labels.instance }} has been down for > 1 minute"

      # High CPU usage
      - alert: HighCPUUsage
        expr: |
          100 - (avg by(instance) (rate(node_cpu_seconds_total{mode="idle"}[5m])) * 100) > 80
        for: 10m
        labels:
          severity: warning
        annotations:
          summary: "High CPU usage on {{ $labels.instance }}"
          description: "CPU usage is {{ $value | humanize }}%"

      # Disk space low
      - alert: DiskSpaceLow
        expr: |
          (node_filesystem_avail_bytes / node_filesystem_size_bytes) * 100 < 10
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "Low disk space on {{ $labels.instance }}"
          description: "Only {{ $value | humanize }}% disk space remaining"

      # Memory pressure
      - alert: HighMemoryUsage
        expr: |
          (1 - node_memory_MemAvailable_bytes / node_memory_MemTotal_bytes) * 100 > 90
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "High memory usage on {{ $labels.instance }}"
          description: "Memory usage is {{ $value | humanize }}%"
Alertmanager Configuration
# alertmanager.yml
global:
  resolve_timeout: 5m
  slack_api_url: 'https://hooks.slack.com/services/XXX/YYY/ZZZ'

route:
  receiver: 'slack-notifications'
  group_by: ['alertname', 'cluster']
  group_wait: 30s
  group_interval: 5m
  repeat_interval: 4h
  routes:
    # Critical alerts go to PagerDuty; continue lets the next route also match
    - match:
        severity: critical
      receiver: 'pagerduty'
      continue: true
    # ...so criticals also reach Slack
    - match:
        severity: critical
      receiver: 'slack-notifications'
    # Warnings go to Slack only
    - match:
        severity: warning
      receiver: 'slack-notifications'

receivers:
  - name: 'slack-notifications'
    slack_configs:
      - channel: '#alerts'
        send_resolved: true
        title: '{{ .Status | toUpper }}: {{ .CommonAnnotations.summary }}'
        text: '{{ .CommonAnnotations.description }}'
        color: '{{ if eq .Status "firing" }}danger{{ else }}good{{ end }}'

  - name: 'pagerduty'
    pagerduty_configs:
      - service_key: 'your-pagerduty-key'
        severity: '{{ .CommonLabels.severity }}'

inhibit_rules:
  # Don't alert on warning if critical is firing
  - source_match:
      severity: 'critical'
    target_match:
      severity: 'warning'
    equal: ['alertname', 'instance']
Grafana Dashboards
Dashboard as Code (JSON)
{
  "title": "Application Overview",
  "panels": [
    {
      "title": "Request Rate",
      "type": "timeseries",
      "gridPos": { "h": 8, "w": 12, "x": 0, "y": 0 },
      "targets": [
        {
          "expr": "sum(rate(http_requests_total[5m])) by (status_code)",
          "legendFormat": "{{status_code}}"
        }
      ],
      "fieldConfig": {
        "defaults": {
          "unit": "reqps"
        }
      }
    },
    {
      "title": "Latency (P95)",
      "type": "timeseries",
      "gridPos": { "h": 8, "w": 12, "x": 12, "y": 0 },
      "targets": [
        {
          "expr": "histogram_quantile(0.95, sum(rate(http_request_duration_seconds_bucket[5m])) by (le))",
          "legendFormat": "P95"
        }
      ],
      "fieldConfig": {
        "defaults": {
          "unit": "s"
        }
      }
    },
    {
      "title": "Error Rate",
      "type": "stat",
      "gridPos": { "h": 4, "w": 6, "x": 0, "y": 8 },
      "targets": [
        {
          "expr": "sum(rate(http_requests_total{status_code=~\"5..\"}[5m])) / sum(rate(http_requests_total[5m])) * 100"
        }
      ],
      "fieldConfig": {
        "defaults": {
          "unit": "percent",
          "thresholds": {
            "steps": [
              { "value": 0, "color": "green" },
              { "value": 1, "color": "yellow" },
              { "value": 5, "color": "red" }
            ]
          }
        }
      }
    }
  ]
}
Provisioning Dashboards
# grafana/provisioning/dashboards/default.yml
apiVersion: 1

providers:
  - name: 'Default'
    folder: 'General'
    type: file
    options:
      path: /var/lib/grafana/dashboards
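Dashboards are only useful once Grafana knows where Prometheus lives. Datasources can be provisioned the same way; a minimal sketch, assuming Prometheus is reachable as prometheus:9090 on the cluster or compose network:

# grafana/provisioning/datasources/prometheus.yml
apiVersion: 1

datasources:
  - name: Prometheus
    type: prometheus
    access: proxy
    url: http://prometheus:9090
    isDefault: true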
Recording Rules
Pre-compute expensive queries:
# /etc/prometheus/rules/recording.yml
groups:
  - name: http_metrics
    interval: 15s
    rules:
      # Pre-compute request rate
      - record: job:http_requests:rate5m
        expr: sum(rate(http_requests_total[5m])) by (job)

      # Pre-compute error rate
      - record: job:http_errors:rate5m
        expr: sum(rate(http_requests_total{status_code=~"5.."}[5m])) by (job)

      # Pre-compute latency percentiles
      - record: job:http_latency:p95
        expr: |
          histogram_quantile(0.95,
            sum(rate(http_request_duration_seconds_bucket[5m])) by (job, le)
          )

      - record: job:http_latency:p99
        expr: |
          histogram_quantile(0.99,
            sum(rate(http_request_duration_seconds_bucket[5m])) by (job, le)
          )
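Dashboards and alerts can then reference the recorded series instead of re-running the raw query, for example:

# Error ratio per job, built from the pre-computed series
job:http_errors:rate5m / job:http_requests:rate5m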
Kubernetes Deployment
# prometheus-deployment.yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: prometheus
spec:
  replicas: 1
  selector:
    matchLabels:
      app: prometheus
  template:
    metadata:
      labels:
        app: prometheus
    spec:
      serviceAccountName: prometheus
      containers:
        - name: prometheus
          image: prom/prometheus:v2.47.0
          args:
            - '--config.file=/etc/prometheus/prometheus.yml'
            - '--storage.tsdb.path=/prometheus'
            - '--storage.tsdb.retention.time=30d'
            - '--web.enable-lifecycle'
          ports:
            - containerPort: 9090
          volumeMounts:
            - name: config
              mountPath: /etc/prometheus
            - name: rules
              mountPath: /etc/prometheus/rules
            - name: data
              mountPath: /prometheus
          resources:
            requests:
              cpu: 500m
              memory: 1Gi
            limits:
              cpu: 1000m
              memory: 2Gi
      volumes:
        - name: config
          configMap:
            name: prometheus-config
        - name: rules
          configMap:
            name: prometheus-rules
        - name: data
          persistentVolumeClaim:
            claimName: prometheus-data
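The kubernetes-pods scrape job and the serviceAccountName above assume the prometheus ServiceAccount can read pod metadata from the API server. A minimal RBAC sketch (the namespace and object names are assumptions):

# prometheus-rbac.yaml
apiVersion: v1
kind: ServiceAccount
metadata:
  name: prometheus
---
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRole
metadata:
  name: prometheus
rules:
  - apiGroups: [""]
    resources: ["nodes", "services", "endpoints", "pods"]
    verbs: ["get", "list", "watch"]
---
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRoleBinding
metadata:
  name: prometheus
subjects:
  - kind: ServiceAccount
    name: prometheus
    namespace: default
roleRef:
  kind: ClusterRole
  name: prometheus
  apiGroup: rbac.authorization.k8s.io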
Key Takeaways
- Instrument everything — you can’t fix what you can’t see
- Use histograms for latency — averages hide problems
- Alert on symptoms, not causes — users care about errors, not CPU
- Set meaningful thresholds — too many alerts = ignored alerts
- Use recording rules — pre-compute expensive queries
- Label wisely — high cardinality kills performance (see the sketch after this list)
- Retain data appropriately — storage costs money
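On the labeling point above: every distinct combination of label values becomes its own time series, so unbounded values such as user IDs belong in logs rather than labels. A small illustrative sketch in Python:

from prometheus_client import Counter

# Bad: user_id is unbounded, so this would create one series per user
# logins = Counter('logins_total', 'Login attempts', ['user_id'])

# Good: label values come from a small, fixed set
logins = Counter('logins_total', 'Login attempts', ['method', 'status'])
logins.labels(method='password', status='success').inc()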
“Monitoring is not about collecting data. It’s about turning data into action.”