Chaos Engineering: Breaking Things on Purpose to Build Resilient Systems

In 2010, Netflix moved its entire infrastructure to AWS and immediately discovered a terrifying truth: distributed systems fail in unpredictable ways. Servers die. Networks partition. Dependencies become unavailable. The question wasn’t if these things would happen — it was when, and whether they’d take down Netflix when they did.

Their answer: build a tool that randomly kills production servers. If your system can survive random server deaths every day in production, you know it can handle the real thing.

That tool was Chaos Monkey. The discipline it spawned is chaos engineering.


What Is Chaos Engineering?

The canonical definition comes from the Principles of Chaos Engineering:

Chaos Engineering is the discipline of experimenting on a system in order to build confidence in the system’s capability to withstand turbulent conditions in production.

Three key words: experimenting, confidence, production.

You’re not breaking things randomly and hoping for the best. You’re running controlled experiments with a hypothesis, measuring outcomes, and learning from the results. The goal is building evidence that your system is resilient — not just assuming it is.


The Chaos Experiment Model

A chaos experiment follows the scientific method:

1. Define Steady State

What does “healthy” look like? This should be measurable:

  • p99 response time < 200ms
  • Error rate < 0.1%
  • Order completion rate > 99.9%

Behavioral metrics matter more than system metrics. “CPU < 80%” doesn’t capture user impact; “checkout success rate > 99%” does.
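A steady-state definition like this can be encoded directly as a checker. A minimal Python sketch, using the thresholds above (the metric names are illustrative):

```python
# Sketch: steady state as a set of measurable, behavioral thresholds.
# Metric names are illustrative; wire them to your own metrics source.
STEADY_STATE = {
    "p99_response_ms": lambda v: v < 200,        # p99 response time < 200ms
    "error_rate": lambda v: v < 0.001,           # error rate < 0.1%
    "order_completion_rate": lambda v: v > 0.999,  # completion rate > 99.9%
}

def steady_state_holds(metrics):
    """Return True only if every behavioral metric is within bounds."""
    return all(check(metrics[name]) for name, check in STEADY_STATE.items())
```

Anything that makes this function return False during an experiment is, by definition, user-visible damage.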

2. Form a Hypothesis

“If we kill one instance of the payment service, the system will maintain > 99% checkout success rate because we have 3 instances and auto-scaling enabled.”

3. Introduce the Variable (the chaos)

Kill the instance. Inject network latency. Consume all available memory. Corrupt a DNS response.

4. Observe and Measure

Did steady state hold? What happened to the metrics? Were there cascading failures you didn’t expect?

5. Learn and Fix

If the hypothesis held: document the evidence, increase confidence. If it didn’t: you found a real vulnerability before a production incident did.
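The five steps can be sketched as a driver loop. The `measure_steady_state`, `inject_failure`, and `rollback` hooks here are hypothetical placeholders for your own metrics and injection tooling:

```python
# Sketch of the five-step experiment loop. The three hooks are hypothetical
# placeholders you would wire to your own metrics and injection tooling.
def run_experiment(measure_steady_state, inject_failure, rollback):
    baseline = measure_steady_state()   # 1. define and verify steady state
    assert baseline["healthy"], "never start chaos on an already-unhealthy system"
    # 2. hypothesis: steady state survives the injected failure
    inject_failure()                    # 3. introduce the variable
    observed = measure_steady_state()   # 4. observe and measure
    if not observed["healthy"]:         # 5. learn and fix
        rollback()                      # a real vulnerability, found on your terms
    return {"hypothesis_held": observed["healthy"], "observed": observed}
```

Note the precondition check: if the system is already unhealthy, injecting chaos teaches you nothing and makes the incident worse.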


Types of Chaos Experiments

Infrastructure Chaos

Instance/Pod termination: Kill servers, pods, or containers randomly.

# Kill a random pod in the payment namespace
kubectl get pods -n payment -o name | shuf -n 1 | xargs kubectl delete

Resource exhaustion: Consume CPU, memory, or disk.

# stress-ng: consume all available memory
stress-ng --vm 1 --vm-bytes 100% --timeout 60s

Network failure: Introduce latency, packet loss, or partitions.

# tc: add 200ms latency with 20ms jitter to eth0
tc qdisc add dev eth0 root netem delay 200ms 20ms

Application Chaos

Dependency failure: Make a downstream service return errors or time out.

Clock skew: Change system time to test time-sensitive logic (caching, token expiry, rate limiting).

Process hang: Pause a process without killing it — simulates a frozen service.

# chaos-toolkit action: inject HTTP errors into a service
{
    "type": "action",
    "name": "inject-http-errors",
    "provider": {
        "type": "http",
        "url": "http://chaosmonkey.internal/inject",
        "method": "POST",
        "payload": {
            "target": "payment-service",
            "error_rate": 0.5,
            "duration_seconds": 60
        }
    }
}
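For in-process testing, the same dependency failure can be simulated with a wrapper that fails a configurable fraction of calls. A minimal sketch (the wrapper and its parameters are illustrative, not part of any particular tool):

```python
import random

# Sketch: wrap a dependency call so a configurable fraction of calls fail,
# simulating a flaky downstream service. Illustrative, not tied to any tool.
def with_error_rate(func, error_rate, seed=None):
    rng = random.Random(seed)  # seedable, so experiments are reproducible
    def chaotic(*args, **kwargs):
        if rng.random() < error_rate:
            raise ConnectionError("injected dependency failure")
        return func(*args, **kwargs)
    return chaotic
```

Wrapping a client call with `error_rate=0.5` mirrors the 50% error injection in the action above, without needing any network-level tooling.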

Data and State Chaos

Database failover: Force a primary database to fail and verify replication + failover works.

Cache invalidation: Flush Redis and verify the system degrades gracefully rather than collapsing.

Message queue delay: Introduce processing lag in Kafka/SQS consumers.
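The graceful degradation these experiments verify usually comes down to a read-through pattern: treat the cache as an optimization, never a requirement. A minimal sketch, with illustrative cache and database interfaces:

```python
# Sketch of read-through caching that degrades gracefully. The cache and
# database objects are illustrative: anything with get/set methods.
def read_through(key, cache, database):
    """Serve from cache when possible; on a miss or cache outage, fall back
    to the source of truth and repopulate. A flushed or dead cache then
    means slower reads, not failed requests."""
    try:
        value = cache.get(key)
        if value is not None:
            return value
    except ConnectionError:
        pass  # cache is down: degrade to the database, don't fail the request
    value = database.get(key)
    try:
        cache.set(key, value)
    except ConnectionError:
        pass  # repopulation is best-effort
    return value
```

A cache-flush experiment against code structured like this should show a latency bump and a database load spike, not an error spike.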


Tools

Chaos Monkey (Netflix)

The original. Randomly terminates EC2 instances in production. Simple, effective, terrifying to deploy the first time.

Chaos Toolkit

Open-source Python framework for writing portable chaos experiments as JSON/YAML. Integrates with Kubernetes, AWS, GCP, Prometheus, and more.

# chaos-toolkit experiment
title: Verify payment service survives pod termination
description: Kill one payment pod; verify checkout rate stays > 99%
steady-state-hypothesis:
  title: Checkout success rate stays healthy
  probes:
    - type: probe
      name: checkout-success-rate
      tolerance:
        type: range
        range: [0.99, 1.0]
      provider:
        type: python
        module: chaosprometheus.probes
        func: query
        arguments:
          query: sum(rate(checkout_success_total[5m])) / sum(rate(checkout_total[5m]))
method:
  - type: action
    name: terminate-payment-pod
    provider:
      type: python
      module: chaosk8s.pod.actions
      func: terminate_pods
      arguments:
        label_selector: "app=payment"
        ns: production
        qty: 1

LitmusChaos

Kubernetes-native chaos framework. Experiments defined as CRDs, executed by operators. Good GitOps integration.

apiVersion: litmuschaos.io/v1alpha1
kind: ChaosEngine
metadata:
  name: payment-pod-kill
spec:
  appinfo:
    appns: production
    applabel: "app=payment"
    appkind: deployment
  experiments:
    - name: pod-delete
      spec:
        components:
          env:
            - name: TOTAL_CHAOS_DURATION
              value: "60"
            - name: CHAOS_INTERVAL
              value: "10"
            - name: FORCE
              value: "false"

AWS Fault Injection Simulator (FIS)

Managed chaos engineering for AWS. Inject failures into EC2, ECS, EKS, RDS, and more with IAM-controlled blast radius.

Gremlin

Commercial SaaS with a polished UI. Good for teams that want guardrails and don’t want to operate their own chaos infrastructure.


Game Days

A game day is a planned, team-wide chaos event. Instead of automated random failures, the whole team participates:

  1. Announce the scenario (or don’t, for more realistic results)
  2. Inject failures at a controlled time
  3. Observe: Does the on-call rotation detect the issue? How quickly? What runbooks are used?
  4. Review: Timeline of events, what worked, what didn’t

Game days test not just system resilience but operational resilience — monitoring, alerting, communication, runbooks, and human decision-making under pressure.

A good game day scenario: “The primary RDS instance has been unavailable for 5 minutes. Your alerting has not fired. The CEO is asking why checkout is broken. Go.”


Starting Small: The Chaos Maturity Model

Don’t start by killing production databases. Build up gradually:

Level 0: No chaos engineering. Pure faith in testing and hope for good luck.

Level 1: Experiments in staging. Kill pods, introduce latency, verify behavior in a controlled environment.

Level 2: Automated experiments in production (off-peak). Run chaos experiments during low-traffic periods with automatic abort conditions.

Level 3: Continuous chaos in production. Chaos runs constantly, automatically, within defined blast radius limits. This is where Netflix operates.

Most organizations start at Level 1 and take 6-12 months to reach Level 2.


Abort Conditions

Every chaos experiment must have clear abort criteria:

abort_conditions:
  - metric: error_rate
    threshold: "> 1%"
    action: stop_experiment_and_rollback
  - metric: response_time_p99
    threshold: "> 500ms"
    action: stop_experiment_and_alert

If steady state is violated beyond acceptable bounds, stop the experiment automatically. Chaos engineering is about building confidence, not causing incidents.
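A guard like the one above reduces to a predicate evaluated against live metrics while the experiment runs. A minimal sketch, with illustrative metric names matching the config:

```python
# Sketch: abort conditions as predicates over live metrics, mirroring the
# thresholds in the config above. Metric names are illustrative.
ABORT_CONDITIONS = [
    ("error_rate", lambda v: v > 0.01),             # > 1%
    ("response_time_p99_ms", lambda v: v > 500),    # > 500ms
]

def should_abort(metrics):
    """True if any abort condition is violated: stop, roll back, alert."""
    return any(pred(metrics[name]) for name, pred in ABORT_CONDITIONS)
```

The experiment runner polls this on every measurement tick; a single True ends the experiment before it becomes an incident.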


What Chaos Engineering Reveals

Teams consistently discover the same categories of problems:

Missing circuit breakers: Slow dependencies cascade into full outages when there’s no fallback.

Inadequate timeouts: Services wait forever for responses, holding resources until the system saturates.

Single points of failure that weren’t obvious: That “highly available” system that actually depends on one config service with no replica.

Incomplete runbooks: On-call engineers don’t know what to do when a specific failure occurs.

Alert gaps: The system degrades significantly before any alert fires.

Recovery procedures that don’t work: Failover scripts that haven’t been tested and don’t actually fail over.
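Missing circuit breakers are the most common finding, so it is worth seeing how small the fix is. A minimal circuit-breaker sketch, with illustrative thresholds:

```python
import time

class CircuitBreaker:
    """Sketch of a circuit breaker: open after `max_failures` consecutive
    failures, fail fast while open, then allow a trial call after
    `reset_after` seconds. Thresholds are illustrative."""
    def __init__(self, max_failures=3, reset_after=30.0):
        self.max_failures = max_failures
        self.reset_after = reset_after
        self.failures = 0
        self.opened_at = None

    def call(self, func, *args, **kwargs):
        if self.opened_at is not None:
            if time.monotonic() - self.opened_at < self.reset_after:
                raise RuntimeError("circuit open: failing fast, not waiting")
            self.opened_at = None  # half-open: permit one trial call
        try:
            result = func(*args, **kwargs)
        except Exception:
            self.failures += 1
            if self.failures >= self.max_failures:
                self.opened_at = time.monotonic()
            raise
        self.failures = 0
        return result
```

With this in place, a slow or dead dependency costs a few failed calls and then fast rejections, instead of a thread pool full of hung requests.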


The Culture Shift

Chaos engineering requires psychological safety. If engineers fear blame for outages, they’ll never deliberately cause one — even in controlled conditions.

The practice requires a culture that:

  • Views experiments that expose vulnerabilities as successes (you found it before production did)
  • Doesn’t punish engineers for participating in chaos
  • Treats production incidents as learning opportunities, not failures of character

This is why chaos engineering and blameless postmortems go together. You can’t have one without the other.


The goal of chaos engineering isn’t to make systems fail — it’s to make teams confident their systems won’t. That confidence has to be earned through evidence, not assumed.

When the real outage comes (and it will), you want it to feel like a drill you’ve run before.