Resilience Engineering · Field Handbook

Chaos Engineering

// "If it hurts, do it more often — in a controlled way, before your users do it for you."

The practitioner's guide to intentional failure injection. Covers Gremlin, LitmusChaos, network fault experiments, resource exhaustion, gameday planning, and a maturity model for building provably resilient systems.

NIST SP 800-160

Gremlin · LitmusChaos

SRE Aligned

CNCF Landscape

Overview & Philosophy

// WHAT IS CHAOS ENGINEERING

Chaos Engineering is the discipline of experimenting on a system to build confidence in its ability to withstand turbulent conditions in production. Popularized by Netflix's Simian Army, formalized in the 2016 "Principles of Chaos Engineering" manifesto, and now a CNCF project ecosystem, it is the engineering practice of proactively injecting controlled failure to discover weaknesses before they manifest as incidents.

The key word is experiment. Chaos Engineering is not random destruction. It is a scientific process: define steady state, propose a hypothesis, inject failure, observe, and learn. Every experiment should answer a specific question about system behavior under adverse conditions.

Break It First

Proactively cause failures in a controlled environment before your users cause them accidentally — at 3 AM, during peak traffic, with no runbook.

Learn From It

Every experiment produces knowledge. Either your hypothesis is confirmed (system is resilient), or you've discovered a real weakness — before it costs you uptime.

Harden It

Weaknesses become hardening tasks. Fix the failure mode. Re-run the experiment. Repeat until the system is provably resilient — not just presumably resilient.

Chaos ≠ Random: Breaking things randomly is sabotage. Chaos Engineering is a structured scientific method with defined scope, observability pre-requisites, rollback procedures, and explicit hypotheses. If you can't measure the impact, you're not doing chaos engineering — you're just breaking things.

Core Principles

// THE FIVE PRINCIPLES OF CHAOS ENGINEERING

The Principles of Chaos Engineering define five foundational rules, evolved from Netflix's decade of production experimentation.

PRINCIPLE-01

Build a Hypothesis Around Steady State

Focus on the measurable output of the system — not internal attributes. Steady state is any metric you can observe that represents normal operation: request throughput, error rate, p99 latency, orders per minute. Your hypothesis must be quantitative: "Steady state is maintained when error rate remains <0.1% and p99 latency remains <250ms."
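As a rough sketch, the quoted hypothesis can be written as a check over a window of request samples. The `Sample` shape, the nearest-rank percentile method, and the function names are illustrative assumptions rather than part of any monitoring tool; only the two thresholds come from the hypothesis above.

```python
import math
from dataclasses import dataclass


@dataclass(frozen=True)
class Sample:
    """One observed request (hypothetical shape, for illustration only)."""
    latency_ms: float
    ok: bool


def p99(latencies: list[float]) -> float:
    """Nearest-rank 99th percentile of a non-empty list."""
    ordered = sorted(latencies)
    rank = math.ceil(0.99 * len(ordered))  # 1-based rank
    return ordered[rank - 1]


def steady_state_holds(samples: list[Sample],
                       max_error_rate: float = 0.001,
                       max_p99_ms: float = 250.0) -> bool:
    """True when error rate < 0.1% and p99 latency < 250 ms."""
    if not samples:
        return False  # no data: the hypothesis cannot be evaluated
    errors = sum(1 for s in samples if not s.ok)
    error_rate = errors / len(samples)
    latency_p99 = p99([s.latency_ms for s in samples])
    return error_rate < max_error_rate and latency_p99 < max_p99_ms
```

Returning `False` on an empty window is a deliberate choice in this sketch: missing telemetry is treated as "steady state not demonstrated" rather than as a pass.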

PRINCIPLE-02

Vary Real-World Events

Inject events that reflect real failure modes: hardware failures, network partitions, dependency timeouts, traffic surges, malicious inputs. Don't invent fictional failure modes — model experiments on post-mortems and known failure patterns from your incident history and from industry reports (AWS us-east-1 outages, BGP leaks, memory leak patterns).

PRINCIPLE-03

Run Experiments in Production

Staging environments don't reflect production traffic patterns, data shapes, or dependency graphs. Experiments in staging teach you about staging. The goal is production confidence — which requires production experiments, with appropriate blast radius controls. Start with canary cohorts (e.g., 1% of hosts) and expand only after validation.

PRINCIPLE-04

Automate Experiments to Run Continuously

Systems change constantly — new deployments, dependency version bumps, config changes. A resilience property confirmed last month may not hold today. Integrate experiments into your CI/CD pipeline and run them on a schedule. A passing chaos experiment is a green build signal, same as a passing unit test.

PRINCIPLE-05

Minimize Blast Radius

The cost of discovery must always be less than the cost of a production incident. Scope every experiment precisely — limit target population (1 pod, 1 region, 1% of users), set a maximum duration, pre-configure automatic rollback, and have a human kill-switch always accessible. Expand scope only when smaller tests confirm safety.

Why Chaos Engineering

// THE CASE FOR INTENTIONAL FAILURE

Complex distributed systems have failure modes that cannot be reasoned about from first principles or detected by unit tests. The number of possible state combinations in a microservices architecture with 50+ services exceeds what any human or test suite can enumerate. The only way to know how the system behaves is to observe it under stress.

✕ Without Chaos Engineering

Unknown failure modes discovered during incidents
Resilience assumptions based on hope, not evidence
Dependency failures cause cascading outages
Runbooks untested — fail during the actual incident
Engineers fear production; deploys are risky rituals
MTTR measured in hours, not minutes
"We thought we had circuit breakers, but..."

✓ With Chaos Engineering

Failure modes discovered and remediated proactively
Resilience properties are measured, not assumed
Dependency failures are isolated behind tested circuit breakers
Runbooks executed under experiment conditions — battle-tested
Engineers build confidence through regular controlled failures
MTTR driven down through practiced recovery
Every incident produces a new chaos experiment

Netflix origin: Netflix built Chaos Monkey around 2010, during its migration to AWS, where any instance can disappear without warning. Randomly terminating production instances made that failure routine, so every service had to be designed to survive it. The Simian Army grew from there, and Netflix has since automated much of its chaos experimentation.

The Scientific Method

// EXPERIMENT LIFECYCLE

Every chaos experiment follows a structured scientific method. Deviating from this — especially skipping observability setup or rollback planning — converts a controlled experiment into a production incident.

Step 01 · Define Steady State: baseline metrics, SLO thresholds
Step 02 · Hypothesize: predict system behavior under fault
Step 03 · Design: scope, targets, duration, abort criteria
Step 04 · Execute: inject fault, observe, monitor abort signals
Step 05 · Analyze: compare actual vs expected, document findings

Experiment Template — Structured Definition

Experiment: CHE-0042
Title:      "Payment service degrades gracefully when database replica fails"
Author:     platform-reliability@company.com
Date:       2024-08-15
Status:     APPROVED

STEADY STATE DEFINITION
  Metric:    payments.success_rate
  Baseline:  99.95% over 5 min rolling window (measured 1h pre-experiment)
  Threshold: Remains ≥ 99.80% throughout experiment
  Source:    Datadog dashboard: pay-svc-resilience

HYPOTHESIS
  "When the payments-db replica is terminated, the service automatically
   fails over to the primary within 30s, and success rate remains above
   the SLO threshold of 99.80% with no customer-visible errors beyond
   the 30s failover window."

SCOPE & TARGETING
  Environment:  Production (us-east-1 only)
  Target:       1 of 3 RDS read replicas (payments-db-replica-2)
  User Impact:  ~0% — replica handles read-only analytics queries only
  Duration:     10 minutes active fault + 10 minutes recovery observation

ABORT CRITERIA (kill switch triggers)
  - payments.success_rate drops below 99.0% for more than 60s
  - Any P1 page fires during the experiment
  - Primary DB connections exceed 80% of max_connections

ROLLBACK PROCEDURE
  1. Automatic: read traffic fails over to the primary (expected within ~30s)
  2. Manual: aws rds reboot-db-instance --db-instance-identifier payments-db-replica-2
  3. Notify: #platform-oncall in Slack immediately on abort

OBSERVABILITY CHECKLIST (must be green before starting)
  ☑ Datadog dashboard open and streaming
  ☑ PagerDuty muted for: payments-db-replica-health
  ☑ Oncall engineer watching: @sre-oncall
  ☑ Rollback command ready: terminal open, command staged
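The abort criteria in this template can be sketched as a small evaluator. This is an illustration under assumptions, not part of any chaos tool: the function names, the `(timestamp_seconds, value)` series format, and passing connection usage as a fraction are all hypothetical; the thresholds (99.0%, 60s, 80%) are the ones listed in the template.

```python
def sustained_breach(series: list[tuple[float, float]],
                     threshold: float,
                     max_breach_s: float) -> bool:
    """True if the metric stayed below `threshold` for more than `max_breach_s`.

    `series` is a time-ordered list of (timestamp_seconds, value) points.
    """
    breach_started = None
    for ts, value in series:
        if value < threshold:
            if breach_started is None:
                breach_started = ts
            if ts - breach_started > max_breach_s:
                return True
        else:
            breach_started = None  # metric recovered, reset the clock
    return False


def should_abort(success_rate: list[tuple[float, float]],
                 p1_page_active: bool,
                 primary_conn_fraction: float) -> bool:
    """Evaluate the three kill-switch triggers from the template."""
    return (
        sustained_breach(success_rate, threshold=99.0, max_breach_s=60.0)
        or p1_page_active
        or primary_conn_fraction > 0.80
    )
```

A brief dip that recovers within the window does not trip the first trigger; only a continuous breach longer than 60 seconds does, which matches the "for more than 60s" wording.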

Hypothesis Design

// WRITING TESTABLE FAILURE HYPOTHESES

A well-formed chaos hypothesis has three components: a fault condition (what you inject), a predicted behavior (what the system should do), and a measurable outcome (how you verify the prediction was correct). Vague hypotheses produce uninterpretable results.

Quality	Example Hypothesis	Verdict
Weak	"The system will handle a database failure."	REJECT — no measurable outcome, no specific fault
Weak	"If we kill a pod, the service stays up."	WEAK — binary outcome, no quantitative threshold
Good	"When 1 of 3 api-gateway pods is terminated, Kubernetes schedules a replacement within 45s, and error rate remains <0.5% throughout."	ACCEPT — specific fault, quantitative threshold, time-bound
Good	"When 100ms of latency is added to all calls from payment-service to fraud-service, the circuit breaker opens within 10s and payments fall back to the allow-by-default path with 0 payment failures."	ACCEPT — fault, expected mechanism, measurable outcome
Excellent	"Under 200% of peak CPU load on 50% of cart-service instances, autoscaling provisions additional capacity within 90s, p99 checkout latency remains <800ms, and 0 shopping carts are lost."	EXCELLENT — scoped, multiple metrics, mechanism + outcome

The Hypothesis Formula

Hypothesis Structure

## Template
"When [FAULT CONDITION] is introduced to [SPECIFIC TARGET],"
"[SYSTEM MECHANISM] will respond within [TIME BOUND],"
"and [METRIC] will remain [COMPARATOR] [THRESHOLD]."

## Fault Condition examples:
"50% packet loss is injected on all egress from payment-service"
"the primary Postgres node is stopped"
"CPU is pinned at 90% on 2 of 5 API server instances"
"the Redis cache cluster is made unavailable for 2 minutes"

## System Mechanism examples:
"circuit breaker opens and serves cached responses"
"HPA scales from 3 to 6 replicas"
"read traffic fails over from replica to primary"
"retry logic with exponential backoff absorbs transient errors"

## Measurable Outcomes (tie to SLOs):
"error rate < 0.1%"
"p99 latency < 500ms"
"0 data loss events in transaction log"
"checkout completion rate > 98%"
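To keep hypotheses from drifting back into vague wording, the template can be enforced in code. The sketch below is one possible shape, assuming a hypothetical `Hypothesis` dataclass whose fields mirror the bracketed slots of the template; it refuses to render if any slot is empty.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Hypothesis:
    """Fields mirror the bracketed slots in the template (names are illustrative)."""
    fault: str
    target: str
    mechanism: str
    time_bound: str
    metric: str
    comparator: str
    threshold: str

    def render(self) -> str:
        # Every slot must be filled; an empty slot means an untestable hypothesis.
        for name, value in vars(self).items():
            if not value.strip():
                raise ValueError(f"hypothesis field '{name}' is empty")
        return (
            f"When {self.fault} is introduced to {self.target}, "
            f"{self.mechanism} will respond within {self.time_bound}, "
            f"and {self.metric} will remain {self.comparator} {self.threshold}."
        )
```

Storing the parts separately also makes it straightforward to reuse `metric`, `comparator`, and `threshold` as the input to an automated steady-state check.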

Blast Radius Control

// THE MOST IMPORTANT CONCEPT IN CHAOS ENGINEERING

Blast radius is the scope of potential impact of a chaos experiment. Minimizing blast radius is not timidity — it's engineering discipline. You learn the most from the smallest experiment that reveals information about system behavior. Expand only after smaller experiments confirm safety.

Dev (SANDBOX) → Staging (SAFE) → Prod Canary (SCOPED) → Prod Partial (CONTROLLED) → Prod Full (HIGH CONFIDENCE ONLY)

Blast Radius Limiting Techniques (Enforce)

Target selectors: scope to labels, namespaces, instance IDs — never "all"
Percentage-based: attack 1 of N, not all N simultaneously
Canary cohort: route a fraction of traffic to the experiment population
Time boxing: hard duration limits with automatic rollback at T+N
Feature flags: isolate experiment to users with a chaos-test flag enabled
Off-peak timing: run destructive experiments during low-traffic windows
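The "1 of N, never all" rule can be sketched as a target picker. The function name, the hard cap parameter, and the rule of always leaving at least one instance untouched are assumptions made for this illustration; real tools expose their own selectors.

```python
import math
import random
from typing import Optional


def pick_targets(candidates: list[str],
                 percent: float,
                 max_targets: int,
                 seed: Optional[int] = None) -> list[str]:
    """Choose a bounded random subset of instances to attack.

    Guarantees (by design of this sketch):
      - never more than `max_targets`
      - always at least one instance left untouched
      - nothing is selected when there is only one instance
    """
    if not 0 < percent <= 100:
        raise ValueError("percent must be in (0, 100]")
    if max_targets < 1:
        raise ValueError("max_targets must be at least 1")
    if len(candidates) < 2:
        return []  # attacking the only instance is a full outage, not a test
    wanted = max(1, math.floor(len(candidates) * percent / 100))
    count = min(wanted, max_targets, len(candidates) - 1)
    return random.Random(seed).sample(candidates, count)
```

With three pods and `percent=33`, this selects exactly one pod; asking for 100% of three pods still selects only two.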

Abort Criteria: Kill Switches (Required)

SLO breach: halt if error budget is being consumed at 2× normal rate
Alert fires: any P1/P2 page halts the experiment immediately
Manual abort: single command or button to stop all fault injection
Time limit: auto-halt if experiment exceeds N minutes regardless of state
Dependency health: halt if unrelated downstream services show degradation
Customer signal: support ticket spike or social media monitoring trigger
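The SLO-breach trigger can be made concrete with an error-budget burn rate. This sketch assumes that "2× normal rate" means a burn rate of 2 or more, where a burn rate of 1 consumes the budget exactly over the SLO window; that interpretation and the function names are assumptions, not a definition from the source.

```python
def burn_rate(errors: int, requests: int, slo_target: float) -> float:
    """Observed error rate divided by the error budget (1 - SLO target)."""
    if requests <= 0:
        raise ValueError("requests must be positive")
    if not 0 < slo_target < 1:
        raise ValueError("slo_target must be between 0 and 1")
    budget = 1.0 - slo_target
    return (errors / requests) / budget


def slo_breach_halt(errors: int, requests: int, slo_target: float,
                    max_burn_rate: float = 2.0) -> bool:
    """True when the budget is burning at or above the halt threshold."""
    return burn_rate(errors, requests, slo_target) >= max_burn_rate
```

For a 99.9% SLO, 1 error per 1,000 requests is a burn rate of about 1, while 3 errors per 1,000 is about 3 and would halt the experiment.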

The Three Unbreakable Rules: (1) Always have observability in place before the experiment starts — if you can't measure it, you can't control it. (2) Always have an abort button that anyone on the team can press in under 10 seconds. (3) Never run destructive experiments without an oncall engineer actively watching dashboards. Automation is not a substitute for human oversight during active fault injection.

Gremlin

// COMMERCIAL CHAOS PLATFORM — ENTERPRISE GRADE

Gremlin is a widely used commercial chaos engineering platform. It provides a broad attack library (resource, network, state, host), target selectors, scheduled experiments, scenario composition, reliability score reporting, and an always-on kill switch. It is infrastructure-agnostic: bare metal, VMs, containers, Kubernetes, and cloud-managed services.

Gremlin Attack Categories

Resource: CPU, memory, disk fill, I/O — exhaust resources to expose capacity limits
Network: latency, packet loss, packet corruption, bandwidth throttle, DNS failure, blackhole
State: process killer, time travel, power outage simulation, container kill
Application: HTTP error injection, exception injection, custom scripts

Gremlin Scenarios

Scenarios chain multiple attacks into multi-step experiments — e.g., "CPU spike → latency → pod kill" — testing how the system responds to compound failures. Scenarios can be scheduled, run in parallel, and include dependency checks before execution.

Gremlin CLI — Common Operations

Bash — Gremlin CLI

# Install Gremlin agent (RHEL/CentOS; Debian/Ubuntu use Gremlin's apt repository instead)
curl https://rpm.gremlin.com/gremlin.repo -o /etc/yum.repos.d/gremlin.repo
yum install -y gremlin gremlind

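# Authenticate
# NOTE: treat the subcommand and flag names in this listing as illustrative and
# confirm them against the CLI's built-in help for your agent version
gremlin init --team-id $GREMLIN_TEAM_ID --secret $GREMLIN_SECRET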

# ── RESOURCE ATTACKS ──────────────────────────────────

# CPU attack — spike to 90% on 4 cores for 120s
gremlin attack-cpu \
  --length 120 \
  --cores 4 \
  --percent 90 \
  --target-labels "env=production,service=payment-api"

# Memory attack — consume 2GB for 60s
gremlin attack-memory \
  --length 60 \
  --mb 2048 \
  --target-count 1 \
  --target-labels "k8s.namespace=payments"

# Disk fill — fill /tmp to 90% capacity for 30s
gremlin attack-disk \
  --length 30 \
  --volume-path /tmp \
  --percent 90 \
  --block-size 4 \
  --target-count 1

# ── NETWORK ATTACKS ───────────────────────────────────

# Latency — add 200ms to egress traffic on port 5432 (PostgreSQL)
gremlin attack-latency \
  --length 120 \
  --delay 200 \
  --egress-ports 5432 \
  --target-labels "service=order-service"

# Packet loss — 30% loss affecting Redis traffic
gremlin attack-packet-loss \
  --length 60 \
  --percent 30 \
  --egress-ports 6379 \
  --target-count 1 \
  --target-labels "env=production"

# Blackhole: block ALL traffic to a specific IP (simulates losing one host;
# target the zone's CIDR ranges instead to approximate an AZ failure)
gremlin attack-blackhole \
  --length 120 \
  --ip-addresses 10.0.2.45 \
  --target-labels "tier=api"

# ── STATE ATTACKS ─────────────────────────────────────

# Kill a process by name
gremlin attack-process \
  --length 1 \
  --process nginx \
  --target-count 1 \
  --target-labels "service=api-gateway"

# Shutdown attack: reboot 1 matching host after the configured delay
# (a host-level restart, not a SIGKILL sent to a single process)
gremlin attack-shutdown \
  --delay 5 \
  --reboot true \
  --target-count 1 \
  --target-labels "service=api-gateway"

Gremlin Scenarios via API

JSON — Gremlin Scenario Definition

{
  "name": "Payment Service Dependency Failure",
  "description": "Simulate Redis cache unavailability and verify fallback path",
  "hypothesis": "Payment success rate stays above 99.5% with Redis unavailable",

  "steps": [
    {
      "type": "delay",
      "delay": 30,          // 30s observation baseline before fault
      "description": "Establish steady-state baseline"
    },
    {
      "type": "attack",
      "attack": {
        "type": "blackhole",
        "args": {
          "length": 120,
          "egress_ports": [6379],   // Redis port
          "protocol": "TCP"
        }
      },
      "target": {
        "labels": { "service": "payment-api", "env": "production" },
        "count": 1              // Only 1 pod, not all
      },
      "description": "Block Redis egress on 1 payment-api pod"
    },
    {
      "type": "delay",
      "delay": 120,
      "description": "Recovery observation period"
    }
  ],

  "hypothesis_checks": [
    {
      "metric": "payment.success_rate",
      "threshold": 0.995,
      "comparison": "gte",
      "source": "datadog"
    }
  ],

  "halt_conditions": [
    { "metric": "payment.success_rate", "threshold": 0.99, "comparison": "lt" },
    { "metric": "pagerduty.active_p1_count", "threshold": 0, "comparison": "gt" }
  ]
}
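One way to read the `halt_conditions` above is as comparisons evaluated against live metric values. The sketch below is an assumption about how such a check could work, not Gremlin's implementation: `COMPARATORS`, `condition_met`, and `any_halt_condition_met` are hypothetical names, and treating a missing metric as a halt signal is a safety choice made for this example. Note also that the `//` annotations in the listing would need to be stripped before a JSON parser accepts it.

```python
import operator

# Maps the comparison keywords used in the scenario to Python operators.
COMPARATORS = {
    "lt": operator.lt,
    "lte": operator.le,
    "gt": operator.gt,
    "gte": operator.ge,
}


def condition_met(condition: dict, metrics: dict) -> bool:
    """True when the current metric value satisfies the halt condition."""
    value = metrics.get(condition["metric"])
    if value is None:
        return True  # fail safe: no data is treated as a reason to halt
    compare = COMPARATORS[condition["comparison"]]
    return compare(value, condition["threshold"])


def any_halt_condition_met(halt_conditions: list[dict], metrics: dict) -> bool:
    """Halt as soon as any single condition is satisfied."""
    return any(condition_met(c, metrics) for c in halt_conditions)
```

The halt threshold (0.99) is intentionally looser than the hypothesis threshold (0.995): the experiment can fail its hypothesis without being severe enough to require an emergency stop.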

Gremlin Reliability Score: Gremlin reports a Reliability Score per service, derived from the results of the reliability tests run against it. Use it to track which services have been tested, which weaknesses have been confirmed and fixed, and how program maturity changes over time.

LitmusChaos

// CNCF INCUBATING PROJECT · KUBERNETES-NATIVE

LitmusChaos is a CNCF incubating open-source chaos engineering framework for Kubernetes. It provides a library of prebuilt ChaosExperiments through ChaosHub, a ChaosCenter UI for experiment management, workflow orchestration via Argo Workflows, and deep Kubernetes integration using CRDs. It is a common choice for cloud-native Kubernetes environments.

Architecture

ChaosEngine: CR linking a target app to an experiment
ChaosExperiment: CR defining the fault type and parameters
ChaosResult: CR storing experiment results and verdict
Chaos Runner: Pod launched by the chaos operator that orchestrates experiment execution
ChaosCenter: Web UI for workflows and reporting

Experiment Hub

Pod faults: pod-delete, pod-cpu-hog, pod-memory-hog, pod-io-stress, pod-network-latency, pod-network-loss, pod-network-corruption
Node faults: node-cpu-hog, node-memory-hog, node-drain, node-taint, node-restart
Cloud: ec2-stop, ebs-loss, rds-instance-stop, azure-instance-stop

Steady-State Probes

HTTP probe: validate endpoint returns expected status code
Command probe: run a shell command, check exit code / output
Prometheus probe: query PromQL expression, validate value range
Kubernetes probe: check pod/service/node status in the cluster

LitmusChaos — ChaosEngine Manifest

YAML — Pod Delete Experiment

# Install LitmusChaos
# kubectl apply -f https://litmuschaos.github.io/litmus/litmus-operator-v3.x.x.yaml

apiVersion: litmuschaos.io/v1alpha1
kind: ChaosEngine
metadata:
  name: payment-pod-delete-chaos
  namespace: production
spec:
  # Target application
  appinfo:
    appns: production
    applabel: "app=payment-api"
    appkind: deployment

  # "active" starts the experiment; patch to "stop" to abort a running engine
  engineState: "active"
  annotationCheck: "true"   # Only target annotated pods (opt-in)

  # Steady-State hypothesis probes
  experiments:
    - name: pod-delete
      spec:
        probe:
          - name: "check-payment-api-availability"
            type: "httpProbe"
            httpProbe/inputs:
              url: "https://payment-api.internal/health"
              insecureSkipVerify: false
              method:
                get:
                  criteria: "=="
                  responseCode: "200"
            mode: "Continuous"         # probe runs throughout experiment
            runProperties:
              probeTimeout: 5
              interval: 2              # every 2s
              attempt: 3
              stopOnFailure: true   # ABORT if probe fails

          - name: "check-success-rate-prometheus"
            type: "promProbe"
            promProbe/inputs:
              endpoint: "http://prometheus:9090"
              query: "sum(rate(payment_requests_total{status='200'}[2m])) / sum(rate(payment_requests_total[2m]))"
              comparator:
                criteria: ">="
                value: "0.998"
            mode: "EOT"              # check at End Of Test
            runProperties:
              probeTimeout: 10
              interval: 5
              attempt: 3

        components:
          env:
            - name: TOTAL_CHAOS_DURATION
              value: "120"          # 120s experiment duration
            - name: CHAOS_INTERVAL
              value: "20"           # delete a pod every 20s
            - name: PODS_AFFECTED_PERC
              value: "33"           # affect ~1/3 of matching pods
            - name: FORCE
              value: "false"        # graceful termination (SIGTERM)
            - name: TARGET_PODS
              value: ""             # empty = random selection

LitmusChaos — Network Latency Experiment

YAML — Pod Network Latency

apiVersion: litmuschaos.io/v1alpha1
kind: ChaosEngine
metadata:
  name: order-svc-latency-chaos
  namespace: production
spec:
  appinfo:
    appns: production
    applabel: "app=order-service"
    appkind: deployment
  engineState: "active"

  experiments:
    - name: pod-network-latency
      spec:
        probe:
          - name: "checkout-latency-check"
            type: "promProbe"
            promProbe/inputs:
              endpoint: "http://prometheus:9090"
              # p99 checkout latency must remain under 800ms
              query: "histogram_quantile(0.99, sum(rate(checkout_duration_seconds_bucket[2m])) by (le))"
              comparator:
                criteria: "<="
                value: "0.8"
            mode: "Continuous"
            runProperties:
              probeTimeout: 10
              interval: 10
              attempt: 5
              stopOnFailure: true

        components:
          env:
            - name: TOTAL_CHAOS_DURATION
              value: "120"
            - name: NETWORK_LATENCY
              value: "200"         # 200ms added latency (ms)
            - name: JITTER
              value: "50"          # ±50ms jitter
            - name: PODS_AFFECTED_PERC
              value: "50"          # 50% of matching pods
            - name: DESTINATION_IPS
              value: ""            # no IP filter; scope comes from DESTINATION_HOSTS below
            - name: DESTINATION_HOSTS
              value: "fraud-service.production.svc.cluster.local"
            - name: CONTAINER_RUNTIME
              value: "containerd"
            - name: SOCKET_PATH
              value: "/run/containerd/containerd.sock"

Bash — Query ChaosResult after experiment

# Check experiment verdict
kubectl get chaosresult -n production

# Detailed result including probe status
kubectl describe chaosresult payment-pod-delete-chaos-pod-delete -n production

# Get verdict in script-friendly form
VERDICT=$(kubectl get chaosresult payment-pod-delete-chaos-pod-delete \
  -n production -o jsonpath='{.status.experimentStatus.verdict}')

if [ "$VERDICT" = "Pass" ]; then
  echo "✓ Experiment PASSED: system behaved as hypothesized"
elif [ "$VERDICT" = "Fail" ]; then
  echo "✗ Experiment FAILED: weakness detected, creating remediation ticket"
  # CI/CD integration: fail the pipeline
  exit 1
else
  # Awaited, Stopped, Error, or empty: no verdict, so do not report success
  echo "? Experiment inconclusive (verdict: ${VERDICT:-none})"
  exit 2
fi

Other Tools

// THE CHAOS ENGINEERING ECOSYSTEM

Tool	Type	Best For	Model
Chaos Monkey	Instance termination	AWS EC2 random termination — the original. Production-only. Tests auto-scaling and self-healing infrastructure.	OSS
Chaos Toolkit	Multi-target CLI	Open-source, extensible, driver-based. AWS, GCP, Azure, Kubernetes, Prometheus integrations. Best for scripted, composable experiments via JSON/YAML.	OSS
AWS Fault Injection Service (FIS)	AWS-native	Managed chaos for EC2, ECS, EKS, RDS, DynamoDB. Native AWS integration — no agent. IAM-controlled. Ideal for AWS-centric stacks.	MANAGED
Azure Chaos Studio	Azure-native	Azure PaaS/IaaS fault injection — VM, AKS, CosmosDB, App Service, network, disk. Integrated with Azure Monitor for abort conditions.	MANAGED
tc (Linux Traffic Control)	Network fault primitive	Kernel-level network emulation. The underlying mechanism most tools use. Useful for custom network experiments without additional tooling.	OSS
stress-ng	Resource stress	CPU, memory, I/O, and OS resource stressors. Foundational tool for manual resource exhaustion testing before adopting a full chaos platform.	OSS
ChaosBlade	Multi-layer	Host, container, Kubernetes, and application-level (JVM, C++) fault injection. Strong Java process-level chaos (method-level delays/exceptions). CNCF sandbox project.	OSS

AWS Fault Injection Service — Quick Example

JSON — AWS FIS Experiment Template

{
  "description": "Inject CPU stress on 1 production ECS task",
  "targets": {
    "PaymentECSTasks": {
      "resourceType": "aws:ecs:task",
      "resourceTags": { "service": "payment-api", "env": "production" },
      "selectionMode": "COUNT(1)"   // exactly 1 task
    }
  },
  "actions": {
    "InjectCPUStress": {
      "actionId": "aws:ecs:task-cpu-stress",
      "parameters": {
        "duration": "PT2M",        // ISO 8601: 2 minutes
        "percent": "80",          // 80% CPU
        "workers": "0"           // 0 = use all available cores
      },
      "targets": { "Tasks": "PaymentECSTasks" }
    }
  },
  "stopConditions": [
    {
      "source": "aws:cloudwatch:alarm",
      "value": "arn:aws:cloudwatch:us-east-1:123456789012:alarm:payment-api-error-rate-high"  // alarm ARN; halt if it fires
    }
  ],
  "roleArn": "arn:aws:iam::123456789012:role/FISExperimentRole",
  "tags": { "purpose": "chaos-experiment", "experiment-id": "CHE-0042" }
}

Resource Attacks

// CPU · MEMORY · DISK · I/O

Resource attacks simulate the conditions of runaway processes, memory leaks, disk-filling log floods, and I/O saturation. These are among the most common real-world failure modes — every system should be tested against them before they happen accidentally.

CPU Exhaustion (Common)

What it tests: CPU throttling behavior, request queue growth, timeout propagation, HPA scaling trigger time
What good looks like: autoscaler provisions new pods within SLO window; graceful degradation of low-priority work; no cascading failures to downstream services
Tool: stress-ng --cpu 4 --cpu-load 90 --timeout 120s
Gremlin: CPU attack on 1 pod at 90% for 120s, observe HPA events

Memory Exhaustion (Important)

What it tests: OOM killer behavior, container restart policy, memory limits vs requests, pod eviction priority
What good looks like: OOM kills the specific container; k8s restarts within 30s; service remains available via other replicas; alerting fires
Tool: stress-ng --vm 1 --vm-bytes 4G --timeout 60s
Tip: Test both the hit-limit-and-OOM path and the slow-leak path (gradual consumption over hours)

Disk Fill (Critical)

What it tests: log rotation effectiveness, ephemeral storage limits, disk-full error handling in databases
What good looks like: service degrades gracefully (not crashes); alerts fire before 100%; log rotation kicks in; recovery after space is freed
Tool: fallocate -l 10G /tmp/chaos-fill (cleanup: rm /tmp/chaos-fill)

I/O Saturation (Often Missed)

What it tests: disk I/O scheduling under load, read vs write contention, database buffer pool behavior under slow disk
What it reveals: applications that don't set I/O timeouts, synchronous I/O blocking async event loops, database lock contention amplification
Tool: stress-ng --io 4 --hdd 2 --timeout 120s

Network Chaos

// LATENCY · LOSS · PARTITION · CORRUPTION

Network chaos is often the highest-value category of experiments for distributed systems. Network faults are among the most common and most variable failure modes in cloud infrastructure. Microservice architectures have hundreds of network paths, and each one needs circuit breakers, timeouts, and retries that actually work under adverse conditions.

Fault Type	Models	What It Validates	tc Command
Latency Injection	Fixed delay, jitter, distribution	Timeout configs are set; circuit breakers open at correct threshold; p99 stays within SLO	`tc qdisc add dev eth0 root netem delay 200ms 50ms`
Packet Loss	% random loss, burst loss (Gilbert-Elliott model)	TCP retransmit behavior; application-level retry logic; error rate under partial loss	`tc qdisc add dev eth0 root netem loss 10%`
Packet Corruption	Random bit errors	Protocol-level error detection; TLS integrity; application checksum validation	`tc qdisc add dev eth0 root netem corrupt 5%`
Packet Duplication	% duplicate packets	Idempotency of request handlers; deduplication logic; double-charge prevention	`tc qdisc add dev eth0 root netem duplicate 5%`
Bandwidth Throttle	Rate limiting to N Mbit/s	Large payload transfers (uploads, exports); streaming behavior under constrained bandwidth	`tc qdisc add dev eth0 root tbf rate 1mbit burst 32kbit latency 400ms`
Network Partition	Complete blackhole to specific host/port	Circuit breaker opens; fallback activates; database failover works; split-brain handling	`iptables -A OUTPUT -d 10.0.1.5 -j DROP`
DNS Failure	NXDOMAIN, SERVFAIL, timeout	DNS caching behavior; connection pooling reuse; service discovery failure path	Block UDP port 53 via iptables or corrupt /etc/resolv.conf
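The value of application-level retries under partial loss can be estimated with simple arithmetic. The sketch below assumes each attempt fails independently with the same probability, which is a simplification: burst loss violates it, and the per-attempt failure rate at the request level is usually much lower than the raw packet loss rate because TCP retransmits hide most drops. Function names are illustrative.

```python
def failure_after_retries(per_attempt_failure: float, attempts: int) -> float:
    """Probability that all `attempts` independent tries fail."""
    if not 0 <= per_attempt_failure <= 1:
        raise ValueError("per_attempt_failure must be a probability")
    if attempts < 1:
        raise ValueError("attempts must be at least 1")
    return per_attempt_failure ** attempts


def attempts_needed(per_attempt_failure: float, target_failure: float) -> int:
    """Smallest number of attempts whose overall failure rate is <= target."""
    if not 0 <= per_attempt_failure < 1:
        raise ValueError("per_attempt_failure must be in [0, 1)")
    if target_failure <= 0:
        raise ValueError("target_failure must be positive")
    attempts = 1
    while per_attempt_failure ** attempts > target_failure:
        attempts += 1
    return attempts
```

Under these assumptions, a 10% per-attempt failure rate needs three attempts to get below a 0.5% overall failure rate. A packet-loss experiment then checks whether the real system behaves anywhere near this estimate.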

Validating Circuit Breakers

Experiment — Circuit Breaker Validation

Experiment: CHE-0021
Hypothesis: "When fraud-service is unavailable for 30s (connections
              closed before any response is sent), the circuit breaker
              in payment-api opens within 10s, and payments fall back
              to allow-by-default, maintaining payment success rate > 99.5%."

SETUP: Toxiproxy works at the TCP layer, so it cannot synthesize an HTTP 503.
       Closing every connection immediately approximates a crashed service:

# Deploy Toxiproxy as a sidecar or standalone fault proxy
# 8474 = Toxiproxy API port, 9000 = proxied port
docker run -d --name toxiproxy \
  -p 8474:8474 \
  -p 9000:9000 \
  ghcr.io/shopify/toxiproxy

# Create proxy: payment-api → fraud-service via toxiproxy
curl -X POST http://localhost:8474/proxies \
  -H "Content-Type: application/json" \
  -d '{"name": "fraud-svc", "listen": "0.0.0.0:9000", "upstream": "fraud-service:8080"}'

# INJECT: close each connection before any data is transmitted
curl -X POST http://localhost:8474/proxies/fraud-svc/toxics \
  -H "Content-Type: application/json" \
  -d '{
    "name": "conn-close-fault",
    "type": "limit_data",
    "attributes": {"bytes": 0}
  }'

# OBSERVE: Watch circuit breaker state in payment-api metrics
# Look for: resilience4j_circuitbreaker_state{name="fraud-service",state="open"} == 1
watch -n 2 'curl -s http://payment-api:8080/actuator/health | jq .'

# CLEANUP: Remove toxic (restore normal behavior)
curl -X DELETE http://localhost:8474/proxies/fraud-svc/toxics/conn-close-fault
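For readers unfamiliar with what the experiment is exercising, here is a minimal count-based circuit breaker. It is a teaching sketch only and does not reproduce Resilience4j's behavior (which uses sliding windows and failure-rate thresholds); the class, its parameters, and the injectable `clock` are assumptions made for illustration.

```python
import time


class CircuitBreaker:
    """Opens after N consecutive failures, retries after a cool-down."""

    def __init__(self, failure_threshold=5, reset_timeout_s=30.0,
                 clock=time.monotonic):
        self.failure_threshold = failure_threshold
        self.reset_timeout_s = reset_timeout_s
        self._clock = clock          # injectable so tests can control time
        self._failures = 0
        self._opened_at = None       # None means the breaker is closed

    @property
    def state(self) -> str:
        if self._opened_at is None:
            return "closed"
        if self._clock() - self._opened_at >= self.reset_timeout_s:
            return "half_open"       # cool-down elapsed: allow one trial call
        return "open"

    def call(self, func, fallback):
        state = self.state
        if state == "open":
            return fallback()        # fail fast without touching the dependency
        try:
            result = func()
        except Exception:
            self._record_failure(trial=(state == "half_open"))
            return fallback()
        self._failures = 0
        self._opened_at = None       # any success closes the breaker
        return result

    def _record_failure(self, trial: bool) -> None:
        self._failures += 1
        if trial or self._failures >= self.failure_threshold:
            self._opened_at = self._clock()  # (re)open and restart cool-down
```

In the experiment above, `func` corresponds to the call to fraud-service and `fallback` to the allow-by-default path. The hypothesis is effectively a claim about how quickly the failure threshold is reached under real traffic.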

State & Process Chaos

// POD KILLS · PROCESS CRASHES · CLOCK SKEW · ZONE FAILURE

Pod / Process Termination (High Value)

The most basic chaos experiment. Kill one or more pods and verify the service remains available. Tests Kubernetes scheduling, readiness probes, pod disruption budgets, and graceful shutdown behavior (SIGTERM → drain → SIGKILL).

kubectl delete pod payment-api-7d9f4b-xk2zr -n production

Key checks: downtime during pod death, replacement scheduling time, connection draining (no in-flight 500s during graceful shutdown), and PDB enforcement during node drains. PDBs limit voluntary evictions, so a direct kubectl delete pod bypasses them.

Time Travel / Clock Skew (Subtle)

Shift the system clock forward or backward on target nodes. Tests: JWT expiry validation (tokens suddenly "expired"), certificate validity window checks, distributed lock TTLs, time-series data ordering, cron job timing. Often reveals hidden assumptions about clock synchronization.

date -s "+2 hours" — Gremlin time-travel attack: ±1h to ±30d
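The JWT case shows why clock skew is subtle. The check below follows the usual reading of the `exp` and `nbf` claims (valid when `nbf <= now < exp`) with an optional leeway; the function name and the leeway parameter are illustrative, and real JWT libraries expose their own equivalents.

```python
def token_is_valid(exp: float, nbf: float, now: float,
                   leeway_s: float = 0.0) -> bool:
    """Time-window check for a token, with optional clock-skew leeway.

    exp, nbf, and now are Unix timestamps in seconds.
    """
    return (nbf - leeway_s) <= now < (exp + leeway_s)
```

A one-hour token verified on a node whose clock runs two hours fast is rejected as expired, while a node running two minutes slow rejects a freshly issued token as not yet valid unless a leeway is configured. A clock-skew experiment reveals which of these paths the service actually handles.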

Availability Zone Failure (Critical)

Simulate the loss of an entire AZ by blackholing all traffic to that zone's CIDRs, stopping all instances in that AZ, or removing them from load balancer target groups. This is the experiment that reveals whether your "multi-AZ deployment" is actually active-active or just active-passive with undiscovered split-brain conditions.

Dependency Timeouts (Systemic)

Inject high latency (10s, 30s, 60s) on calls to external dependencies — not to cause failures, but to verify that every service has a configured timeout and that timeout expiry is handled gracefully. The most common real-world failure: a dependency hangs indefinitely, exhausting the caller's connection pool.
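Little's law gives a quick estimate of why a hanging dependency exhausts the caller's connection pool: average connections in use equals request rate multiplied by how long each request holds a connection. The numbers and function names below are illustrative assumptions, and the model ignores queueing and retries.

```python
from typing import Optional


def connections_in_use(requests_per_s: float, hold_time_s: float) -> float:
    """Little's law: average concurrency = arrival rate x time in system."""
    return requests_per_s * hold_time_s


def pool_exhausted(requests_per_s: float,
                   dependency_latency_s: float,
                   pool_size: int,
                   timeout_s: Optional[float] = None) -> bool:
    """True when steady-state demand exceeds the pool.

    A configured timeout caps how long any request can hold a connection.
    """
    hold = dependency_latency_s
    if timeout_s is not None:
        hold = min(dependency_latency_s, timeout_s)
    return connections_in_use(requests_per_s, hold) > pool_size
```

At 20 requests per second, a 50 ms dependency keeps about one connection busy. Inject 30 s of latency with no timeout and demand rises to 600 connections, far beyond a pool of 50; a 2 s timeout caps it at 40.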

Bash — Graceful Shutdown Verification

# Test that in-flight requests complete before pod terminates
# Run load against the service while deleting the pod

# Terminal 1: generate continuous load
hey -z 120s -c 10 -q 50 https://payment-api.internal/v1/checkout

# Terminal 2: delete a specific pod while load is running
POD=$(kubectl get pods -n production -l app=payment-api -o name | head -1)
kubectl delete $POD -n production

# EXPECTED RESULT:
# - Zero HTTP 5xx errors during pod deletion
# - pod terminates after preStop hook drains connections
# - hey summary shows 0% error rate

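# Verify terminationGracePeriodSeconds is sufficient (read from the Deployment,
# since $POD has already been deleted at this point)
kubectl get deployment payment-api -n production \
  -o jsonpath='{.spec.template.spec.terminationGracePeriodSeconds}'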

# Verify preStop hook is configured
kubectl get deployment payment-api -n production \
  -o jsonpath='{.spec.template.spec.containers[0].lifecycle}' | jq .
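
The "zero HTTP 5xx" expectation can be checked mechanically rather than by eye. The sketch below parses a saved load-test report and computes the 5xx rate. It assumes status lines of the form "[200] 59940 responses", so adjust the pattern to whatever your load tool prints. The sample report text is made up for illustration.

```python
import re

# Assumed line format: "[<status>] <count> responses"
STATUS_LINE = re.compile(r"\[(\d{3})\]\s+(\d+)\s+responses")


def server_error_rate(report_text):
    """Return the fraction of responses with a 5xx status code."""
    counts = {int(code): int(n) for code, n in STATUS_LINE.findall(report_text)}
    total = sum(counts.values())
    if total == 0:
        raise ValueError("no status-code lines found in report")
    errors = sum(n for code, n in counts.items() if 500 <= code <= 599)
    return errors / total


# Illustrative report, not real measurement data.
sample_report = """Status code distribution:
  [200]\t59940 responses
  [503]\t60 responses
"""

rate = server_error_rate(sample_report)
print(f"5xx rate: {rate:.3%}")
print("PASS" if rate == 0 else "FAIL: requests were dropped during shutdown")
```

Connection-level failures that never produce a status code are not counted by this sketch and need a separate check.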

GameDay Runbook

// RUNNING LIVE CHAOS EVENTS

A GameDay is a structured event where a team runs chaos experiments together, testing both technical resilience and the human/process response — alert quality, runbook accuracy, oncall ergonomics, and communication flows. GameDays are the chaos engineering equivalent of a fire drill.

GameDay Runbook — Full Template

━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
GAMEDAY: Payment Service Resilience   Date: 2024-Q3
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
Facilitator:     @sre-lead
Observers:       @payment-team, @platform-team
War Room:        #gameday-2024-q3 (Slack)
Dashboard:       https://datadog.company.com/dashboard/gameday-q3
Rollback Owner:  @oncall-engineer (has kill-switch access)
Duration:        3 hours (14:00 - 17:00 UTC)

PRE-GAME CHECKLIST (30 minutes before)
  ☐ All experiment scripts prepared and tested in staging
  ☐ Dashboards verified streaming — no data gaps
  ☐ PagerDuty configured: test alerts suppressed, real alerts active
  ☐ Rollback procedures documented and reviewed with team
  ☐ Stakeholder notification sent (support, leadership aware)
  ☐ Baseline metrics captured (last 1h avg for all key metrics)
  ☐ Kill-switch access confirmed: @oncall has gremlin halt + kubectl access
  ☐ External dependency owners on standby (DB team, infra team)

━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
SCENARIO 1 — 14:00 UTC: Redis Cache Unavailability (20 min)
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
  Hypothesis: "Cache miss fallback serves all requests from DB.
               Success rate > 99.5%. Latency p99 < 500ms."
  Fault:      Gremlin blackhole on port 6379 — 1 of 3 payment-api pods
  Duration:   10 min fault + 10 min recovery observation
  Abort:      success_rate < 99.0% for 30s OR any P1 fires
  Observer:   Watch payment.success_rate, payment.latency.p99

SCENARIO 2 — 14:30 UTC: Downstream Timeout Injection (20 min)
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
  Hypothesis: "5s timeout on fraud-service calls triggers circuit
               breaker. Payments default to ALLOW. Zero failures."
  Fault:      Toxiproxy latency 6000ms on fraud-service egress
  Duration:   8 min fault + 10 min CB state recovery observation
  Abort:      Payment error rate > 0.5% for 60s

SCENARIO 3 — 15:00 UTC: Pod Kill Cascade (30 min)
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
  Hypothesis: "Sequential termination of 2 of 3 payment-api pods
               maintains service availability. k8s schedules
               replacements within 60s. p99 < 400ms throughout."
  Fault:      kubectl delete 1 pod, wait 45s, delete 1 more
  Duration:   20 min (including pod scheduling observation)
  Abort:      No Ready payment-api pod remains OR any P1 fires. Note: a PDB
              gates evictions (kubectl drain), not a direct kubectl delete

POST-GAME PROCESS (17:00 UTC)
  1. IMMEDIATE (same day):
     ☐ Document all findings in #gameday-findings
     ☐ Confirm all faults removed (no lingering tc rules, toxics, etc.)
     ☐ Verify all metrics returned to pre-game baseline
     ☐ Create tickets for all identified weaknesses (P1 within 24h)

  2. WITHIN 1 WEEK:
     ☐ Post-mortem document published
     ☐ Remediation tickets assigned with owners
     ☐ New chaos experiments created from discovered weaknesses
     ☐ Next GameDay date set (target: quarterly cadence)
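
Abort conditions such as "success_rate < 99.0% for 30s" are easier to enforce when evaluated by code rather than by someone watching a dashboard. A minimal sketch of that sustained-breach check follows. The function name and sample data are hypothetical, and wiring the result to a halt command is left out.

```python
def sustained_breach(samples, threshold, window_s):
    """Return True if the metric stayed below `threshold` for at least
    `window_s` consecutive seconds.

    samples: list of (unix_timestamp, value) tuples, sorted by time.
    """
    breach_start = None
    for ts, value in samples:
        if value < threshold:
            if breach_start is None:
                breach_start = ts
            if ts - breach_start >= window_s:
                return True
        else:
            # Metric recovered, so the breach timer resets.
            breach_start = None
    return False


# Illustrative samples at 10-second intervals (success rate in percent).
brief_dip = [(0, 99.6), (10, 98.9), (20, 98.8), (30, 99.7)]
long_breach = [(0, 99.6), (10, 98.9), (20, 98.8), (30, 98.7), (40, 98.5)]

print(sustained_breach(brief_dip, threshold=99.0, window_s=30))    # False
print(sustained_breach(long_breach, threshold=99.0, window_s=30))  # True
```

Requiring a sustained breach keeps a single noisy data point from aborting the scenario, while still bounding how long real user impact can last.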

Observability for Chaos

// METRICS · LOGS · TRACES · EVENTS

Chaos Engineering without observability is just breaking things. You need sufficient signal before you start — not just to detect problems, but to validate the hypothesis. The experiment is only as good as your ability to measure its impact.

Signal Type	What to Capture	Tooling	Purpose
Steady-State Metrics	Request rate, error rate, latency (p50/p95/p99), saturation (CPU/mem/connections), SLO burn rate	Prometheus, Datadog, CloudWatch	Hypothesis validation — did the metric cross the threshold?
Experiment Events	Fault start/stop timestamps, target identifiers, experiment ID — injected as annotations	Datadog Events API, Grafana Annotations	Correlate metric changes to exact fault injection moment
Distributed Traces	Request traces from client through all services — show where latency or errors originate	Jaeger, Zipkin, Datadog APM, OTEL	Identify which hop in the call chain is affected by the fault
Kubernetes Events	Pod scheduling, OOM kills, evictions, autoscaler decisions, PDB violations	`kubectl get events -n production --sort-by='.lastTimestamp'`	Verify cluster response to pod failures — scheduling time, restart counts
Structured Logs	Circuit breaker state changes, fallback activations, retry counts, timeout events	ELK, Datadog Log Analytics, CloudWatch Insights	Confirm the intended failure mode triggered the right fallback path
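
SLO burn rate, listed among the steady-state metrics above, is the observed error rate divided by the error budget (1 minus the SLO target). A burn rate of 1.0 spends the budget exactly over the SLO window, and higher values spend it proportionally faster. A small sketch, with made-up request counts:

```python
def burn_rate(errors, total, slo_target):
    """Observed error rate divided by the error budget allowed by the SLO."""
    if total <= 0:
        raise ValueError("total must be positive")
    if not 0 < slo_target < 1:
        raise ValueError("slo_target must be a fraction between 0 and 1")
    error_budget = 1.0 - slo_target
    return (errors / total) / error_budget


# Illustrative numbers: 50 failed requests out of 10,000 against a 99.9% SLO.
rate = burn_rate(errors=50, total=10_000, slo_target=0.999)
print(f"burn rate during fault window: {rate:.1f}x")
```

Comparing the burn rate in the fault window against the baseline captured before the experiment shows how much budget the experiment itself consumed.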

Annotating Experiments in Dashboards

Python — Auto-annotate Datadog during experiment

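import os
import time
from datadog import initialize, api

# Read credentials from the environment: "$DD_API_KEY" is not expanded inside a Python string
initialize(api_key=os.environ["DD_API_KEY"], app_key=os.environ["DD_APP_KEY"])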

EXPERIMENT_ID = "CHE-0042"
START_TIME = int(time.time())

# Mark experiment START on all dashboards
api.Event.create(
    title=f"[CHAOS] {EXPERIMENT_ID} — START",
    text="Injecting: Redis blackhole on payment-api (1 of 3 pods). Hypothesis: success_rate > 99.8%",
    tags=["chaos", "env:production", "service:payment-api", f"experiment:{EXPERIMENT_ID}"],
    alert_type="warning",
    source_type_name="chaos-engineering"
)

# — RUN EXPERIMENT —
time.sleep(120)  # experiment duration

# Mark experiment END
api.Event.create(
    title=f"[CHAOS] {EXPERIMENT_ID} — COMPLETE",
    text="Fault injection complete. Entering recovery observation.",
    tags=["chaos", "env:production", f"experiment:{EXPERIMENT_ID}"],
    alert_type="success"
)

# Query metrics during experiment window
results = api.Metric.query(
    start=START_TIME,
    end=int(time.time()),
    query=f"avg:payment.success_rate{{env:production}}.rollup(min, 10)"
)

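# Skip empty buckets: the pointlist can contain None for intervals with no data
points = [p[1] for p in results['series'][0]['pointlist'] if p[1] is not None]
min_success_rate = min(points)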
print(f"Min success rate during experiment: {min_success_rate:.3%}")
print(f"Hypothesis: {'PASS ✓' if min_success_rate >= 0.998 else 'FAIL ✗'}")

Chaos Maturity Model

// WHERE ARE YOU TODAY

The Chaos Engineering Maturity Model measures how systematically an organization applies chaos engineering — from ad-hoc manual experiments to continuous automated resilience verification across the full production stack.

Stage 01

Ad-hoc

Manual, one-off experiments
No formal hypothesis
Dev/staging only
No tracking or reporting
Hero-driven (one SRE does it)
No runbooks or rollback

Stage 02

Defined

Hypothesis-driven experiments
Documented templates exist
Quarterly GameDays
Blast radius documented
Findings tracked as tickets
Basic observability in place

Stage 03

Managed

Platform in place (Gremlin/Litmus)
Regular prod experiments
Reliability score tracked
CI/CD integration (some tests)
Cross-functional ownership across teams
Post-mortems include chaos tasks

Stage 04

Optimized

Continuous automated experiments
Every incident → new experiment
Full MITRE-style coverage map
Dynamic, risk-based targeting
Chaos as a release gate
Org-wide resilience SLOs

✓

Start with Stage 2, not Stage 4: Most teams over-invest in tooling before building the cultural practices — hypothesis documentation, gameday cadence, findings tracking. A quarterly GameDay with a spreadsheet and kubectl delete pod delivers more value than a fully instrumented platform that nobody runs experiments on. Nail the scientific discipline first, then automate it.
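
The "spreadsheet" version of hypothesis documentation and findings tracking can be as small as one record per experiment written out as CSV. The sketch below shows one possible shape. The field names and the experiment ID are assumptions, and the sample row reuses the Redis scenario from the GameDay template with a placeholder result.

```python
import csv
import io
from dataclasses import asdict, dataclass, fields


@dataclass
class ExperimentRecord:
    experiment_id: str
    hypothesis: str
    fault: str
    blast_radius: str
    abort_condition: str
    result: str = "PENDING"       # PENDING, PASS, FAIL, or ABORTED
    follow_up_ticket: str = ""    # filled in when a weakness is found


def to_csv(records):
    """Serialize records to CSV text that any spreadsheet tool can open."""
    buf = io.StringIO()
    writer = csv.DictWriter(buf, fieldnames=[f.name for f in fields(ExperimentRecord)])
    writer.writeheader()
    for record in records:
        writer.writerow(asdict(record))
    return buf.getvalue()


log = [
    ExperimentRecord(
        experiment_id="GD-S1",  # hypothetical ID scheme
        hypothesis="Cache miss fallback serves all requests from DB; success rate > 99.5%",
        fault="Blackhole port 6379",
        blast_radius="1 of 3 payment-api pods",
        abort_condition="success_rate < 99.0% for 30s",
    )
]
csv_text = to_csv(log)
print(csv_text)
```

Making the hypothesis and abort condition required fields means an experiment cannot be logged without them.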

Implementation Roadmap

// GETTING FROM ZERO TO CONTINUOUS CHAOS

12-Month Chaos Engineering Roadmap

━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
PHASE 0 — FOUNDATIONS (Month 1-2)
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
  ✦ Observability audit — do you have the signals to run experiments?
    (Prometheus metrics, distributed tracing, structured logs, dashboards)
  ✦ Incident post-mortem review — identify top 5 failure modes from history
  ✦ Select 1 service as pilot — ideally one with good coverage and a team
    willing to own the experiment program
  ✦ Run first experiment in staging: pod-delete on the pilot service
  ✦ Establish hypothesis template and experiment log (even a spreadsheet)
  ✦ Identify GameDay facilitator and oncall rotation integration

━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
PHASE 1 — CONTROLLED PRODUCTION (Month 3-5)
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
  ✦ Install chaos tool: LitmusChaos (Kubernetes) or Gremlin (multi-target)
  ✦ First production experiment: pod-delete on 1 pod of pilot service
  ✦ Run monthly GameDays with the pilot team — 2-3 scenarios each
  ✦ Document all weaknesses found; create remediation tickets
  ✦ Expand experiment catalog: network latency, dependency blackhole
  ✦ Build abort-condition automation: Prometheus alert → halt webhook
  ✦ Share findings with broader engineering org (build momentum)

━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
PHASE 2 — SCALE ACROSS SERVICES (Month 6-9)
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
  ✦ Expand to 5+ services — each team owns their experiment backlog
  ✦ Chaos experiments in sprint planning: each team adds 2 new experiments
    per quarter
  ✦ Post-mortem → experiment policy: every P1/P2 incident gets a follow-up
    chaos experiment that reproduces the failure mode
  ✦ AZ failure simulation: test active-active multi-AZ claim for top services
  ✦ Database failover: test RDS/CloudSQL promotion for all stateful services
  ✦ Resource attack catalog: CPU/memory/disk on all services at least once
  ✦ Quarterly cross-service GameDay: invite leadership to observe

━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
PHASE 3 — CONTINUOUS AUTOMATION (Month 10-12)
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
  ✦ Integrate lightweight chaos experiments into CI/CD pipelines
    (pod-delete, latency injection run automatically on staging deploys)
  ✦ Schedule continuous low-blast-radius experiments in production
    (e.g., random pod-delete during business hours, every weekday)
  ✦ Reliability score / coverage heatmap: track coverage against a
    MITRE ATT&CK-style catalog of fault types
  ✦ Resilience SLO: "All Tier-1 services must pass pod-delete experiment"
  ✦ Self-service chaos: teams run their own experiments via portal/CLI
  ✦ Chaos-as-code: experiments in Git, reviewed in PRs, run in pipelines
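
The reliability score and coverage heatmap from Phase 3 can start as a simple service-by-fault matrix built from the experiment log. The sketch below is one possible scoring approach, not a standard formula. The service names, fault types, and pass data are illustrative.

```python
def coverage_matrix(services, fault_types, passed):
    """Build a service x fault matrix and an overall coverage score.

    passed: set of (service, fault_type) pairs with a passing experiment.
    """
    matrix = {
        svc: {fault: (svc, fault) in passed for fault in fault_types}
        for svc in services
    }
    total = len(services) * len(fault_types)
    covered = sum(cell for row in matrix.values() for cell in row.values())
    score = covered / total if total else 0.0
    return matrix, score


# Illustrative inputs.
services = ["payment-api", "fraud-service"]
fault_types = ["pod-delete", "network-latency", "dependency-blackhole", "cpu-stress"]
passed = {
    ("payment-api", "pod-delete"),
    ("payment-api", "network-latency"),
    ("payment-api", "dependency-blackhole"),
    ("fraud-service", "pod-delete"),
}

matrix, score = coverage_matrix(services, fault_types, passed)
for svc, row in matrix.items():
    cells = " ".join("x" if ok else "." for ok in row.values())
    print(f"{svc:15s} {cells}")
print(f"coverage: {score:.0%}")
```

The empty cells double as the experiment backlog: each one is a service and fault combination that has not yet been proven.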

Common Failure Modes Avoid

Running experiments without observability in place first
Starting in production before staging validation
No abort criteria — experiment runs until it causes an incident
Fixing the weakness and not re-running the experiment to verify
Siloed — one SRE runs chaos, teams don't own experiments
Focusing on tooling before hypothesis discipline
Running chaos during major product launches or freeze windows

Success Factors Must-Have

Observability precedes every experiment — measure first
Stage experiments: sandbox → staging → canary → production
Always have a human kill-switch and abort criteria defined
Every weakness found is remediated and then re-tested
Teams own their chaos backlogs — chaos is a team responsibility
Post-mortems generate experiments — close the feedback loop
Executive visibility through reliability score dashboards

ℹ

Reference Material: Principles of Chaos Engineering (principlesofchaos.org) · Netflix Tech Blog: Chaos Engineering (the original) · Google SRE Book: Chapter 17 (Testing for Reliability) · AWS Well-Architected: Reliability Pillar · LitmusChaos Docs (litmuschaos.io) · Gremlin Chaos Engineering Handbook (gremlin.com/community) · CNCF Chaos Engineering TAG Working Group