V1
DOC-CHE-2024
ACTIVE
Resilience Engineering · Field Handbook

Chaos Engineering

// "If it hurts, do it more often — in a controlled way, before your users do it for you."

The practitioner's guide to intentional failure injection. Covers Gremlin, LitmusChaos, network fault experiments, resource exhaustion, gameday planning, and a maturity model for building provably resilient systems.

NIST SP 800-160
Gremlin · LitmusChaos
SRE Aligned
CNCF Landscape
01

Overview & Philosophy

// WHAT IS CHAOS ENGINEERING

Chaos Engineering is the discipline of experimenting on a system to build confidence in its ability to withstand turbulent conditions in production. Popularized by Netflix's Simian Army, formalized in the 2016 "Principles of Chaos Engineering" manifesto, and now a CNCF project ecosystem, it is the engineering practice of proactively injecting controlled failure to discover weaknesses before they manifest as incidents.

The key word is experiment. Chaos Engineering is not random destruction. It is a scientific process: define steady state, propose a hypothesis, inject failure, observe, and learn. Every experiment should answer a specific question about system behavior under adverse conditions.

Break It First

Proactively cause failures in a controlled environment before your users cause them accidentally — at 3 AM, during peak traffic, with no runbook.

Learn From It

Every experiment produces knowledge. Either your hypothesis is confirmed (system is resilient), or you've discovered a real weakness — before it costs you uptime.

Harden It

Weaknesses become hardening tasks. Fix the failure mode. Re-run the experiment. Repeat until the system is provably resilient — not just presumably resilient.

🔥
Chaos ≠ Random: Breaking things randomly is sabotage. Chaos Engineering is a structured scientific method with defined scope, observability pre-requisites, rollback procedures, and explicit hypotheses. If you can't measure the impact, you're not doing chaos engineering — you're just breaking things.
02

Core Principles

// THE FIVE PRINCIPLES OF CHAOS ENGINEERING

The Principles of Chaos Engineering define five foundational rules, evolved from Netflix's decade of production experimentation.

PRINCIPLE-01
Build a Hypothesis Around Steady State

Focus on the measurable output of the system — not internal attributes. Steady state is any metric you can observe that represents normal operation: request throughput, error rate, p99 latency, orders per minute. Your hypothesis must be quantitative: "Steady state is maintained when error rate remains <0.1% and p99 latency remains <250ms."
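A minimal sketch of how the quantitative hypothesis above might be encoded as a check. The `SteadyState` class and its field names are illustrative assumptions; only the two thresholds come from the example hypothesis.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SteadyState:
    """Thresholds from the example hypothesis (hypothetical container)."""
    max_error_rate: float = 0.001      # error rate must stay below 0.1%
    max_p99_latency_ms: float = 250.0  # p99 latency must stay below 250 ms

    def holds(self, error_rate: float, p99_latency_ms: float) -> bool:
        """Return True when both observed metrics are inside the thresholds."""
        return (error_rate < self.max_error_rate
                and p99_latency_ms < self.max_p99_latency_ms)


steady = SteadyState()
print(steady.holds(error_rate=0.0004, p99_latency_ms=180.0))  # True
print(steady.holds(error_rate=0.0004, p99_latency_ms=310.0))  # False
```

In practice the two inputs would come from your metrics backend, sampled over the same window used for the baseline.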

PRINCIPLE-02
Vary Real-World Events

Inject events that reflect real failure modes: hardware failures, network partitions, dependency timeouts, traffic surges, malicious inputs. Don't invent fictional failure modes — model experiments on post-mortems and known failure patterns from your incident history and from industry reports (AWS us-east-1 outages, BGP leaks, memory leak patterns).

PRINCIPLE-03
Run Experiments in Production

Staging environments don't reflect production traffic patterns, data shapes, or dependency graphs. Experiments in staging teach you about staging. The goal is production confidence — which requires production experiments, with appropriate blast radius controls. Start with canary cohorts (e.g., 1% of hosts) and expand only after validation.

PRINCIPLE-04
Automate Experiments to Run Continuously

Systems change constantly — new deployments, dependency version bumps, config changes. A resilience property confirmed last month may not hold today. Integrate experiments into your CI/CD pipeline and run them on a schedule. A passing chaos experiment is a green build signal, same as a passing unit test.

PRINCIPLE-05
Minimize Blast Radius

The cost of discovery must always be less than the cost of a production incident. Scope every experiment precisely — limit target population (1 pod, 1 region, 1% of users), set a maximum duration, pre-configure automatic rollback, and have a human kill-switch always accessible. Expand scope only when smaller tests confirm safety.
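One way to enforce "limit target population" in tooling is to cap the selection in code rather than by convention. The sketch below is an assumption-laden illustration: `pick_targets` and its parameters are hypothetical names, not part of any chaos platform.

```python
import math
import random


def pick_targets(candidates, fraction, max_targets, seed=None):
    """Pick a bounded random subset of candidates for fault injection.

    fraction    -- share of the population to target (0 < fraction <= 1)
    max_targets -- hard cap that applies regardless of fraction
    """
    if not candidates:
        return []
    if not 0 < fraction <= 1:
        raise ValueError("fraction must be in (0, 1]")
    # Target at least one instance, but never more than the cap.
    count = max(1, math.floor(len(candidates) * fraction))
    count = min(count, max_targets, len(candidates))
    return random.Random(seed).sample(list(candidates), count)


hosts = [f"host-{i:03d}" for i in range(200)]
print(pick_targets(hosts, fraction=0.01, max_targets=5, seed=42))
```

With 200 hosts and a 1% fraction this selects 2 hosts; the cap keeps a typo such as `fraction=1.0` from turning a canary into a full outage.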

03

Why Chaos Engineering

// THE CASE FOR INTENTIONAL FAILURE

Complex distributed systems have failure modes that cannot be reasoned about from first principles or detected by unit tests. The number of possible state combinations in a microservices architecture with 50+ services exceeds what any human or test suite can enumerate. The only way to know how the system behaves is to observe it under stress.
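To make the scale concrete, here is a rough worked example. It assumes the simplest possible model, where each of 50 services is either healthy or failed, and a hypothetical test rate of 1,000 combinations per second.

```python
services = 50

# Simplifying assumption: each service is either healthy or failed.
combinations = 2 ** services
print(f"{combinations:,}")  # 1,125,899,906,842,624

# Hypothetical rate of 1,000 tested combinations per second.
seconds_per_year = 365 * 24 * 3600
years = combinations / 1000 / seconds_per_year
print(round(years))  # 35702
```

Real systems have far more than two states per service (degraded, slow, partially partitioned), so the true space is larger still.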

✕ Without Chaos Engineering
  • Unknown failure modes discovered during incidents
  • Resilience assumptions based on hope, not evidence
  • Dependency failures cause cascading outages
  • Runbooks untested — fail during the actual incident
  • Engineers fear production; deploys are risky rituals
  • MTTR measured in hours, not minutes
  • "We thought we had circuit breakers, but..."
✓ With Chaos Engineering
  • Failure modes discovered and remediated proactively
  • Resilience properties are measured, not assumed
  • Dependency failures are isolated behind tested circuit breakers
  • Runbooks executed under experiment conditions — battle-tested
  • Engineers build confidence through regular controlled failures
  • MTTR driven down through practiced recovery
  • Every incident produces a new chaos experiment
Netflix origin: Netflix's Chaos Monkey (2010) was created specifically because their AWS migration eliminated physical hardware failures — which had previously forced resilience. They needed to simulate the failure modes they could no longer rely on infrastructure to provide. The Simian Army grew from there. Today, Netflix runs thousands of chaos experiments per year automatically.
04

The Scientific Method

// EXPERIMENT LIFECYCLE

Every chaos experiment follows a structured scientific method. Deviating from this — especially skipping observability setup or rollback planning — converts a controlled experiment into a production incident.

Step 01
Define Steady State
Baseline metrics, SLO thresholds
Step 02
Hypothesize
Predict system behavior under fault
Step 03
Design
Scope, targets, duration, abort criteria
Step 04
Execute
Inject fault, observe, monitor abort signals
Step 05
Analyze
Compare actual vs expected, document findings
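The five steps can be sketched as a small control loop. This is an illustration under stated assumptions, not a real orchestrator: all four callables are hypothetical hooks supplied by the caller, and a real runner would sleep between checks.

```python
def run_experiment(steady_state_ok, inject_fault, remove_fault, should_abort,
                   checks=10):
    """Run one experiment and return "pass", "fail", or "aborted"."""
    # Steps 01-02: never inject into a system that is already unhealthy.
    if not steady_state_ok():
        return "aborted"
    inject_fault()                     # Step 04: execute
    try:
        for _ in range(checks):
            if should_abort():         # abort criteria defined in Step 03
                return "aborted"
            if not steady_state_ok():  # hypothesis disproved
                return "fail"
        return "pass"
    finally:
        remove_fault()                 # rollback runs on every exit path


log = []
verdict = run_experiment(
    steady_state_ok=lambda: True,
    inject_fault=lambda: log.append("inject"),
    remove_fault=lambda: log.append("rollback"),
    should_abort=lambda: False,
    checks=3,
)
print(verdict, log)  # pass ['inject', 'rollback']
```

Putting the rollback in `finally` means the fault is removed whether the experiment passes, fails, aborts, or raises.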
Experiment Template — Structured Definition
Experiment: CHE-0042
Title:      "Payment service degrades gracefully when database replica fails"
Author:     platform-reliability@company.com
Date:       2024-08-15
Status:     APPROVED

STEADY STATE DEFINITION
  Metric:    payments.success_rate
  Baseline:  99.95% over 5 min rolling window (measured 1h pre-experiment)
  Threshold: Remains ≥ 99.80% throughout experiment
  Source:    Datadog dashboard: pay-svc-resilience

HYPOTHESIS
  "When the payments-db replica is terminated, the service automatically fails over to the primary within 30s, and success rate remains above the SLO threshold of 99.80% with no customer-visible errors beyond the 30s failover window."

SCOPE & TARGETING
  Environment: Production (us-east-1 only)
  Target:      1 of 3 RDS read replicas (payments-db-replica-2)
  User Impact: ~0% (replica handles read-only analytics queries only)
  Duration:    10 minutes active fault + 10 minutes recovery observation

ABORT CRITERIA (kill switch triggers)
  - payments.success_rate drops below 99.0% for more than 60s
  - Any P1 page fires during the experiment
  - Primary DB connections exceed 80% of max_connections

ROLLBACK PROCEDURE
  1. Read traffic fails over to the primary (automatic, ~30s)
  2. Manual: aws rds reboot-db-instance --db-instance-id payments-db-replica-2
  3. Notify: #platform-oncall in Slack immediately on abort

OBSERVABILITY CHECKLIST (must be green before starting)
  ☑ Datadog dashboard open and streaming
  ☑ PagerDuty muted for: payments-db-replica-health
  ☑ Oncall engineer watching: @sre-oncall
  ☑ Rollback command ready: terminal open, command staged
05

Hypothesis Design

// WRITING TESTABLE FAILURE HYPOTHESES

A well-formed chaos hypothesis has three components: a fault condition (what you inject), a predicted behavior (what the system should do), and a measurable outcome (how you verify the prediction was correct). Vague hypotheses produce uninterpretable results.

Quality | Example Hypothesis | Verdict
Weak | "The system will handle a database failure." | REJECT: no measurable outcome, no specific fault
Weak | "If we kill a pod, the service stays up." | WEAK: binary outcome, no quantitative threshold
Good | "When 1 of 3 api-gateway pods is terminated, Kubernetes schedules a replacement within 45s, and error rate remains <0.5% throughout." | ACCEPT: specific fault, quantitative threshold, time-bound
Good | "When 100ms of latency is added to all calls from payment-service to fraud-service, the circuit breaker opens within 10s and payments fall back to the allow-by-default path with 0 payment failures." | ACCEPT: fault, expected mechanism, measurable outcome
Excellent | "Under 200% of peak CPU load on 50% of cart-service instances, autoscaling provisions additional capacity within 90s, p99 checkout latency remains <800ms, and 0 shopping carts are lost." | EXCELLENT: scoped, multiple metrics, mechanism + outcome

The Hypothesis Formula

Hypothesis Structure
## Template
"When [FAULT CONDITION] is introduced to [SPECIFIC TARGET],"
"[SYSTEM MECHANISM] will respond within [TIME BOUND],"
"and [METRIC] will remain [COMPARATOR] [THRESHOLD]."

## Fault Condition examples:
"50% packet loss is injected on all egress from payment-service"
"the primary Postgres node is stopped"
"CPU is pinned at 90% on 2 of 5 API server instances"
"the Redis cache cluster is made unavailable for 2 minutes"

## System Mechanism examples:
"circuit breaker opens and serves cached responses"
"HPA scales from 3 to 6 replicas"
"read traffic fails over from replica to primary"
"retry logic with exponential backoff absorbs transient errors"

## Measurable Outcomes (tie to SLOs):
"error rate < 0.1%"
"p99 latency < 500ms"
"0 data loss events in transaction log"
"checkout completion rate > 98%"
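The formula can also be enforced in code so that incomplete hypotheses are rejected before review. A minimal sketch, assuming a hypothetical `Hypothesis` record whose fields mirror the template slots:

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Hypothesis:
    fault: str       # what is injected
    target: str      # where it is injected
    mechanism: str   # how the system is expected to respond
    time_bound: str  # how quickly the mechanism must act
    outcome: str     # metric, comparator, and threshold

    def render(self) -> str:
        """Return the hypothesis sentence, or raise if any slot is empty."""
        for name, value in vars(self).items():
            if not value.strip():
                raise ValueError(f"hypothesis is missing: {name}")
        return (f"When {self.fault} is introduced to {self.target}, "
                f"{self.mechanism} will respond within {self.time_bound}, "
                f"and {self.outcome}.")


h = Hypothesis(
    fault="50% packet loss",
    target="all egress from payment-service",
    mechanism="retry logic with exponential backoff",
    time_bound="10s",
    outcome="error rate will remain < 0.1%",
)
print(h.render())
```

A check like this catches the "Weak" rows in the table above mechanically: a hypothesis with no time bound or no measurable outcome fails to render.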
06

Blast Radius Control

// THE MOST IMPORTANT CONCEPT IN CHAOS ENGINEERING

Blast radius is the scope of potential impact of a chaos experiment. Minimizing blast radius is not timidity — it's engineering discipline. You learn the most from the smallest experiment that reveals information about system behavior. Expand only after smaller experiments confirm safety.

Dev (SANDBOX) → Staging (SAFE) → Prod Canary (SCOPED) → Prod Partial (CONTROLLED) → Prod Full (HIGH CONFIDENCE ONLY)
Blast Radius Limiting Techniques Enforce
  • Target selectors: scope to labels, namespaces, instance IDs — never "all"
  • Percentage-based: attack 1 of N, not all N simultaneously
  • Canary cohort: route a fraction of traffic to the experiment population
  • Time boxing: hard duration limits with automatic rollback at T+N
  • Feature flags: isolate experiment to users with a chaos-test flag enabled
  • Off-peak timing: run destructive experiments during low-traffic windows
Abort Criteria — Kill Switches Required
  • SLO breach: halt if error budget is being consumed at 2× normal rate
  • Alert fires: any P1/P2 page halts the experiment immediately
  • Manual abort: single command or button to stop all fault injection
  • Time limit: auto-halt if experiment exceeds N minutes regardless of state
  • Dependency health: halt if unrelated downstream services show degradation
  • Customer signal: support ticket spike or social media monitoring trigger
💥
The Three Unbreakable Rules: (1) Always have observability in place before the experiment starts — if you can't measure it, you can't control it. (2) Always have an abort button that anyone on the team can press in under 10 seconds. (3) Never run destructive experiments without an oncall engineer actively watching dashboards. Automation is not a substitute for human oversight during active fault injection.
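The kill-switch criteria can be evaluated by one small function that the experiment runner polls. This is a sketch only: `AbortInputs`, `abort_reason`, and the default limits are hypothetical, and the values would be fed from your alerting and metrics systems.

```python
from dataclasses import dataclass


@dataclass
class AbortInputs:
    error_budget_burn_rate: float  # 1.0 = consuming budget at the normal rate
    open_pages: int                # count of active P1/P2 pages
    elapsed_s: float               # seconds since fault injection began
    manual_abort: bool = False     # set by the human kill switch


def abort_reason(inputs, max_burn_rate=2.0, max_duration_s=600):
    """Return the first tripped abort criterion, or None to continue."""
    if inputs.manual_abort:
        return "manual abort"
    if inputs.open_pages > 0:
        return "alert fired"
    if inputs.error_budget_burn_rate >= max_burn_rate:
        return "SLO breach"
    if inputs.elapsed_s >= max_duration_s:
        return "time limit"
    return None


print(abort_reason(AbortInputs(0.8, 0, 120)))  # None
print(abort_reason(AbortInputs(2.5, 0, 120)))  # SLO breach
```

The manual switch is checked first so that a human decision always wins over automated signals.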
07

Gremlin

// COMMERCIAL CHAOS PLATFORM — ENTERPRISE GRADE

Gremlin is the leading commercial chaos engineering platform. It provides a broad attack library (resource, network, state, host), target selectors, scheduled experiments, scenario composition, reliability score reporting, and an always-on kill switch. It is infrastructure-agnostic — bare metal, VMs, containers, Kubernetes, and cloud-managed services.

Gremlin Attack Categories
  • Resource: CPU, memory, disk fill, I/O — exhaust resources to expose capacity limits
  • Network: latency, packet loss, packet corruption, bandwidth throttle, DNS failure, blackhole
  • State: process killer, time travel, power outage simulation, container kill
  • Application: HTTP error injection, exception injection, custom scripts
Gremlin Scenarios

Scenarios chain multiple attacks into multi-step experiments — e.g., "CPU spike → latency → pod kill" — testing how the system responds to compound failures. Scenarios can be scheduled, run in parallel, and include dependency checks before execution.

Gremlin CLI — Common Operations

Bash — Gremlin CLI
# Install Gremlin agent (RHEL/CentOS; Debian/Ubuntu hosts use Gremlin's apt repository instead)
curl https://rpm.gremlin.com/gremlin.repo -o /etc/yum.repos.d/gremlin.repo
yum install -y gremlin gremlind

# Authenticate
gremlin init --team-id "$GREMLIN_TEAM_ID" --secret "$GREMLIN_SECRET"

# NOTE: subcommand and flag spellings differ between agent versions;
# confirm each one against the CLI's built-in help before running.

# ── RESOURCE ATTACKS ──────────────────────────────────
# CPU attack: spike to 90% on 4 cores for 120s
gremlin attack-cpu \
  --length 120 \
  --cores 4 \
  --percent 90 \
  --target-labels "env=production,service=payment-api"

# Memory attack: consume 2GB for 60s
gremlin attack-memory \
  --length 60 \
  --mb 2048 \
  --target-count 1 \
  --target-labels "k8s.namespace=payments"

# Disk fill: fill /tmp to 90% capacity for 30s
gremlin attack-disk \
  --length 30 \
  --volume-path /tmp \
  --percent 90 \
  --block-size 4 \
  --target-count 1

# ── NETWORK ATTACKS ───────────────────────────────────
# Latency: add 200ms to egress traffic on port 5432 (PostgreSQL)
gremlin attack-latency \
  --length 120 \
  --delay 200 \
  --egress-ports 5432 \
  --target-labels "service=order-service"

# Packet loss: 30% loss affecting Redis traffic
gremlin attack-packet-loss \
  --length 60 \
  --percent 30 \
  --egress-ports 6379 \
  --target-count 1 \
  --target-labels "env=production"

# Blackhole: block ALL traffic to a specific IP (simulates an unreachable host)
gremlin attack-blackhole \
  --length 120 \
  --ip-addresses 10.0.2.45 \
  --target-labels "tier=api"

# ── STATE ATTACKS ─────────────────────────────────────
# Kill a process by name
gremlin attack-process \
  --length 1 \
  --process nginx \
  --target-count 1 \
  --target-labels "service=api-gateway"

# Shutdown: halt the host after the configured delay, then reboot it
gremlin attack-shutdown \
  --delay 5 \
  --reboot true \
  --target-count 1 \
  --target-labels "service=api-gateway"

Gremlin Scenarios via API

JSON — Gremlin Scenario Definition
{
  "name": "Payment Service Dependency Failure",
  "description": "Simulate Redis cache unavailability and verify fallback path",
  "hypothesis": "Payment success rate stays above 99.5% with Redis unavailable",
  "steps": [
    {
      "type": "delay",
      "delay": 30,
      "description": "Establish steady-state baseline (30s observation before fault)"
    },
    {
      "type": "attack",
      "attack": {
        "type": "blackhole",
        "args": { "length": 120, "egress_ports": [6379], "protocol": "TCP" }
      },
      "target": {
        "labels": { "service": "payment-api", "env": "production" },
        "count": 1
      },
      "description": "Block Redis egress (port 6379) on 1 payment-api pod only"
    },
    {
      "type": "delay",
      "delay": 120,
      "description": "Recovery observation period"
    }
  ],
  "hypothesis_checks": [
    { "metric": "payment.success_rate", "threshold": 0.995, "comparison": "gte", "source": "datadog" }
  ],
  "halt_conditions": [
    { "metric": "payment.success_rate", "threshold": 0.99, "comparison": "lt" },
    { "metric": "pagerduty.active_p1_count", "threshold": 0, "comparison": "gt" }
  ]
}
Gremlin Reliability Score: Gremlin produces a Reliability Score across your infrastructure by analyzing experiment results and system configurations. It tracks which services have been tested, which vulnerabilities have been confirmed and fixed, and provides a coverage heatmap across your MITRE-aligned failure taxonomy. Use it to track program maturity over time.
08

LitmusChaos

// CNCF INCUBATING PROJECT · KUBERNETES-NATIVE

LitmusChaos is a CNCF incubating open-source chaos engineering framework for Kubernetes. It provides a rich library of ChaosExperiments (200+), a ChaosCenter UI for experiment management, workflow orchestration via Argo Workflows, and deep Kubernetes integration using CRDs. It is a common choice for cloud-native Kubernetes environments.

Architecture
  • ChaosEngine: CR linking a target app to an experiment
  • ChaosExperiment: CR defining the fault type and parameters
  • ChaosResult: CR storing experiment results and verdict
  • Chaos Runner: Job that orchestrates experiment execution
  • ChaosCenter: Web UI for workflows and reporting
Experiment Hub
  • Pod faults: pod-delete, pod-cpu-hog, pod-memory-hog, pod-io-stress, pod-network-latency, pod-network-loss, pod-network-corruption
  • Node faults: node-cpu-hog, node-memory-hog, node-drain, node-taint, node-restart
  • Cloud: ec2-stop, ebs-loss, rds-instance-stop, azure-instance-stop
Steady-State Probes
  • HTTP probe: validate endpoint returns expected status code
  • Command probe: run a shell command, check exit code / output
  • Prometheus probe: query PromQL expression, validate value range
  • Kubernetes probe: check pod/service/node status in the cluster

LitmusChaos — ChaosEngine Manifest

YAML — Pod Delete Experiment
# Install LitmusChaos
# kubectl apply -f https://litmuschaos.github.io/litmus/litmus-operator-v3.x.x.yaml
apiVersion: litmuschaos.io/v1alpha1
kind: ChaosEngine
metadata:
  name: payment-pod-delete-chaos
  namespace: production
spec:
  # Target application
  appinfo:
    appns: production
    applabel: "app=payment-api"
    appkind: deployment
  # "active" runs the experiment; patch to "stop" to abort it
  engineState: "active"
  annotationCheck: "true"          # Only target annotated pods (opt-in)
  # Steady-state hypothesis probes
  experiments:
    - name: pod-delete
      spec:
        probe:
          - name: "check-payment-api-availability"
            type: "httpProbe"
            httpProbe/inputs:
              url: "https://payment-api.internal/health"
              insecureSkipVerify: false
              method:
                get:
                  criteria: "=="
                  responseCode: "200"
            mode: "Continuous"     # probe runs throughout experiment
            runProperties:
              probeTimeout: 5
              interval: 2          # every 2s
              attempt: 3
              stopOnFailure: true  # ABORT if probe fails
          - name: "check-success-rate-prometheus"
            type: "promProbe"
            promProbe/inputs:
              endpoint: "http://prometheus:9090"
              query: "sum(rate(payment_requests_total{status='200'}[2m])) / sum(rate(payment_requests_total[2m]))"
              comparator:
                criteria: ">="
                value: "0.998"
            mode: "EOT"            # check at End Of Test
            runProperties:
              probeTimeout: 10
              interval: 5
              attempt: 3
        components:
          env:
            - name: TOTAL_CHAOS_DURATION
              value: "120"         # 120s experiment duration
            - name: CHAOS_INTERVAL
              value: "20"          # delete a pod every 20s
            - name: PODS_AFFECTED_PERC
              value: "33"          # affect ~1/3 of matching pods
            - name: FORCE
              value: "false"       # graceful termination (SIGTERM)
            - name: TARGET_PODS
              value: ""            # empty = random selection

LitmusChaos — Network Latency Experiment

YAML — Pod Network Latency
apiVersion: litmuschaos.io/v1alpha1
kind: ChaosEngine
metadata:
  name: order-svc-latency-chaos
  namespace: production
spec:
  appinfo:
    appns: production
    applabel: "app=order-service"
    appkind: deployment
  engineState: "active"
  experiments:
    - name: pod-network-latency
      spec:
        probe:
          - name: "checkout-latency-check"
            type: "promProbe"
            promProbe/inputs:
              endpoint: "http://prometheus:9090"
              # p99 checkout latency must remain under 800ms
              query: "histogram_quantile(0.99, sum(rate(checkout_duration_seconds_bucket[2m])) by (le))"
              comparator:
                criteria: "<="
                value: "0.8"
            mode: "Continuous"
            runProperties:
              probeTimeout: 10
              interval: 10
              attempt: 5
              stopOnFailure: true
        components:
          env:
            - name: TOTAL_CHAOS_DURATION
              value: "120"
            - name: NETWORK_LATENCY
              value: "200"         # 200ms added latency (ms)
            - name: JITTER
              value: "50"          # ±50ms jitter
            - name: PODS_AFFECTED_PERC
              value: "50"          # 50% of matching pods
            - name: DESTINATION_IPS
              value: ""            # empty = all egress
            - name: DESTINATION_HOSTS
              value: "fraud-service.production.svc.cluster.local"
            - name: CONTAINER_RUNTIME
              value: "containerd"
            - name: SOCKET_PATH
              value: "/run/containerd/containerd.sock"
Bash — Query ChaosResult after experiment
# Check experiment verdict
kubectl get chaosresult -n production

# Detailed result including probe status
kubectl describe chaosresult payment-pod-delete-chaos-pod-delete -n production

# Get verdict in script-friendly form
VERDICT=$(kubectl get chaosresult payment-pod-delete-chaos-pod-delete \
  -n production -o jsonpath='{.status.experimentStatus.verdict}')

if [ "$VERDICT" = "Pass" ]; then
  echo "✓ Experiment PASSED: system behaved as hypothesized"
elif [ "$VERDICT" = "Fail" ]; then
  echo "✗ Experiment FAILED: weakness detected, creating remediation ticket"
  # CI/CD integration: fail the pipeline
  exit 1
else
  # Any other verdict (for example Awaited or Stopped) is inconclusive
  echo "? Experiment verdict inconclusive: $VERDICT"
  exit 2
fi
09

Other Tools

// THE CHAOS ENGINEERING ECOSYSTEM
Tool | Type | Best For | Model
Chaos Monkey | Instance termination | AWS EC2 random termination, the original. Production-only. Tests auto-scaling and self-healing infrastructure. | OSS
Chaos Toolkit | Multi-target CLI | Open-source, extensible, driver-based. AWS, GCP, Azure, Kubernetes, Prometheus integrations. Best for scripted, composable experiments via JSON/YAML. | OSS
AWS Fault Injection Service (FIS) | AWS-native | Managed chaos for EC2, ECS, EKS, RDS, DynamoDB. Native AWS integration, IAM-controlled. Ideal for AWS-centric stacks. | MANAGED
Azure Chaos Studio | Azure-native | Azure PaaS/IaaS fault injection: VM, AKS, CosmosDB, App Service, network, disk. Integrated with Azure Monitor for abort conditions. | MANAGED
tc (Linux Traffic Control) | Network fault primitive | Kernel-level network emulation. The underlying mechanism most tools use. Useful for custom network experiments without additional tooling. | OSS
stress-ng | Resource stress | CPU, memory, I/O, and OS resource stressors. Foundational tool for manual resource exhaustion testing before adopting a full chaos platform. | OSS
ChaosBlade | Multi-layer | Host, container, Kubernetes, and JVM fault injection. Strong Java process-level chaos (method-level delays/exceptions). CNCF sandbox project. | OSS

AWS Fault Injection Service — Quick Example

JSON — AWS FIS Experiment Template
{
  "description": "Inject 80% CPU stress for 2 minutes on exactly 1 production ECS task",
  "targets": {
    "PaymentECSTasks": {
      "resourceType": "aws:ecs:task",
      "resourceTags": { "service": "payment-api", "env": "production" },
      "selectionMode": "COUNT(1)"
    }
  },
  "actions": {
    "InjectCPUStress": {
      "actionId": "aws:ecs:task-cpu-stress",
      "parameters": { "duration": "PT2M", "percent": "80", "workers": "0" },
      "targets": { "Tasks": "PaymentECSTasks" }
    }
  },
  "stopConditions": [
    {
      "source": "aws:cloudwatch:alarm",
      "value": "arn:aws:cloudwatch:us-east-1:123456789012:alarm:payment-api-error-rate-high"
    }
  ],
  "roleArn": "arn:aws:iam::123456789012:role/FISExperimentRole",
  "tags": { "purpose": "chaos-experiment", "experiment-id": "CHE-0042" }
}

Notes: duration is ISO 8601 (PT2M = 2 minutes); workers "0" uses all available cores; the stop condition takes a CloudWatch alarm ARN and halts the experiment if that alarm fires. The account ID and region shown are placeholders.
10

Resource Attacks

// CPU · MEMORY · DISK · I/O

Resource attacks simulate the conditions of runaway processes, memory leaks, disk-filling log floods, and I/O saturation. These are among the most common real-world failure modes — every system should be tested against them before they happen accidentally.

CPU Exhaustion Common
  • What it tests: CPU throttling behavior, request queue growth, timeout propagation, HPA scaling trigger time
  • What good looks like: autoscaler provisions new pods within SLO window; graceful degradation of low-priority work; no cascading failures to downstream services
  • Tool: stress-ng --cpu 4 --cpu-load 90 --timeout 120s
  • Gremlin: CPU attack on 1 pod at 90% for 120s, observe HPA events
Memory Exhaustion Important
  • What it tests: OOM killer behavior, container restart policy, memory limits vs requests, pod eviction priority
  • What good looks like: OOM kills the specific container; k8s restarts within 30s; service remains available via other replicas; alerting fires
  • Tool: stress-ng --vm 1 --vm-bytes 4G --timeout 60s
  • Tip: Test both the hit-limit-and-OOM path and the slow-leak path (gradual consumption over hours)
Disk Fill Critical
  • What it tests: log rotation effectiveness, ephemeral storage limits, disk-full error handling in databases
  • What good looks like: service degrades gracefully (not crashes); alerts fire before 100%; log rotation kicks in; recovery after space is freed
  • Tool: fallocate -l 10G /tmp/chaos-fill (cleanup: rm /tmp/chaos-fill)
I/O Saturation Often Missed
  • What it tests: disk I/O scheduling under load, read vs write contention, database buffer pool behavior under slow disk
  • What it reveals: applications that don't set I/O timeouts, synchronous I/O blocking async event loops, database lock contention amplification
  • Tool: stress-ng --io 4 --hdd 2 --timeout 120s
11

Network Chaos

// LATENCY · LOSS · PARTITION · CORRUPTION

Network chaos is often the highest-value category of experiments for distributed systems. Network faults are among the most common and most variable failures in cloud infrastructure. Microservice architectures have hundreds of network paths, and every one of them needs timeouts, retries, and circuit breakers that actually work under adverse conditions.

Fault Type | Models | What It Validates | Command (tc / iptables)
Latency Injection | Fixed delay, jitter, distribution | Timeout configs are set; circuit breakers open at correct threshold; p99 stays within SLO | tc qdisc add dev eth0 root netem delay 200ms 50ms
Packet Loss | % random loss, burst loss (Gilbert-Elliott model) | TCP retransmit behavior; application-level retry logic; error rate under partial loss | tc qdisc add dev eth0 root netem loss 10%
Packet Corruption | Random bit errors | Protocol-level error detection; TLS integrity; application checksum validation | tc qdisc add dev eth0 root netem corrupt 5%
Packet Duplication | % duplicate packets | Idempotency of request handlers; deduplication logic; double-charge prevention | tc qdisc add dev eth0 root netem duplicate 5%
Bandwidth Throttle | Rate limiting to N Mbit/s | Large payload transfers (uploads, exports); streaming behavior under constrained bandwidth | tc qdisc add dev eth0 root tbf rate 1mbit burst 32kbit latency 400ms
Network Partition | Complete blackhole to specific host/port | Circuit breaker opens; fallback activates; database failover works; split-brain handling | iptables -A OUTPUT -d 10.0.1.5 -j DROP
DNS Failure | NXDOMAIN, SERVFAIL, timeout | DNS caching behavior; connection pooling reuse; service discovery failure path | Block UDP port 53 via iptables or corrupt /etc/resolv.conf
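When netem faults are scripted, it helps to build the command and its cleanup together so a fault is never left behind. The sketch below only constructs command strings; `netem_add` and `netem_clear` are hypothetical helper names, and running the output requires root on the target host.

```python
def netem_add(dev, delay_ms=0, jitter_ms=0, loss_pct=0.0):
    """Build a `tc ... netem` command line; it is returned, not executed."""
    if delay_ms < 0 or jitter_ms < 0 or not 0 <= loss_pct <= 100:
        raise ValueError("parameters out of range")
    if jitter_ms and not delay_ms:
        raise ValueError("jitter requires a base delay")
    if not delay_ms and not loss_pct:
        raise ValueError("no fault requested")
    parts = ["tc", "qdisc", "add", "dev", dev, "root", "netem"]
    if delay_ms:
        parts += ["delay", f"{delay_ms}ms"]
        if jitter_ms:
            parts.append(f"{jitter_ms}ms")
    if loss_pct:
        parts += ["loss", f"{loss_pct:g}%"]
    return " ".join(parts)


def netem_clear(dev):
    """Build the cleanup command that removes the root qdisc."""
    return f"tc qdisc del dev {dev} root"


print(netem_add("eth0", delay_ms=200, jitter_ms=50))
# tc qdisc add dev eth0 root netem delay 200ms 50ms
print(netem_clear("eth0"))
# tc qdisc del dev eth0 root
```

Pairing every `add` with a scheduled `del` is the command-line equivalent of time boxing.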

Validating Circuit Breakers

Experiment — Circuit Breaker Validation
Experiment: CHE-0021
Hypothesis: "When every call to fraud-service fails for 30s, the circuit breaker in payment-api opens within 10s, and payments fall back to allow-by-default, maintaining payment success rate > 99.5%."

SETUP: Toxiproxy works at the TCP level and cannot synthesize HTTP 503 responses, so the outage is modeled by closing every connection before any data is returned.

# Deploy Toxiproxy as a sidecar or standalone fault proxy
# 8474 = Toxiproxy API port, 9000 = proxied port
docker run -d --name toxiproxy \
  -p 8474:8474 \
  -p 9000:9000 \
  ghcr.io/shopify/toxiproxy

# Create proxy: payment-api → fraud-service via toxiproxy
curl -X POST http://localhost:8474/proxies \
  -H "Content-Type: application/json" \
  -d '{"name": "fraud-svc", "listen": "0.0.0.0:9000", "upstream": "fraud-service:8080"}'

# INJECT: close each connection after 0 bytes (simulates service crash)
curl -X POST http://localhost:8474/proxies/fraud-svc/toxics \
  -H "Content-Type: application/json" \
  -d '{ "name": "fraud-svc-down", "type": "limit_data", "attributes": {"bytes": 0} }'

# OBSERVE: Watch circuit breaker state in payment-api metrics
# Look for: resilience4j_circuitbreaker_state{name="fraud-service"} == 2 (OPEN)
watch -n 2 'curl -s http://payment-api:8080/actuator/health | jq .'

# CLEANUP: Remove toxic (restore normal behavior)
curl -X DELETE http://localhost:8474/proxies/fraud-svc/toxics/fraud-svc-down
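For readers unfamiliar with the mechanism under test, here is a deliberately minimal circuit breaker. It is a teaching sketch, not resilience4j: the class, its thresholds, and the consecutive-failure rule are assumptions chosen for brevity.

```python
import time


class CircuitBreaker:
    """Opens after N consecutive failures; allows one trial call after a cooldown."""

    def __init__(self, failure_threshold=5, reset_timeout_s=30.0,
                 clock=time.monotonic):
        self.failure_threshold = failure_threshold
        self.reset_timeout_s = reset_timeout_s
        self.clock = clock
        self.failures = 0
        self.opened_at = None  # None means the breaker is closed

    def call(self, fn, fallback):
        if self.opened_at is not None:
            if self.clock() - self.opened_at < self.reset_timeout_s:
                return fallback()  # open: skip the dependency entirely
            # Cooldown elapsed (half-open): let one trial call through below.
        try:
            result = fn()
        except Exception:
            self.failures += 1
            if (self.opened_at is not None
                    or self.failures >= self.failure_threshold):
                self.opened_at = self.clock()  # open, or re-open after a failed trial
            return fallback()
        self.failures = 0
        self.opened_at = None  # success closes the breaker
        return result


def fraud_check():
    raise ConnectionError("fraud-service unreachable")


breaker = CircuitBreaker(failure_threshold=3)
for _ in range(5):
    print(breaker.call(fraud_check, fallback=lambda: "allow-by-default"))
```

Every call prints `allow-by-default`; after the third failure the breaker is open and the last two calls never reach `fraud_check`, which is the behavior the experiment is meant to confirm.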
12

State & Process Chaos

// POD KILLS · PROCESS CRASHES · CLOCK SKEW · ZONE FAILURE
Pod / Process Termination High Value

The most basic chaos experiment. Kill one or more pods and verify the service remains available. Tests Kubernetes scheduling, readiness probes, pod disruption budgets, and graceful shutdown behavior (SIGTERM → drain → SIGKILL).

kubectl delete pod payment-api-7d9f4b-xk2zr -n production

Key checks: downtime during pod death, replacement scheduling time, connection draining (no in-flight 500s during graceful shutdown), and PDB enforcement. Note that a PDB limits voluntary evictions (for example kubectl drain); a direct kubectl delete bypasses it.

Time Travel / Clock Skew Subtle

Shift the system clock forward or backward on target nodes. Tests: JWT expiry validation (tokens suddenly "expired"), certificate validity window checks, distributed lock TTLs, time-series data ordering, cron job timing. Often reveals hidden assumptions about clock synchronization.

date -s "+2 hours" — Gremlin time-travel attack: ±1h to ±30d
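The JWT case is easy to reason about with a small validity-window check. This sketch uses plain Unix timestamps rather than a JWT library; `token_is_valid` and its `leeway_s` parameter are hypothetical names for illustration.

```python
def token_is_valid(exp, nbf, now, leeway_s=0):
    """Check a token's validity window against the local clock.

    exp, nbf, now -- Unix timestamps in seconds (expiry, not-before, current time)
    leeway_s      -- tolerated clock skew in seconds
    """
    return (nbf - leeway_s) <= now < (exp + leeway_s)


issued = 1_700_000_000
exp = issued + 900  # 15-minute token

print(token_is_valid(exp, issued, issued + 60))               # True: clocks agree
print(token_is_valid(exp, issued, issued + 60 + 7200))        # False: verifier clock is +2h
print(token_is_valid(exp, issued, issued - 30, leeway_s=60))  # True: small skew tolerated
```

A clock-skew experiment tells you whether your services behave like the second line (hard failure) and whether any leeway they apply matches the skew your fleet actually experiences.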
Availability Zone Failure Critical

Simulate the loss of an entire AZ by blackholing all traffic to that zone's CIDRs, stopping all instances in that AZ, or removing them from load balancer target groups. This is the experiment that reveals whether your "multi-AZ deployment" is actually active-active or just active-passive with undiscovered split-brain conditions.

Dependency Timeouts Systemic

Inject high latency (10s, 30s, 60s) on calls to external dependencies — not to cause failures, but to verify that every service has a configured timeout and that timeout expiry is handled gracefully. The most common real-world failure: a dependency hangs indefinitely, exhausting the caller's connection pool.
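What "a configured timeout, handled gracefully" means at the call site can be sketched as follows. This is an illustration with hypothetical names; it bounds the caller's wait using a worker thread, which is only a stand-in for proper client-level timeouts.

```python
import concurrent.futures
import time


def call_with_timeout(fn, timeout_s, fallback):
    """Run fn in a worker thread; return fallback if it exceeds timeout_s.

    The worker thread is not cancelled and keeps running in the background,
    so real clients should also set socket-level timeouts.
    """
    executor = concurrent.futures.ThreadPoolExecutor(max_workers=1)
    future = executor.submit(fn)
    try:
        return future.result(timeout=timeout_s)
    except concurrent.futures.TimeoutError:
        return fallback
    finally:
        executor.shutdown(wait=False)  # do not block on the hung call


def slow_dependency():
    time.sleep(0.5)  # stands in for a dependency with injected latency
    return "fraud-check: ok"


print(call_with_timeout(slow_dependency, timeout_s=0.05,
                        fallback="allow-by-default"))  # allow-by-default
```

Note the caveat in the docstring: the hung call still occupies a thread until it returns, which is exactly the pool-exhaustion risk the experiment is designed to surface.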

Bash — Graceful Shutdown Verification
# Test that in-flight requests complete before pod terminates
# Run load against the service while deleting the pod

# Terminal 1: generate continuous load
hey -z 120s -c 10 -q 50 https://payment-api.internal/v1/checkout

# Terminal 2: delete a specific pod while load is running
POD=$(kubectl get pods -n production -l app=payment-api -o name | head -1)
kubectl delete "$POD" -n production

# EXPECTED RESULT:
# - Zero HTTP 5xx errors during pod deletion
# - pod terminates after preStop hook drains connections
# - hey summary shows 0% error rate

# Verify terminationGracePeriodSeconds is sufficient
# (read from the Deployment template, since the deleted pod no longer exists)
kubectl get deployment payment-api -n production \
  -o jsonpath='{.spec.template.spec.terminationGracePeriodSeconds}'

# Verify preStop hook is configured
kubectl get deployment payment-api -n production \
  -o jsonpath='{.spec.template.spec.containers[0].lifecycle}' | jq .
13

GameDay Runbook

// RUNNING LIVE CHAOS EVENTS

A GameDay is a structured event where a team runs chaos experiments together, testing both technical resilience and the human/process response — alert quality, runbook accuracy, oncall ergonomics, and communication flows. GameDays are the chaos engineering equivalent of a fire drill.

GameDay Runbook — Full Template
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
GAMEDAY: Payment Service Resilience          Date: 2024-Q3
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
Facilitator:     @sre-lead
Observers:       @payment-team, @platform-team
War Room:        #gameday-2024-q3 (Slack)
Dashboard:       https://datadog.company.com/dashboard/gameday-q3
Rollback Owner:  @oncall-engineer (has kill-switch access)
Duration:        3 hours (14:00 - 17:00 UTC)

PRE-GAME CHECKLIST (30 minutes before)
  ☐ All experiment scripts prepared and tested in staging
  ☐ Dashboards verified streaming, no data gaps
  ☐ PagerDuty configured: test alerts suppressed, real alerts active
  ☐ Rollback procedures documented and reviewed with team
  ☐ Stakeholder notification sent (support, leadership aware)
  ☐ Baseline metrics captured (last 1h avg for all key metrics)
  ☐ Kill-switch access confirmed: @oncall has gremlin halt + kubectl access
  ☐ External dependency owners on standby (DB team, infra team)

━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
SCENARIO 1 (14:00 UTC): Redis Cache Unavailability (20 min)
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
Hypothesis: "Cache miss fallback serves all requests from DB.
             Success rate > 99.5%. Latency p99 < 500ms."
Fault:      Gremlin blackhole on port 6379, 1 of 3 payment-api pods
Duration:   10 min fault + 10 min recovery observation
Abort:      success_rate < 99.0% for 30s OR any P1 fires
Observer:   Watch payment.success_rate, payment.latency.p99

━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
SCENARIO 2 (14:30 UTC): Downstream Timeout Injection (20 min)
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
Hypothesis: "5s timeout on fraud-service calls triggers circuit breaker.
             Payments default to ALLOW. Zero failures."
Fault:      Toxiproxy latency 6000ms on fraud-service egress
Duration:   8 min fault + 10 min CB state recovery observation
Abort:      Payment transaction error rate > 0.5% for 60s

━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
SCENARIO 3 (15:00 UTC): Pod Kill Cascade (30 min)
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
Hypothesis: "Sequential termination of 2 of 3 payment-api pods maintains
             service availability. k8s schedules replacements within 60s.
             p99 < 400ms throughout."
Fault:      kubectl delete 1 pod, wait 45s, delete 1 more
Duration:   20 min (including pod scheduling observation)
Abort:      PDB violation (a PDB only guards the Eviction API, e.g. kubectl drain;
            a direct kubectl delete bypasses it, so do not delete a 3rd pod)

POST-GAME PROCESS (17:00 UTC)
1. IMMEDIATE (same day):
  ☐ Document all findings in #gameday-findings
  ☐ Confirm all faults removed (no lingering tc rules, toxics, etc.)
  ☐ Verify all metrics returned to pre-game baseline
  ☐ Create tickets for all identified weaknesses (P1 within 24h)
2. WITHIN 1 WEEK:
  ☐ Post-mortem document published
  ☐ Remediation tickets assigned with owners
  ☐ New chaos experiments created from discovered weaknesses
  ☐ Next GameDay date set (target: quarterly cadence)
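The abort lines in the runbook ("success_rate < 99.0% for 30s") describe a sustained breach, not a single bad sample. One way to evaluate that kind of rule is sketched below. It assumes samples arrive as (timestamp in seconds, value) pairs; the class name SustainedBreach is hypothetical and not part of Gremlin or any other tool.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class SustainedBreach:
    """Fires when a metric stays below `threshold` for `hold_seconds`."""

    threshold: float      # e.g. 0.990 for "success_rate < 99.0%"
    hold_seconds: float   # e.g. 30 for "for 30s"
    _breach_started: Optional[float] = field(default=None, init=False, repr=False)

    def observe(self, timestamp: float, value: float) -> bool:
        """Feed one sample; return True once the abort condition is met."""
        if value >= self.threshold:
            self._breach_started = None   # metric recovered, reset the clock
            return False
        if self._breach_started is None:
            self._breach_started = timestamp   # first sample of a new breach
        return timestamp - self._breach_started >= self.hold_seconds
```

A brief dip that recovers inside the hold period resets the clock, so the rollback owner is only asked to halt when the breach persists.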
14

Observability for Chaos

// METRICS · LOGS · TRACES · EVENTS

Chaos Engineering without observability is just breaking things. You need sufficient signal before you start — not just to detect problems, but to validate the hypothesis. The experiment is only as good as your ability to measure its impact.

Signal Type | What to Capture | Tooling | Purpose
Steady-State Metrics | Request rate, error rate, latency (p50/p95/p99), saturation (CPU/mem/connections), SLO burn rate | Prometheus, Datadog, CloudWatch | Hypothesis validation: did the metric cross the threshold?
Experiment Events | Fault start/stop timestamps, target identifiers, experiment ID, injected as annotations | Datadog Events API, Grafana Annotations | Correlate metric changes to exact fault injection moment
Distributed Traces | Request traces from client through all services, showing where latency or errors originate | Jaeger, Zipkin, Datadog APM, OTEL | Identify which hop in the call chain is affected by the fault
Kubernetes Events | Pod scheduling, OOM kills, evictions, autoscaler decisions, PDB violations | kubectl get events -n production --sort-by='.lastTimestamp' | Verify cluster response to pod failures: scheduling time, restart counts
Structured Logs | Circuit breaker state changes, fallback activations, retry counts, timeout events | ELK, Datadog Log Analytics, CloudWatch Insights | Confirm the intended failure mode triggered the right fallback path
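The table lists SLO burn rate among the steady-state metrics. Burn rate is commonly defined as the observed error rate divided by the error budget (1 minus the SLO target). A minimal sketch of that arithmetic follows; the function name and the request-count inputs are assumptions for illustration.

```python
def burn_rate(errors: int, total: int, slo_target: float) -> float:
    """Observed error rate divided by the error budget (1 - slo_target)."""
    if total <= 0:
        raise ValueError("total must be positive")
    if not 0.0 < slo_target < 1.0:
        raise ValueError("slo_target must be between 0 and 1 (exclusive)")
    error_budget = 1.0 - slo_target        # e.g. 0.001 for a 99.9% SLO
    return (errors / total) / error_budget
```

With a 99.9% target, 20 failed requests out of 10,000 give a burn rate of about 2: the budget is being spent twice as fast as the SLO allows. A short spike above 1 during fault injection is expected; what matters for the hypothesis is whether it returns to baseline after the fault is removed.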

Annotating Experiments in Dashboards

Python — Auto-annotate Datadog during experiment
import os
import time

from datadog import initialize, api

# Read credentials from the environment; Python does not expand "$DD_API_KEY" inside a string
initialize(api_key=os.environ["DD_API_KEY"], app_key=os.environ["DD_APP_KEY"])

EXPERIMENT_ID = "CHE-0042"
START_TIME = int(time.time())

# Mark experiment START on all dashboards
api.Event.create(
    title=f"[CHAOS] {EXPERIMENT_ID} - START",
    text="Injecting: Redis blackhole on payment-api (1 of 3 pods). Hypothesis: success_rate > 99.8%",
    tags=["chaos", "env:production", "service:payment-api", f"experiment:{EXPERIMENT_ID}"],
    alert_type="warning",
    source_type_name="chaos-engineering",
)

# RUN EXPERIMENT (fault injection is triggered here)
time.sleep(120)  # experiment duration

# Mark experiment END
api.Event.create(
    title=f"[CHAOS] {EXPERIMENT_ID} - COMPLETE",
    text="Fault injection complete. Entering recovery observation.",
    tags=["chaos", "env:production", f"experiment:{EXPERIMENT_ID}"],
    alert_type="success",
)

# Query metrics during experiment window
results = api.Metric.query(
    start=START_TIME,
    end=int(time.time()),
    query="avg:payment.success_rate{env:production}.rollup(min, 10)",
)

# Skip empty buckets (None) before taking the minimum.
# Assumes payment.success_rate is reported as a fraction between 0 and 1.
values = [point[1] for point in results["series"][0]["pointlist"] if point[1] is not None]
if not values:
    raise SystemExit("No datapoints in the experiment window; cannot evaluate hypothesis")
min_success_rate = min(values)
print(f"Min success rate during experiment: {min_success_rate:.3%}")
print(f"Hypothesis: {'PASS ✓' if min_success_rate >= 0.998 else 'FAIL ✗'}")
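If the experiment step in the script above raises, the closing event is never sent and the dashboard is left with an open-ended chaos marker. One possible way to guarantee an end marker is a context manager, sketched below. The emit callable is a hypothetical stand-in (it could wrap api.Event.create); the function name and the ABORTED title are assumptions, not Datadog conventions.

```python
from contextlib import contextmanager
from typing import Callable, Iterator


@contextmanager
def annotated_experiment(
    experiment_id: str,
    emit: Callable[[str, str], None],
) -> Iterator[None]:
    """Emit a START marker, then always emit an end marker, even on error.

    `emit(title, alert_type)` is supplied by the caller.
    """
    emit(f"[CHAOS] {experiment_id} START", "warning")
    try:
        yield
    except Exception:
        # The experiment body failed: record that before re-raising.
        emit(f"[CHAOS] {experiment_id} ABORTED", "error")
        raise
    else:
        emit(f"[CHAOS] {experiment_id} COMPLETE", "success")
```

The fault injection and observation then sit inside a `with annotated_experiment("CHE-0042", emit):` block, and every START on the dashboard has a matching COMPLETE or ABORTED.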
15

Chaos Maturity Model

// WHERE ARE YOU TODAY

The Chaos Engineering Maturity Model measures how systematically an organization applies chaos engineering — from ad-hoc manual experiments to continuous automated resilience verification across the full production stack.

Stage 01
Ad-hoc
  • Manual, one-off experiments
  • No formal hypothesis
  • Dev/staging only
  • No tracking or reporting
  • Hero-driven (one SRE does it)
  • No runbooks or rollback
Stage 02
Defined
  • Hypothesis-driven experiments
  • Documented templates exist
  • Quarterly GameDays
  • Blast radius documented
  • Findings tracked as tickets
  • Basic observability in place
Stage 03
Managed
  • Platform in place (Gremlin/Litmus)
  • Regular prod experiments
  • Reliability score tracked
  • CI/CD integration (some tests)
  • Cross-functional ownership across teams
  • Post-mortems include chaos tasks
Stage 04
Optimized
  • Continuous automated experiments
  • Every incident → new experiment
  • Full MITRE-style coverage map
  • Dynamic, risk-based targeting
  • Chaos as a release gate
  • Org-wide resilience SLOs
Start with Stage 2, not Stage 4: Most teams over-invest in tooling before building the cultural practices — hypothesis documentation, gameday cadence, findings tracking. A quarterly GameDay with a spreadsheet and kubectl delete pod delivers more value than a fully instrumented platform that nobody runs experiments on. Nail the scientific discipline first, then automate it.
16

Implementation Roadmap

// GETTING FROM ZERO TO CONTINUOUS CHAOS
12-Month Chaos Engineering Roadmap
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
PHASE 0: FOUNDATIONS (Month 1-2)
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
✦ Observability audit: do you have the signals to run experiments?
  (Prometheus metrics, distributed tracing, structured logs, dashboards)
✦ Incident post-mortem review: identify top 5 failure modes from history
✦ Select 1 service as pilot, ideally one with good coverage and a team
  willing to own the experiment program
✦ Run first experiment in staging: pod-delete on the pilot service
✦ Establish hypothesis template and experiment log (even a spreadsheet)
✦ Identify GameDay facilitator and oncall rotation integration

━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
PHASE 1: CONTROLLED PRODUCTION (Month 3-5)
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
✦ Install chaos tool: LitmusChaos (Kubernetes) or Gremlin (multi-target)
✦ First production experiment: pod-delete on 1 pod of pilot service
✦ Run monthly GameDays with the pilot team, 2-3 scenarios each
✦ Document all weaknesses found; create remediation tickets
✦ Expand experiment catalog: network latency, dependency blackhole
✦ Build abort-condition automation: Prometheus alert → halt webhook
✦ Share findings with broader engineering org (build momentum)

━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
PHASE 2: SCALE ACROSS SERVICES (Month 6-9)
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
✦ Expand to 5+ services; each team owns their experiment backlog
✦ Chaos experiment template in sprint planning: each quarter, 2 new experiments
✦ Post-mortem → experiment policy: every P1/P2 incident gets a follow-up
  chaos experiment that reproduces the failure mode
✦ AZ failure simulation: test active-active multi-AZ claim for top services
✦ Database failover: test RDS/CloudSQL promotion for all stateful services
✦ Resource attack catalog: CPU/memory/disk on all services at least once
✦ Quarterly cross-service GameDay: invite leadership to observe

━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
PHASE 3: CONTINUOUS AUTOMATION (Month 9-12)
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
✦ Integrate lightweight chaos experiments into CI/CD pipelines
  (pod-delete, latency injection run automatically on staging deploys)
✦ Schedule continuous low-blast-radius experiments in production
  (e.g., random pod-delete during business hours, every weekday)
✦ Reliability score / coverage heatmap: track coverage vs MITRE faults
✦ Resilience SLO: "All Tier-1 services must pass pod-delete experiment"
✦ Self-service chaos: teams run their own experiments via portal/CLI
✦ Chaos-as-code: experiments in Git, reviewed in PRs, run in pipelines
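Phase 1 lists "Prometheus alert → halt webhook" as the abort automation. The decision logic of such a webhook receiver could look like the sketch below. It assumes the Alertmanager webhook JSON shape (a top-level alerts list whose entries carry status and labels), a hypothetical chaos_abort="true" label on the alert rules that should stop an experiment, and a hypothetical halt callable that wraps whatever halt command your chaos tool provides. The HTTP server around it is omitted.

```python
import json
from typing import Callable, List


def firing_abort_alerts(payload: dict) -> List[str]:
    """Return names of firing alerts that carry the chaos_abort label."""
    names = []
    for alert in payload.get("alerts", []):
        labels = alert.get("labels", {})
        # chaos_abort is a hypothetical label set on abort-worthy alert rules
        if alert.get("status") == "firing" and labels.get("chaos_abort") == "true":
            names.append(labels.get("alertname", "unknown"))
    return names


def handle_webhook(body: bytes, halt: Callable[[List[str]], None]) -> bool:
    """Parse a webhook body and call halt() if any abort alert is firing."""
    triggered = firing_abort_alerts(json.loads(body))
    if triggered:
        halt(triggered)   # e.g. stop all running experiments, then page the facilitator
    return bool(triggered)
```

Scoping the halt to explicitly labelled alerts keeps unrelated noise from stopping experiments, while still removing the human from the critical path when an abort threshold is crossed.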
Common Failure Modes (Avoid)
  • Running experiments without observability in place first
  • Starting in production before staging validation
  • No abort criteria — experiment runs until it causes an incident
  • Fixing the weakness and not re-running the experiment to verify
  • Siloed — one SRE runs chaos, teams don't own experiments
  • Focusing on tooling before hypothesis discipline
  • Running chaos during major product launches or freeze windows
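The last item in the list above, running chaos during launches or freeze windows, can be guarded mechanically before any scheduled experiment starts. A minimal sketch, assuming freeze windows are supplied as (start, end) datetime pairs and that "business hours" means 09:00 to 17:00 on weekdays in the timezone of the timestamps passed in; both the function name and those hours are assumptions.

```python
from datetime import datetime, timezone
from typing import Iterable, Tuple

Window = Tuple[datetime, datetime]


def chaos_allowed(
    now: datetime,
    freeze_windows: Iterable[Window],
    start_hour: int = 9,
    end_hour: int = 17,
) -> bool:
    """True only on weekdays, inside working hours, outside freeze windows."""
    if now.weekday() >= 5:                       # 5 = Saturday, 6 = Sunday
        return False
    if not start_hour <= now.hour < end_hour:    # outside staffed hours
        return False
    # Each window is half-open: start inclusive, end exclusive.
    return not any(start <= now < end for start, end in freeze_windows)
```

A scheduler would call this check first and skip the run when it returns False, so an experiment never starts when nobody is staffed to abort it or while a release freeze is in effect.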
Success Factors (Must-Have)
  • Observability precedes every experiment — measure first
  • Stage experiments: sandbox → staging → canary → production
  • Always have a human kill-switch and abort criteria defined
  • Every weakness found is remediated and then re-tested
  • Teams own their chaos backlogs — chaos is a team responsibility
  • Post-mortems generate experiments — close the feedback loop
  • Executive visibility through reliability score dashboards
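The staging rule in the list above (sandbox → staging → canary → production) combined with "remediated and then re-tested" amounts to a promotion gate: an experiment may only run in an environment once it has passed in every earlier one. A small sketch of that gate, where the function name and the results mapping are hypothetical:

```python
from typing import Dict, Optional

STAGES = ["sandbox", "staging", "canary", "production"]


def next_allowed_stage(results: Dict[str, bool]) -> Optional[str]:
    """Return the earliest stage that has not yet passed, or None if all passed.

    `results` maps a stage name to whether the latest run there passed.
    A missing stage counts as not yet passed.
    """
    for stage in STAGES:
        if not results.get(stage, False):
            return stage
    return None
```

Because the check walks the stages in order, a failed re-run in staging pulls the experiment back to staging even if an older canary run had passed, which keeps the re-test step from being skipped.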
Reference Material: Principles of Chaos Engineering (principlesofchaos.org) · Netflix Tech Blog: Chaos Engineering (the original) · Google SRE Book: Chapter 17 (Testing for Reliability) · AWS Well-Architected: Reliability Pillar · LitmusChaos Docs (litmuschaos.io) · Gremlin Chaos Engineering Handbook (gremlin.com/community) · CNCF Chaos Engineering TAG Working Group