The SRE Manifesto
Site Reliability Engineering Practice
Chaos Engineering
| Practice code | Practice area(s) | Practice name | Practice description | Practice applicability | Practice technology(ies) | Implementation steps |
|---|---|---|---|---|---|---|
| AUT101 | [x] Automation; [x] Systems Thinking | Chaos Engineering | Experiment on a system in order to build confidence in its capability to withstand turbulent conditions in production | Distributed systems, microservices architectures, cloud platforms | Chaos Monkey, Gremlin, LitmusChaos, AWS Fault Injection Simulator | 1. Define a measurable steady-state hypothesis; 2. Identify real-world events to inject (e.g. server crash); 3. Run automated experiments in staging, then in production, keeping the blast radius small; 4. Compare the results against the hypothesis and fix the weaknesses they expose. |
Source: Principles of Chaos Engineering
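
The implementation steps in the table can be sketched as a small, self-contained simulation. This is only an illustration under stated assumptions: `ToyService`, `success_rate`, and `run_experiment` are hypothetical names invented here, the "fault" is a flag flipped on an in-memory replica, and the 0.99 threshold is arbitrary. It does not use the API of Chaos Monkey, Gremlin, LitmusChaos, or AWS Fault Injection Simulator.

```python
from dataclasses import dataclass, field


@dataclass
class ToyService:
    """Hypothetical stand-in for a replicated service (not a real API)."""
    replicas: dict = field(
        default_factory=lambda: {"a": True, "b": True, "c": True}
    )

    def handle(self, request_id: int) -> bool:
        """Route round-robin and retry once on the next replica."""
        names = list(self.replicas)
        first = names[request_id % len(names)]
        second = names[(request_id + 1) % len(names)]
        return self.replicas[first] or self.replicas[second]


def success_rate(service: ToyService, requests: int = 300) -> float:
    """Fraction of simulated requests that succeed."""
    ok = sum(service.handle(i) for i in range(requests))
    return ok / requests


def run_experiment(service: ToyService, victims: list, threshold: float = 0.99) -> dict:
    """Run one experiment: verify steady state, inject a fault, measure, roll back."""
    # Step 1: steady-state hypothesis (success rate stays at or above threshold).
    baseline = success_rate(service)
    if baseline < threshold:
        raise RuntimeError("system is not in steady state; aborting experiment")

    # Steps 2-3: inject the real-world event (here, crashed replicas).
    for name in victims:
        service.replicas[name] = False
    try:
        during = success_rate(service)
    finally:
        # Always roll back so the fault does not outlive the experiment.
        for name in victims:
            service.replicas[name] = True

    # Step 4: compare the observation against the hypothesis.
    return {
        "baseline": baseline,
        "during": during,
        "hypothesis_holds": during >= threshold,
    }


if __name__ == "__main__":
    print(run_experiment(ToyService(), ["b"]))       # one replica down
    print(run_experiment(ToyService(), ["a", "b"]))  # two replicas down
```

In this toy model a single retry hides the loss of one replica, so the hypothesis holds, while losing two adjacent replicas breaks it. A failed hypothesis of this kind is the finding that step 4 would feed back into resilience work.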