Chaos Engineering: Planning Your First Game Day
After getting my GECEC certification, I spent two weeks planning our team's first game day. A game day is a structured chaos engineering session where you deliberately inject failures into your system and observe how it responds. Here's how to plan one without causing an actual outage.
Pre-game day (the critical part)
Define your steady state. What does "normal" look like? For us, that meant API response times under 200ms, an error rate below 0.1%, and all health checks passing. If you can't measure steady state, you're not ready for game day.
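A steady-state definition like this is easy to turn into an automated check. Here is a minimal Python sketch, assuming you can pull a snapshot of those three numbers from your monitoring system. The `Snapshot` type, its field names, and `steady_state_violations` are hypothetical names made up for illustration. The 200ms target above doesn't name a latency percentile, so the sketch leaves that choice to whatever feeds `latency_ms`.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Snapshot:
    """Hypothetical point-in-time reading from your monitoring system."""
    latency_ms: float           # API response time (which percentile is your choice)
    error_rate: float           # fraction of failed requests, 0.001 == 0.1%
    failing_health_checks: int  # number of health checks currently failing


def steady_state_violations(s: Snapshot) -> List[str]:
    """Return a readable list of steady-state violations (empty if healthy)."""
    violations = []
    if s.latency_ms >= 200:
        violations.append(f"latency {s.latency_ms}ms is not under 200ms")
    if s.error_rate >= 0.001:
        violations.append(f"error rate {s.error_rate:.2%} is not below 0.1%")
    if s.failing_health_checks > 0:
        violations.append(f"{s.failing_health_checks} health check(s) failing")
    return violations
```

If this returns anything other than an empty list before the experiment starts, the system is not in steady state and the game day should wait.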
Choose your experiment. Start boring. Our first experiment was adding 100ms of network latency to our primary database connection. Not killing the database — just slowing it down. Small experiments reveal big insights.
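If you don't have a fault-injection tool yet, you can rehearse the same idea at the application level. The sketch below is a hypothetical Python decorator, `with_latency`, that delays every call to the wrapped function by a fixed amount. It is an application-level stand-in, not the network-level injection described above, and `run_query` is a made-up name for whatever function talks to your database.

```python
import functools
import time
from typing import Callable


def with_latency(delay_s: float, sleep: Callable[[float], None] = time.sleep):
    """Decorator factory: wait delay_s seconds before each call to the wrapped function."""
    def decorator(fn):
        @functools.wraps(fn)
        def wrapper(*args, **kwargs):
            sleep(delay_s)  # the injected fault: added latency only, no errors
            return fn(*args, **kwargs)
        return wrapper
    return decorator


@with_latency(0.100)  # 100ms, matching the experiment described above
def run_query(sql: str) -> str:
    # Stand-in for a real database call (hypothetical).
    return f"result of {sql}"
```

The `sleep` parameter exists so the delay can be swapped out in tests. In a real experiment you would more likely inject the latency outside the application, so that the code under test is unmodified.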
Set abort conditions. "If error rate exceeds 5% or any P0 alerts fire, we stop immediately." Write this down. Agree on it. Make sure everyone knows who has the kill switch.
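Writing the abort conditions down as code removes any ambiguity about what "exceeds" means. This is a minimal sketch of the rule quoted above. The function name `should_abort` and its parameters are hypothetical, and where the two inputs come from depends on your alerting setup.

```python
ERROR_RATE_LIMIT = 0.05  # 5%, from the agreed abort conditions


def should_abort(error_rate: float, p0_alerts_firing: int) -> bool:
    """Return True when either agreed abort condition is met.

    error_rate is a fraction (0.05 == 5%); p0_alerts_firing is a count.
    """
    return error_rate > ERROR_RATE_LIMIT or p0_alerts_firing > 0
```

Note that an error rate of exactly 5% does not trigger an abort here, because the rule says "exceeds". If your team reads it differently, change the comparison and write that down too.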
Inform stakeholders. Your customer support team needs to know. Your on-call team needs to know. Your leadership needs to know. Surprise chaos experiments are just outages with extra steps.
During game day
Run the experiment for a fixed window. We did 15 minutes. That's enough to observe impact without causing prolonged degradation.
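A fixed window is easier to respect when a script enforces it. The sketch below is one way to do that in Python, assuming you already have callables that start the fault, remove it, and evaluate your abort conditions. `run_window` and all of its parameter names are hypothetical, and the 5-second polling interval is an arbitrary default, not something we measured.

```python
import time
from typing import Callable


def run_window(
    inject: Callable[[], None],
    rollback: Callable[[], None],
    should_abort: Callable[[], bool],
    window_s: float = 15 * 60,   # 15 minutes, as in our game day
    poll_s: float = 5.0,         # assumed polling interval
    clock: Callable[[], float] = time.monotonic,
    sleep: Callable[[float], None] = time.sleep,
) -> str:
    """Inject a fault, hold it for at most window_s seconds, then roll back.

    Returns "aborted" if should_abort() fires during the window,
    otherwise "completed". rollback() runs in both cases.
    """
    inject()
    deadline = clock() + window_s
    try:
        while clock() < deadline:
            if should_abort():
                return "aborted"
            sleep(poll_s)
        return "completed"
    finally:
        # Always remove the fault, even on abort or an unexpected exception.
        rollback()
```

Putting the rollback in a `finally` block means the fault is removed even if the polling code itself crashes, which is the failure you least want during a chaos experiment.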
Observe, don't fix. The point is to learn how your system behaves, not to frantically fix things in real time. Document what breaks, what degrades, and what holds up fine.
Take notes in real time. Assign someone to be the scribe. Memory is unreliable under stress.
What we learned
Our circuit breakers didn't trip at 100ms of added latency because they were configured for 500ms timeouts. Our retry logic created a cascade that amplified the latency. And our monitoring dashboard didn't show the issue for 4 minutes because our metrics had a 5-minute aggregation window.
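The dashboard delay is worth a closer look, because averaging over a long window dilutes a new problem with old healthy data. The Python sketch below shows one way that can play out, assuming per-minute samples and a simple mean over the window. The baseline, degraded, and threshold values in the usage example are invented for illustration and are not our real numbers; `minutes_until_visible` is a hypothetical helper.

```python
from typing import Optional


def minutes_until_visible(
    baseline_ms: float,
    degraded_ms: float,
    threshold_ms: float,
    window_min: int,
) -> Optional[int]:
    """Minutes of degradation before a windowed mean exceeds threshold_ms.

    Assumes one sample per minute and a window that was entirely at
    baseline_ms before the degradation began. Returns None if even a
    fully degraded window stays at or below the threshold.
    """
    for k in range(1, window_min + 1):
        # k degraded samples have displaced k baseline samples in the window.
        mean = (baseline_ms * (window_min - k) + degraded_ms * k) / window_min
        if mean > threshold_ms:
            return k
    return None


# Illustrative numbers only: 100ms baseline, 200ms degraded, 175ms threshold.
print(minutes_until_visible(100, 200, 175, window_min=5))  # 4
print(minutes_until_visible(100, 200, 175, window_min=1))  # 1
```

With these made-up inputs, a 5-minute window hides the problem for 4 minutes while a 1-minute window shows it after 1. The exact mechanism on your dashboard may differ, but the direction of the effect is the same: shorter windows surface a change sooner, at the cost of noisier graphs.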
One afternoon of controlled failure taught us more about our system than six months of normal operation.