Getting Gremlin Certified: My Chaos Engineering Journey Begins
Last week I passed the Gremlin GECEC certification. A PM getting certified in chaos engineering probably seems unusual. But after watching two major incidents expose gaps that traditional testing missed, I decided I needed to understand resilience engineering properly — not just conceptually.
Why a PM cares about chaos engineering
Chaos engineering is the discipline of experimenting on a system to build confidence in its ability to withstand turbulent conditions. As a PM, I care because every outage is a customer impact event, and I'm the one explaining it to stakeholders.
Understanding chaos engineering lets me ask better questions: What happens when this service goes down? Have we tested our failover? What's our blast radius if database latency spikes?
What the certification taught me
Start small. The Gremlin approach emphasizes starting with the smallest possible experiment. Don't kill production servers on day one. Start by adding 50ms of latency to one service and observing what happens. You'll be surprised how much breaks.
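To make the "50ms of latency" idea concrete, here is a minimal sketch in plain Python. It is not Gremlin's API or how Gremlin injects latency (Gremlin works at the network level); it only illustrates the shape of a small experiment inside a single process. The names `with_added_latency`, `fetch_profile`, and `timed_call` are made up for illustration.

```python
import time
from functools import wraps


def with_added_latency(delay_seconds):
    """Return a decorator that delays every call by delay_seconds."""
    def decorator(func):
        @wraps(func)
        def wrapper(*args, **kwargs):
            time.sleep(delay_seconds)  # the injected fault: extra latency
            return func(*args, **kwargs)
        return wrapper
    return decorator


def fetch_profile(user_id):
    # Hypothetical stand-in for a call to a downstream service.
    return {"user_id": user_id, "name": "example"}


def timed_call(func, *args):
    """Call func and return (result, elapsed_seconds)."""
    start = time.perf_counter()
    result = func(*args)
    return result, time.perf_counter() - start


# Wrap only one call path, which keeps the experiment as small as possible.
slow_fetch_profile = with_added_latency(0.05)(fetch_profile)

if __name__ == "__main__":
    result, elapsed = timed_call(slow_fetch_profile, 42)
    print(f"got {result} after {elapsed * 1000:.0f} ms")
```

The interesting part is what happens upstream of the slowed call: timeouts, retries, and queues that were tuned for the fast path.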
Steady state hypothesis. Before running any experiment, define what "normal" looks like. If you can't describe your system's steady state, you can't measure the impact of failures. This principle alone changed how I think about monitoring.
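One way to write a steady state down is as explicit thresholds that a script can check. The sketch below assumes two metrics, error rate and 95th percentile latency. Those metrics, the threshold values, and the names `SteadyState` and `within_steady_state` are illustrative assumptions, not something the certification prescribes.

```python
from dataclasses import dataclass
from statistics import quantiles


@dataclass(frozen=True)
class SteadyState:
    """What "normal" looks like, expressed as measurable limits."""
    max_error_rate: float       # fraction of failed requests, e.g. 0.01
    max_p95_latency_ms: float   # 95th percentile latency in milliseconds


def p95(latencies_ms):
    # quantiles(..., n=20) returns 19 cut points; the last one is the
    # 95th percentile. It needs at least two samples.
    return quantiles(latencies_ms, n=20)[-1]


def within_steady_state(hypothesis, latencies_ms, errors, requests):
    """Return True if the observed window satisfies the hypothesis."""
    error_rate = errors / requests
    return (
        error_rate <= hypothesis.max_error_rate
        and p95(latencies_ms) <= hypothesis.max_p95_latency_ms
    )


if __name__ == "__main__":
    hypothesis = SteadyState(max_error_rate=0.01, max_p95_latency_ms=300)
    sample = [120, 135, 110, 180, 150, 140, 125, 160, 145, 130]
    print(within_steady_state(hypothesis, sample, errors=0, requests=10))
```

Running the same check before, during, and after an experiment turns "did anything change?" into a yes-or-no answer.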
Blast radius control. Every experiment should have a clearly defined blast radius and an abort condition. This is what separates chaos engineering from just breaking things.
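A rough sketch of what those two controls could look like in code, under simplifying assumptions: the blast radius is a fraction of hosts, and the abort condition is an error rate threshold. The functions `choose_targets` and `run_experiment` and their callback parameters are hypothetical, not part of any real tool.

```python
import random


def choose_targets(hosts, fraction, seed=None):
    """Pick a limited subset of hosts: the blast radius."""
    if not 0 < fraction <= 1:
        raise ValueError("fraction must be in (0, 1]")
    count = max(1, int(len(hosts) * fraction))  # always at least one host
    return random.Random(seed).sample(hosts, count)


def run_experiment(targets, inject, rollback, read_error_rate,
                   abort_above, checks):
    """Inject a failure on targets and poll a metric `checks` times.

    Returns "aborted" as soon as the error rate exceeds abort_above,
    otherwise "completed". Rollback runs in both cases.
    """
    for target in targets:
        inject(target)
    try:
        for _ in range(checks):
            if read_error_rate() > abort_above:
                return "aborted"
        return "completed"
    finally:
        # Undo the failure whether the run finished, aborted, or raised.
        for target in targets:
            rollback(target)


if __name__ == "__main__":
    hosts = [f"host-{i}" for i in range(10)]
    targets = choose_targets(hosts, fraction=0.2, seed=1)
    readings = iter([0.01, 0.02, 0.20, 0.01])  # simulated error rates
    outcome = run_experiment(
        targets,
        inject=lambda host: print(f"injecting on {host}"),
        rollback=lambda host: print(f"rolling back {host}"),
        read_error_rate=lambda: next(readings),
        abort_above=0.05,
        checks=4,
    )
    print(outcome)
```

Putting the rollback in a `finally` block is the design choice that matters here: the experiment cleans up even when the abort condition fires or something unexpected goes wrong.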
How I'll use this
I'm planning to introduce game days — structured chaos experiments — into our quarterly reliability process. The goal isn't to find bugs. It's to build confidence that our systems degrade gracefully under failure conditions.
I also want to use chaos engineering principles to stress-test our incident response process. Do our runbooks actually work? Does the on-call rotation know what to do? The only way to find out is to simulate failure.
This certification is just the beginning. More posts on implementation coming as we run our first experiments.