Root Cause Analysis Beyond the Five Whys
Everyone knows the Five Whys. Ask "why" five times, find the root cause, fix it. In practice, I have found that this technique works well for simple, linear problems and fails badly for the complex, systemic issues that actually plague engineering programs.
Where Five Whys Breaks Down
The Five Whys assumes a single causal chain. But production incidents in distributed systems rarely have one root cause. They have contributing factors, latent conditions, and triggering events that combine in ways nobody predicted. When you force a complex failure into a linear chain, you end up with a superficial answer and a corrective action that does not actually prevent recurrence.
I watched a team use Five Whys on a payment processing outage and arrive at "the engineer did not test the edge case." The corrective action was "add more tests." Three months later, a similar outage occurred because the actual systemic issue — inadequate staging environment parity — was never addressed.
What I Use Instead
Fishbone Diagrams (Ishikawa): For incidents with multiple contributing factors, I map causes across categories — People, Process, Technology, Environment. This forces the team to think broadly rather than chasing the first plausible explanation.
Fault Tree Analysis: For critical failures, I work backward from the undesired event and map all the conditions that had to be true for it to occur. This is more rigorous and takes more effort, but it is exceptionally useful for high-severity incidents.
Timeline Reconstruction: Before analyzing causes, I build a detailed timeline with the team: what happened, when, and who knew what at each point. This helps keep hindsight bias from contaminating the analysis.
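To make the fault tree idea concrete, here is a minimal sketch in Python. It is an illustration under assumptions, not a standard tool: the `Node` class, the `minimal_cut_sets` function, and the example tree are all hypothetical, with the tree loosely modeled on the payment outage described earlier. The tree works backward from the top event through AND gates (every child condition must hold) and OR gates (any one child is enough), and the function lists the smallest combinations of basic conditions that are sufficient to cause the top event.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Node:
    """One event in a fault tree. All names here are illustrative."""
    name: str
    gate: str = "BASIC"  # "AND", "OR", or "BASIC" (a leaf condition)
    children: List["Node"] = field(default_factory=list)


def minimal_cut_sets(node: Node) -> List[frozenset]:
    """Return the minimal sets of basic conditions sufficient to cause `node`."""
    if node.gate == "BASIC":
        return [frozenset([node.name])]

    child_sets = [minimal_cut_sets(child) for child in node.children]
    if node.gate == "OR":
        # Any single child's cut set is enough on its own.
        combined = [s for sets in child_sets for s in sets]
    elif node.gate == "AND":
        # Need one cut set from every child, merged together.
        combined = [frozenset()]
        for sets in child_sets:
            combined = [a | b for a in combined for b in sets]
    else:
        raise ValueError(f"unknown gate: {node.gate}")

    # Drop duplicates and any set that strictly contains a smaller cut set.
    unique = set(combined)
    minimal = [s for s in unique if not any(other < s for other in unique)]
    return sorted(minimal, key=lambda s: (len(s), sorted(s)))


# Hypothetical tree for illustration only.
top = Node("payment outage", "AND", [
    Node("edge case not covered by tests"),
    Node("defect not caught before full rollout", "OR", [
        Node("staging differs from production"),
        Node("no canary rollout"),
    ]),
])

for cut_set in minimal_cut_sets(top):
    print(sorted(cut_set))
```

In this made-up tree, every cut set contains the untested edge case, but never by itself: it only causes the outage in combination with a missing safety net. That is the kind of systemic condition a single linear chain tends to hide.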
The Cultural Element
The technique matters less than the culture. If people fear blame, no methodology will surface the real causes. I explicitly frame every root cause analysis as a system investigation, not a people investigation. The question is never "who made the mistake" but "what made the mistake possible." That distinction changes everything about the quality of information you get.