Fishbone Diagrams for Software Incidents
The fishbone diagram, also called the Ishikawa diagram, is a root cause analysis tool from manufacturing that I have adapted heavily for software incident analysis. Used well, it forces a team to explore causes systematically rather than jumping to the first explanation that sounds plausible.
How I Adapt It for Software
In manufacturing, the standard categories are Man, Machine, Method, Material, Measurement, and Environment. For software incidents, I use a modified set: People and Skills, Process, Technology and Infrastructure, Codebase, Monitoring, and External Dependencies.
Each category becomes a branch on the diagram, and the team brainstorms potential contributing factors under each branch. The key word is "contributing": I discourage teams from looking for "the" root cause because complex incidents rarely have just one.
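If you keep incident notes in code or a script rather than on a whiteboard, the branch structure above can be sketched as a small data model. This is only an illustration: the names Category, Fishbone, add, and empty_branches are my own hypothetical choices, not part of any standard tool, and the incident text is a placeholder.

```python
from dataclasses import dataclass, field
from enum import Enum


class Category(Enum):
    """The six software-oriented branches described above."""
    PEOPLE = "People and Skills"
    PROCESS = "Process"
    TECHNOLOGY = "Technology and Infrastructure"
    CODEBASE = "Codebase"
    MONITORING = "Monitoring"
    EXTERNAL = "External Dependencies"


@dataclass
class Fishbone:
    """Hypothetical model: one incident (the head) plus one list per branch."""
    incident: str
    branches: dict[Category, list[str]] = field(
        default_factory=lambda: {c: [] for c in Category}
    )

    def add(self, category: Category, factor: str) -> None:
        # Record a contributing factor under its branch.
        self.branches[category].append(factor)

    def empty_branches(self) -> list[Category]:
        # Branches nobody has explored yet; useful as a prompt for the team.
        return [c for c, factors in self.branches.items() if not factors]


# Placeholder incident and factor, for illustration only.
diagram = Fishbone("Example incident statement")
diagram.add(Category.PROCESS, "Example contributing factor")

for category in diagram.empty_branches():
    print(f"Not yet explored: {category.value}")
```

Listing the empty branches is the part I find useful: it makes it obvious when the team has only looked at one or two categories.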
A Real Example
We had a production incident where an API started returning stale data after a deployment. The initial instinct was to blame the code change. But when we built the fishbone diagram, a richer picture emerged.
Under Process, we identified that the deployment happened without the standard cache invalidation step because the runbook had not been updated to include the new caching layer. Under Technology and Infrastructure, the caching configuration had no TTL set for that specific endpoint. Under Monitoring, there was no alert for cache staleness, only for API errors, which were not occurring. Under People and Skills, the engineer who deployed was new to the team and had not been onboarded on the caching architecture.
No single factor caused the incident. All four had to be true simultaneously for it to happen. The corrective actions addressed all four: updated runbook, mandatory TTL configuration, staleness monitoring, and an onboarding checklist for caching architecture.
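To make the pairing of factors and corrective actions explicit, here is a minimal sketch that records the four factors from this incident alongside their actions and checks that none is left without one. The Factor class and the unaddressed helper are hypothetical names I am introducing for illustration; the content of each entry comes from the incident described above.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Factor:
    """Hypothetical record: one contributing factor and its corrective action."""
    branch: str
    description: str
    corrective_action: str


factors = [
    Factor(
        "Process",
        "Runbook not updated for the new caching layer, so cache invalidation was skipped",
        "Update the runbook",
    ),
    Factor(
        "Technology and Infrastructure",
        "No TTL set in the caching configuration for the endpoint",
        "Make TTL configuration mandatory",
    ),
    Factor(
        "Monitoring",
        "No alert for cache staleness, only for API errors",
        "Add staleness monitoring",
    ),
    Factor(
        "People and Skills",
        "Deploying engineer not onboarded on the caching architecture",
        "Add caching architecture to the onboarding checklist",
    ),
]


def unaddressed(items: list[Factor]) -> list[Factor]:
    # A factor counts as unaddressed if its corrective action is blank.
    return [f for f in items if not f.corrective_action.strip()]


for f in factors:
    print(f"{f.branch}: {f.description} -> {f.corrective_action}")

if unaddressed(factors):
    print("Some contributing factors still have no corrective action.")
```

The check is trivial here because every factor got an action, but in a longer list it is an easy way to spot factors that were discussed and then dropped.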
Running the Session
I timebox fishbone sessions to forty-five minutes. I draw the diagram on a shared whiteboard or Miro, state the incident clearly as the "head" of the fish, and then work through each category one at a time. I ask the team to hold judgment until all branches are populated. Only then do we prioritize which contributing factors to address.
In my experience, the visual nature of the diagram makes it more effective than a freeform discussion. People think differently when they can see the structure.