Introduction

If you have ever been on a postmortem call, you know the most popular root cause approach is deny, deny, deny 😊. All joking aside, as a problem manager I see Root Cause Analysis (RCA) as more than just asking "why" repeatedly. It's a structured approach, or dare I say an art, to problem-solving that can transform how organizations handle incidents.

Let's walk through three popular RCA methods: 5 Whys, Ishikawa, and Fault Tree Analysis.

*Note: This article covers RCA methods only. It does not cover the entire RCA process, which also includes identifying contributing factors such as monitoring, MTTR, MTTD, etc.*

Case Study 1: The Recurring Network Outage

The Scenario

A financial services company experienced intermittent network outages every Friday afternoon, impacting customer transactions.

The Analysis: 5 Whys Technique

Why 1: Why did the network go down?
- Because the network bandwidth was maxed out

Why 2: Why was the bandwidth maxed out?
- Because of a spike in backup processes

Why 3: Why was there a backup spike at that time?
- Because all department backups were scheduled for Friday 2 PM

Why 4: Why were all backups scheduled simultaneously?
- Because there was no backup schedule coordination between departments

Why 5: Why wasn't there coordination?
- Because there was no centralized backup policy management
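The chain above can also be captured as data, which makes it easy to store alongside the incident record. This is a minimal sketch, not a standard tool: the `Why` class and `root_cause` helper are names made up for illustration, and treating the last answer as the root cause is simply the convention of the 5 Whys technique.

```python
from dataclasses import dataclass


@dataclass
class Why:
    """One step in a 5 Whys chain (hypothetical helper type)."""
    question: str
    answer: str


def root_cause(chain: list[Why]) -> str:
    """Return the answer to the last 'why', which the technique treats as the root cause."""
    if not chain:
        raise ValueError("the chain needs at least one why")
    return chain[-1].answer


# The five steps from the case study, in order.
chain = [
    Why("Why did the network go down?",
        "The network bandwidth was maxed out"),
    Why("Why was the bandwidth maxed out?",
        "There was a spike in backup processes"),
    Why("Why was there a backup spike at that time?",
        "All department backups were scheduled for Friday 2 PM"),
    Why("Why were all backups scheduled simultaneously?",
        "There was no backup schedule coordination between departments"),
    Why("Why wasn't there coordination?",
        "There was no centralized backup policy management"),
]

for depth, step in enumerate(chain, start=1):
    print(f"Why {depth}: {step.question} -> {step.answer}")

print("Root cause:", root_cause(chain))
```

Keeping each question paired with its answer preserves the reasoning trail, so a reviewer can challenge any single link rather than only the conclusion.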

Preventative Actions

Implementing a staggered backup schedule and a centralized backup policy reduced network load by 60% and eliminated the outages.
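To illustrate what staggering means in practice, here is a rough sketch that gives each department its own start time. The department names, the two-hour gap, and the `stagger` function are all assumptions for illustration; the case study does not say how the real schedule was built.

```python
from datetime import datetime, timedelta


def stagger(departments: list[str], first_start: datetime,
            gap: timedelta) -> dict[str, datetime]:
    """Give each department its own backup start time, spaced `gap` apart."""
    return {dept: first_start + i * gap for i, dept in enumerate(departments)}


# Hypothetical departments and a hypothetical 2-hour gap between backups.
departments = ["finance", "hr", "engineering"]
friday_2pm = datetime(2024, 1, 5, 14, 0)  # an arbitrary Friday at 2 PM

schedule = stagger(departments, friday_2pm, timedelta(hours=2))

for dept, start in schedule.items():
    print(f"{dept}: {start:%A %H:%M}")
```

A single function that owns the schedule also reflects the centralized policy idea: one place decides who runs when, instead of each department picking its own slot.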

Case Study 2: The Mysterious Application Crash

The Scenario

A critical business application would crash unpredictably, with no clear pattern in the logs.

The Analysis: Ishikawa (Fishbone) Diagram

Fishbone Diagram Example

The fishbone analysis revealed multiple contributing factors:

Preventative Actions

A comprehensive set of preventative actions, including code optimization, an automated memory management process, a new alert email distribution list, and team training, resulted in 99.9% uptime and quicker responses to future issues.

Case Study 3: The Payment Processing System Failure

The Scenario

A financial services company experienced recurring system failures in their payment processing platform, leading to service disruptions and customer dissatisfaction.

The Analysis: Fault Tree Analysis

Fault Tree Analysis Example

The Fault Tree Analysis revealed that system failures occurred when both equipment failures and operator errors were present:
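The "both must be present" relationship described above is what a fault tree calls an AND gate. The sketch below shows how such a gate can be evaluated; the `Event` and `Gate` classes and the `payment_failure` helper are hypothetical names, and the tree holds only the two branches named in the text, not the full tree from the original analysis.

```python
from dataclasses import dataclass, field


@dataclass
class Event:
    """A basic event (leaf) in the fault tree."""
    name: str
    occurred: bool


@dataclass
class Gate:
    """A logic gate combining child events or gates."""
    name: str
    kind: str  # "AND" or "OR"
    inputs: list = field(default_factory=list)


def evaluate(node) -> bool:
    """Return True if the event at this node occurs."""
    if isinstance(node, Event):
        return node.occurred
    results = [evaluate(child) for child in node.inputs]
    if node.kind == "AND":
        return all(results)
    if node.kind == "OR":
        return any(results)
    raise ValueError(f"unknown gate kind: {node.kind}")


def payment_failure(equipment_failed: bool, operator_error: bool) -> bool:
    """Top event from the case study: fails only when both branches occur."""
    top = Gate("Payment system failure", "AND", [
        Event("Equipment failure", equipment_failed),
        Event("Operator error", operator_error),
    ])
    return evaluate(top)


print(payment_failure(True, True))    # both present
print(payment_failure(True, False))   # equipment failure alone
```

Because the top gate is an AND, removing either branch is enough to prevent the top event, which is why preventative actions can target equipment reliability, operator error, or both.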

Preventative Actions

Based on the FTA findings, we would propose the following preventative actions:

Key Takeaways

Conclusion

There is more than one way to skin a cat, and the same goes for RCA. At the end of the day, an effective RCA, done properly, pinpoints the true root cause of an incident and identifies preventative actions that reduce both the likelihood of the incident recurring and its impact if it does. By applying these techniques appropriately, you can transform reactive problem-solving into proactive problem prevention.