Mastering Root Cause Analysis: From 5 Whys to Ishikawa

Introduction

If you have ever been involved in a postmortem call then you know the most popular root cause approach is deny deny deny 😊. All joking aside, as a problem manager, the Root Cause Analysis (RCA) is more than just asking "why" repeatedly. It's a structured approach or dare I say, it's an art to problem-solving that can transform how organizations handle incidents.

Lets walk through 3 popular RCA methods: 5 Whys, Ishikawa, and Fault Tree Analysis.

*Note: The following article only covers RCA methods and does not cover the entire RCA process, including identifying contributing factors such as monitoring, MTTR, MTTD, etc.*

Case Study 1: The Recurring Network Outage

The Scenario

A financial services company experienced intermittent network outages every Friday afternoon, impacting customer transactions.

The Analysis: 5 Whys Technique

Why 1: Why did the network go down?
- Because the network bandwidth was maxed out

Why 2: Why was the bandwidth maxed out?
- Because of a spike in backup processes

Why 3: Why was there a backup spike at that time?
- Because all department backups were scheduled for Friday 2 PM

Why 4: Why were all backups scheduled simultaneously?
- Because there was no backup schedule coordination between departments

Why 5: Why wasn't there coordination?
- Because there was no centralized backup policy management

Preventative Actions

Implementation of a staggered backup schedule and centralized backup policy reduced network load by 60% and eliminated the outages.

Case Study 2: The Mysterious Application Crash

The Scenario

A critical business application would crash unpredictably, with no clear pattern in the logs.

The Analysis: Ishikawa (Fishbone) Diagram

The fishbone analysis revealed multiple contributing factors:

Technology: Memory leaks in the application
Process: No regular process for memory cleanup
People: Junior developers unaware of memory management best practices
Monitoring: Memory threshold alerts went to the wrong team

Preventative Actions

A comprehensive list of preventative actions including code optimization, automated memory management process, new alert email distribution list, and team training resulted in 99.9% uptime and quicker responses in the event of future issues.

Case Study 3: The Payment Processing System Failure

The Scenario

A financial services company experienced recurring system failures in their payment processing platform, leading to service disruptions and customer dissatisfaction.

The Analysis: Fault Tree Analysis

The Fault Tree Analysis revealed that system failures occurred when both equipment failures and operator errors were present:

Equipment Failures (OR gate):
- Mechanical Failures:
  - Substandard material quality in server components
  - Mechanical wear and tear due to lack of maintenance
- Electrical Failures:
  - Faulty wiring in the data center
  - Power surges affecting critical systems
Operator Errors (AND gate):
- Inadequate Training:
  - Insufficient training programs for new staff
  - High turnover leading to less experienced staff
- Operator Fatigue:
  - Extended shift hours without adequate breaks
  - High workload and pressure during critical incidents

Preventative Actions

Based on the FTA findings, we would propose the following preventative actions:

Equipment Improvements:
- Implemented strict quality control for hardware components
- Established regular maintenance schedules
- Upgraded electrical systems with surge protection
- Conducted thorough wiring inspections and repairs
Operator Support:
- Developed comprehensive training programs
- Implemented better shift management
- Increased staffing levels to reduce workload
- Created mandatory break schedules

Key Takeaways

Choose the right RCA tool for the problem complexity
Involve all stakeholders in the analysis process
Document findings and solutions thoroughly
Implement preventive measures, not just fixes

Conclusion

There is more than one way to skin a cat and the same goes for RCA. At the end of the day, an effective RCA if done properly, pinpoints the true root cause of the incident and identifies preventative actions that will reduce the likelihood of the incident occurring again and reduce the impact of the incident if it does occur. By applying these techniques appropriately, you can transform reactive problem-solving into proactive problem prevention.