What is Root Cause Analysis?

Root Cause Analysis (RCA) is the discipline of refusing to stop at the first plausible explanation for a failure. It is what separates teams that solve a problem once from teams that solve a slightly different version of the same problem every quarter.

The premise is simple. Every visible failure — a defect, an outage, a complaint, an audit finding — sits on top of a chain of causes. Most of those causes are about the work itself: a process step, a decision rule, a piece of training, a system default. RCA walks that chain backwards until you reach a cause you can change. Then you change it.

That is the whole job. Everything else is technique.

Symptom, cause, root cause

Most teams already know the difference between a symptom (the thing the customer or auditor noticed) and a cause (the thing that produced the symptom). The harder distinction is between a cause and a root cause.

A useful definition: a root cause is the most upstream thing your team can change that, once changed, removes the failure from your future. Three tests:

Actionable. It is inside your control or your team's control. "The supplier shipped the wrong part" is a cause. "We have no incoming inspection step that would catch a wrong part" is closer to a root cause.
Preventive, not detective. Adding a second inspector to the line catches the next defect. It does not remove the conditions that produced the first one. A root-cause fix changes the conditions.
Stops the recurrence. If you implement the fix and the same class of failure still happens, you didn't find the root cause. You found a cause.

A lot of "RCA" output fails the third test. It reads like a list of plausible contributors with one of them circled. You can tell it was not a real root-cause analysis when, two quarters later, the same problem comes back wearing a slightly different hat.
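
The third test can be run against your own incident log. Here is a minimal sketch, assuming each incident is recorded with a date and a failure class, and that you know when the fix for each class shipped. The function name, the record shape, and the sample data are all hypothetical.

```python
from collections import Counter
from datetime import date


def recurrences_after_fix(incidents, fix_dates):
    """Count incidents per failure class that occurred after that class's fix shipped.

    incidents: iterable of (occurred_on, failure_class) tuples
    fix_dates: dict mapping failure_class -> date the fix shipped
    """
    counts = Counter()
    for occurred_on, failure_class in incidents:
        fixed_on = fix_dates.get(failure_class)
        # Only count incidents in a class that was supposedly fixed already.
        if fixed_on is not None and occurred_on > fixed_on:
            counts[failure_class] += 1
    return dict(counts)


# Hypothetical incident log: one leak before the fix, two after it.
incidents = [
    (date(2024, 1, 10), "valve-leak"),
    (date(2024, 4, 2), "valve-leak"),
    (date(2024, 4, 20), "valve-leak"),
    (date(2024, 3, 1), "label-mixup"),  # no fix recorded, so not counted
]
fix_dates = {"valve-leak": date(2024, 2, 1)}

print(recurrences_after_fix(incidents, fix_dates))  # {'valve-leak': 2}
```

Any class that shows up in the result had a fix that addressed a cause, not the root cause. The sketch only works if incidents are classified consistently, which is its own discipline.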

Why teams stop too early

Three failure modes dominate.

Stopping at human error. "Operator made the wrong call." "Engineer pushed the wrong button." "Nurse misread the chart." Every one of these is a cause. None of them is a root cause. They describe where the failure surfaced, not why the system made that failure available. A blameless walk asks the next question: what part of the work let a single mistake propagate this far?

Stopping at the first satisfying narrative. Five Whys is a famously easy method that fails in famously predictable ways. The most common is that the team latches onto the first explanation that feels right and stops asking. The second is that the chain branches — there are usually three or four whys at each step, not one — and the team only walks the branch that is easiest to talk about. Discipline beats technique here.

Stopping at "we will be more careful." A corrective action that has no failure mode of its own — no missed step, no skipped check, no thing that can be left off a Friday afternoon checklist — usually isn't a corrective action. It is a reminder. Reminders decay.

A defensible RCA actively resists all three.

The six steps, honestly

Most RCA frameworks share the same skeleton. The version below is the one we recommend, and it calls out the parts that usually get skimmed.

1. Define the problem. Be specific about what, where, when, and how often. "Valve leakage" is not a problem statement. "8% leakage rate on Line B, final inspection, past 14 days, against a 0.5% historical baseline" is. The numbers anchor the rest of the analysis.
2. Gather the data. Process records, test results, maintenance logs, operator notes, complaint text: whatever exists for this failure, not a similar one you remember. Most teams skip this step or do a half-version of it, and then spend the rest of the analysis arguing about what they remember.
3. Walk the chain. Ask "why" until further whys stop changing anything actionable. At each step, look for the branches; most failures have more than one contributing cause at most levels. Capture the branches even if you don't walk them. An RCA that pretends a failure was a single chain is hiding evidence.
4. Pick the root causes. Note the plural. Real failures usually have two to three root causes that interact. The corrective actions need to cover all of them; fixing only the easiest one is how problems come back.
5. Implement the fix. A corrective action that changes a process, a default, or a control is durable. A corrective action that asks people to remember a new rule is not. If the only thing you can write down is "the team will be more careful," go back to step 4.
6. Verify it stuck. Pick a metric that was wrong before (the leakage rate, the recurrence count, the cycle-time variance) and watch it for long enough to know whether the change held. "We added the fix and haven't heard about the problem since" is not verification. It is silence.

The order matters less than the discipline of finishing each step before starting the next.
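
The first and last steps share a metric, which makes them easy to sketch together. The example below assumes rates are stored as fractions and that "verified" means a minimum number of post-fix observations, all within a tolerance of the baseline. The class, the function, the 1.5x tolerance, and the four-period minimum are illustrative assumptions, not a standard.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ProblemStatement:
    what: str
    where: str
    window_days: int
    observed_rate: float  # fraction, e.g. 0.08 for 8%
    baseline_rate: float  # fraction, e.g. 0.005 for 0.5%

    def excess_ratio(self) -> float:
        """How many times worse than baseline the observed rate is."""
        return self.observed_rate / self.baseline_rate


def fix_is_verified(rates_after_fix, baseline_rate, min_periods, tolerance=1.5):
    """True only with enough observations, all at or below baseline * tolerance.

    The tolerance and min_periods values are assumptions for illustration.
    """
    if len(rates_after_fix) < min_periods:
        return False  # too little data: this is silence, not verification
    limit = baseline_rate * tolerance
    return all(rate <= limit for rate in rates_after_fix)


statement = ProblemStatement(
    what="valve leakage",
    where="Line B, final inspection",
    window_days=14,
    observed_rate=0.08,
    baseline_rate=0.005,
)
print(statement.excess_ratio())  # 16.0

print(fix_is_verified([0.006, 0.004], 0.005, min_periods=4))  # False
print(fix_is_verified([0.006, 0.004, 0.005, 0.007], 0.005, min_periods=4))  # True
```

Writing the problem as numbers makes the size of the gap explicit: 8% against a 0.5% baseline is 16 times the historical rate. It also gives step 6 a concrete target to watch.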

Why the artifact matters

One quick aside, because it explains why we built RCA Map.

Most RCA outputs end up as a Word doc, a fishbone scrawled on a whiteboard photo, or a row in a tracker. None of those shapes match the way the team actually thought about the failure. The thinking is a branching tree of causes; the artifact is flat. The mismatch is where most of the rigor leaks out.

RCA Map keeps the artifact in the same shape as the thinking — a live, branching tree of causes you can walk, prune, and share. We'll come back to that in other posts. Back to method.

Where RCA goes wrong

The failure modes are predictable enough to enumerate.

The analysis stops at who instead of what. The fix is to ban human-name terminal nodes from the tree. Force the next "why."
The corrective action is detective, not preventive. The fix is to ask: "what would have to be true upstream for this to not have happened?"
The team writes a CAPA, files it, and never closes the loop. The fix is a hard verification step with an owner, a date, and a metric.
The same problem comes back six months later in a different system. The fix is to read your own backlog. A recurring failure across systems is a signal that the root cause is policy, not process.
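
The first and third fixes above are mechanical enough to check automatically. The following is a rough sketch, assuming the analysis is stored as a tree of causes and that an analyst has flagged which nodes are human-error statements. Every name here is hypothetical; this is not RCA Map's data model.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class CorrectiveAction:
    description: str
    owner: Optional[str] = None
    due_date: Optional[str] = None
    metric: Optional[str] = None


@dataclass
class Cause:
    text: str
    is_human_error: bool = False  # set by the analyst, not inferred
    children: List["Cause"] = field(default_factory=list)
    actions: List[CorrectiveAction] = field(default_factory=list)


def lint(node, path=()):
    """Return a list of problems found in the cause tree rooted at node."""
    problems = []
    here = path + (node.text,)
    # A human-error node with no children means the analysis stopped at "who".
    if not node.children and node.is_human_error:
        problems.append(f"terminal node blames a person: {' > '.join(here)}")
    # Every corrective action needs an owner, a date, and a metric.
    for action in node.actions:
        missing = [
            name for name in ("owner", "due_date", "metric")
            if getattr(action, name) is None
        ]
        if missing:
            problems.append(
                f"action '{action.description}' is missing: {', '.join(missing)}"
            )
    for child in node.children:
        problems.extend(lint(child, here))
    return problems


# Hypothetical tree with one of each problem.
root = Cause("8% leakage on Line B", children=[
    Cause("Wrong seal installed", children=[
        Cause("Operator picked the wrong seal", is_human_error=True),
    ]),
    Cause("No incoming inspection for seals", actions=[
        CorrectiveAction("Add incoming inspection step", owner="QA lead"),
    ]),
])

for problem in lint(root):
    print(problem)
```

On this sample the check reports two problems: the operator node ends a branch, and the inspection action has no due date or metric. A check like this cannot tell whether a corrective action is preventive or detective; that judgment stays with the team.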

None of these are exotic. Teams know about them. The reason they keep happening is that the artifact and the workflow don't make them easy to avoid.

The mindset

RCA is not a checkbox. It is a way of treating failures as a source of information about how your system actually works, rather than a source of blame for the people who happened to be inside it when it broke. Get that part right and the techniques mostly take care of themselves.

Solve the problem once. Fix the cause, not the symptom.