Programs, Architecture & Analytics

Root Cause Analysis Principles

Before jumping into the tools, let’s discuss some of the basic guiding principles, best practices, and benefits of Root Cause Analysis (RCA). Where appropriate, we will also provide examples and explanations of how these principles are used. The primary source of these principles was the RCA Wiki, but we added to and changed the principles to make the list more comprehensive and to provide clarifications and details.

  1. The primary goal of root cause analysis is to identify the source(s) (i.e., root causes) and causal factors (also known as contributing factors) of a negatively impacting event. Such events are commonly referred to as problems, accidents, incidents, failures, deviations, or non-conformances.
  2. The purpose of identifying root cause(s) and causal factor(s) is to eliminate or mitigate the potential for these variables to recur, and thereby to eliminate or mitigate the potential for a negatively impacting recurrence.
  3. Taking action to eliminate or mitigate root cause(s) or causal factor(s) is commonly referred to as Corrective Action and Preventative Action (CAPA) Management. It is good practice to identify CAPA items with the lowest possible cost of implementation and/or to identify a variety of CAPA options that differ in cost and effectiveness.
  4. It is good practice to identify root cause(s) and causal factor(s) within an operational framework. In other words, root causes and causal factors should be identified as issues related to human and organizational/technological systems. Many organizations associate root causes and causal factors with people, processes, and procedures (the 3 Ps). At ThinkGRC, we break them down into three parts: Problems Classification, Programs, and Management Systems, as related to human and organizational/technological systems. We will talk more about that in the future. For now, make sure that you look into your organization, operations, and people when conducting a Root Cause Analysis.
  5. RCA should be performed systematically and (usually) as part of an investigation. RCA can be part “Art” and part “Science”. It is recommended to have experienced personnel perform the RCA and to encourage team member participation. The team should consist of personnel who have detailed knowledge of the scope and operation being investigated.
  6. The RCA should be documented. The results should be actionable and easy for a non-specialist to understand, and they should contain Corrective and Preventative Actions (CAPAs) and lessons learned.
  7. Corrective and Preventative Actions (CAPA) are an important part of RCA. For every incident, there should be one or more Corrective and Preventative Actions identified. As a good practice, for every Corrective Action identified, there should be a corresponding Preventative Action identified. As an example, let’s say the Corrective Action in an incident is to reduce the heat of an oven at a certain time in the drying process, a change that is made automatically by the automated manufacturing system. The Preventative Action would be something like adding an item to a Standard Operating Procedure (SOP)/Checklist for an individual to manually validate the temperature change at a predetermined frequency, or adding automated monitoring/alarming to validate the temperature change.
  8. An event can have more than one root cause. For every root cause, there will be one or more causal factors. We provided a brief explanation and justification for multiple root causes, but the basic reasoning is that many processes are very large and complex and that the problem statement dictates the RCA. Depending on whether the problem statement encompasses the entire event, and on the size and complexity of the event, multiple failures can occur, each stemming from its own root cause. Here is an example: we have a data center outage due to the failure of a firewall, and when it occurs, we lose access to some high availability systems. In this scenario, we should not have lost access to these systems; they should have failed over, or the load balancer should have identified the issue and automatically redirected to the secondary data center in a seamless transition (i.e., high availability). The root cause of the firewall failure was identified as an issue with a firewall configuration upgrade that was conducted a couple of hours earlier. The high availability systems, however, did not fail over or redirect properly for a different reason: an improper configuration of the HA/load balancer prevented the fail-over/redirect from occurring. That improper configuration is therefore the root cause of the high availability operational failure.
  9. As stated, Root Cause Analysis is conducted (in general) for negatively impacting events. The scale and level of detail of the RCA depend on the size and scope of the event. It is up to the organization/individual(s) to define how the RCA will be conducted. A common industry term for what establishes the focus, size, and scope of an RCA is the “Problem Statement”. The Problem Statement frames the question to be asked and solved. In practice, the Problem Statement determines how we view the incident. At times, we might conduct the RCA on the full scope of the incident, culminating in one large RCA; at other times, the Problem Statement might break the incident down into smaller functional operations for analysis. To learn more about Problem Statements, check the Wiki.
  10. The standard approach to Root Cause Analysis is hierarchical. The hierarchical approach is viewed as a top-down analysis starting with the Problem Statement. In general, the objective is to identify a top-level issue/problem and then drill down into the incident to identify the underlying causes. The drill-down process is an exercise in identifying why and how different variables contributed to the incident and the relationships among those variables. There are a variety of tools available for RCA. Many of these tools contain methodologies for identification, charting, and relationship mapping.
  11. When conducting a Root Cause Analysis, a good practice is to view a negatively impacting event as a sequence of events and/or along a timeline.  The sequence of events and/or timeline is used to identify factors contributing to the event and the relationships among them.
  12. Root Cause Analysis should be viewed as a process improvement tool and is usually coupled with other functions, such as investigation or problem management, or used as part of compliance-driven operations. As with all tools used to identify issues, gaps, or non-conformance, RCA should be promoted as a tool for positive change.
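
To make the relationships in principles 7, 8, and 10 more concrete, here is a minimal Python sketch of how an RCA record could be structured: a Problem Statement at the top, one or more root causes beneath it, and causal factors and CAPA pairs attached to each root cause. The class names (`Rca`, `RootCause`, `Capa`) and the `gaps` check are our own illustration, not part of any RCA tool or standard. The example data reuses the data center outage from principle 8, but the causal factors and CAPA wording are invented for illustration, and requiring a CAPA per root cause is a stricter assumption than the per-incident rule stated above.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Capa:
    corrective: str    # addresses the cause found in this incident
    preventative: str  # guards against the cause recurring


@dataclass
class RootCause:
    description: str
    causal_factors: List[str] = field(default_factory=list)
    capas: List[Capa] = field(default_factory=list)


@dataclass
class Rca:
    problem_statement: str  # top of the hierarchy
    root_causes: List[RootCause] = field(default_factory=list)

    def gaps(self) -> List[str]:
        """List root causes still missing causal factors or CAPA items."""
        found = []
        for rc in self.root_causes:
            if not rc.causal_factors:
                found.append(f"no causal factor: {rc.description}")
            if not rc.capas:
                found.append(f"no CAPA: {rc.description}")
        return found


# Hypothetical record: causal factors and CAPA text are assumptions.
rca = Rca(
    problem_statement="Data center outage with loss of high availability systems",
    root_causes=[
        RootCause(
            description="Faulty firewall configuration upgrade",
            causal_factors=["Upgrade applied without a tested rollback plan"],
            capas=[Capa(
                corrective="Roll back and correct the firewall configuration",
                preventative="Add a pre-upgrade validation step to the change checklist",
            )],
        ),
        RootCause(
            description="Improper HA/load balancer configuration",
            causal_factors=["Fail-over path never exercised in a test"],
        ),
    ],
)

for gap in rca.gaps():
    print(gap)  # prints: no CAPA: Improper HA/load balancer configuration
```

Keeping the two root causes as separate entries mirrors the point of principle 8: the firewall failure and the fail-over failure each get their own causal factors and their own CAPA items, and a simple check like `gaps` shows which ones are still incomplete.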
