Understanding Failure to Ensure Proactive Maintenance

July 30, 2020


In today’s highly competitive environment, eliminating unplanned downtime is more important than ever, especially when that downtime occurs during a batch that places the product at risk. Did you know “best-in-class” maintenance organizations spend 90% of their total maintenance time on proactive maintenance (planned, kitted and scheduled corrective, preventive, and predictive tasks) and the remaining 10% on reactive (unplanned) maintenance work? There are many activities these “best-in-class” organizations do exceptionally well to achieve this high level of proactive maintenance performance. This article will look at how a greater understanding of failure will help your organization develop the right mindset to achieve “best-in-class” proactive maintenance performance levels.
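As a rough illustration of the 90/10 benchmark above, the proactive share can be estimated from labor hours logged against work orders. This is a minimal sketch, not a standard method: the `WorkOrder` structure, the work-type labels, and the sample hours are all hypothetical and would need to be mapped to whatever your maintenance management system actually records.

```python
from dataclasses import dataclass

# Hypothetical work-type labels; planned corrective, preventive, and
# predictive work counts as proactive, everything else as reactive.
PROACTIVE_TYPES = {"corrective_planned", "preventive", "predictive"}


@dataclass
class WorkOrder:
    work_type: str  # assumed label taken from the work order record
    hours: float    # labor hours charged to the work order


def proactive_share(orders):
    """Return the fraction of total maintenance hours spent on proactive work."""
    total = sum(o.hours for o in orders)
    if total == 0:
        return 0.0
    proactive = sum(o.hours for o in orders if o.work_type in PROACTIVE_TYPES)
    return proactive / total


# Invented sample data for illustration only.
orders = [
    WorkOrder("preventive", 40.0),
    WorkOrder("predictive", 20.0),
    WorkOrder("corrective_planned", 30.0),
    WorkOrder("reactive", 10.0),
]

print(f"Proactive share: {proactive_share(orders):.0%}")
```

With these sample hours, 90 of 100 hours are proactive, which matches the “best-in-class” split described above.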

Before jumping into the details, there are some key concepts to review:

  1. Failure is a process and not an event. If it is a process, we can manage it.
  2. Failures don’t just happen; they are caused. Because they are caused, most failures are preventable.
  3. Studies indicate that human errors cause over 70% of asset functional failures.
  4. Detecting and minimizing asset failures, including minimizing the impact of failures, is the responsibility of everyone in the organization and not just the maintenance department.

People expect systems to work, but they do fail. The Design for Reliability (DfR) process attempts to identify and remove failure modes during the design phase of projects. Still, most organizations do not typically consider the reliability of their assets and/or systems until they are already installed, commissioned, qualified, and operating.

Most maintenance programs are not based on mitigation strategies developed from the proactive analysis of failure modes. Instead, maintenance strategies are often developed only after an asset or system has failed. The typical response is to implement a time-based preventive maintenance (PM) task that is not linked to the failure mode and will not prevent the failure from recurring. A better understanding of failure will enable an organization to develop risk-based failure mitigation strategies that link the maintenance strategy to the specific failure mode.


Failure is either an incident or condition that causes an asset to degrade or become unable to perform its intended function safely, reliably, and cost-effectively.


Assets typically fail due to poor design, such as:

  • Improper materials of construction
  • Improper sizing
  • Not designed for continuous operation
  • Not designed for the operating environment or conditions
  • Poorly designed components
  • Lack of redundancy for critical assets to allow for proper maintenance, which creates single points of failure

And human errors, such as:

  • Overloading
  • Operational errors
  • Ignoring failure symptoms
  • Failure to perform, or improper performance of, preventive (PM) or predictive (PdM) maintenance
  • Not repairing an asset when there is evidence of a pending failure
  • Operator/Maintenance Technician skill level

An asset can also fail due to normal wear and tear or an act of nature (e.g., a 100-year flood).


An evident failure will, on its own, sooner or later, become evident to an operator under normal circumstances.

A hidden failure will not become evident to an operator under normal circumstances if it occurs on its own. Hidden failures do not have a direct impact, but they expose an organization to multiple failures with potentially severe or catastrophic consequences. Hidden failures are typically associated with protective devices which are not fail-safe.

Examples of hidden failures:

  • Temperature Switch – designed to shut down a process when the temperature rises above, or drops below, a set limit. A failed switch only matters, and only becomes evident, if the process temperature actually rises above, or drops below, the set limit (a second failure).
  • Emergency Stop (E-Stop) – designed to shut down a process in the event an emergency condition occurs. A failed E-Stop only matters, and only becomes evident, if an emergency condition actually occurs (a second failure).
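To make the “second failure” idea concrete, here is a minimal sketch of how the exposure from a hidden failure might be estimated. It assumes the two events are independent, so the chance of a multiple failure is simply the chance that the protective device is demanded multiplied by the chance that it is sitting in a failed state. The function name and the sample probabilities are hypothetical and are not taken from any specific standard.

```python
def multiple_failure_probability(p_demand, p_device_failed):
    """Estimate the probability of a multiple failure in a given period.

    p_demand: probability that the protected condition occurs in the period
              (for example, the process temperature leaves its limits).
    p_device_failed: probability that the protective device is in a failed
                     state when it is called on.
    Assumes the two events are independent.
    """
    for p in (p_demand, p_device_failed):
        if not 0.0 <= p <= 1.0:
            raise ValueError("probabilities must be between 0 and 1")
    return p_demand * p_device_failed


# Invented numbers: a 10% chance of a temperature excursion in the period
# and a 5% chance that the temperature switch has silently failed.
risk = multiple_failure_probability(0.10, 0.05)
print(f"Chance of an unprotected excursion: {risk:.2%}")
```

The point of the sketch is that the hidden failure alone costs nothing; the risk only appears when it is combined with the event the device was supposed to guard against.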


A complete failure is a total loss of an asset’s ability to perform its intended function.

A partial failure is a partial loss of an asset’s ability to perform its intended function.

Complete and partial failures can each be divided into age-related or random failures. Age-related and random failures can, in turn, be broken down into sudden or gradual failures.


Six general failure curves describe how the probability of failure changes over an asset’s life.

The X-axis represents time, and the Y-axis represents the probability of failure.

[Figure: RSmith Blog Charts-01, the six general failure curves]

Studies conducted over the decades indicate that somewhere between 75% and 89% of failures are random, and the remaining 11-25% are age-related. Typically, 80-90% of an organization’s proactive maintenance program is built around time-based PM tasks, which are not designed to address the 75-89% of failures that are random. Preventive (repetitive) maintenance is designed for the 11-25% of failures that are age-related (normal wear and tear); if other factors are at play, these tasks will be mostly ineffective. Performing a PM too often can even induce failures by over-“PMing” the equipment. Since random failures account for 75-89% of all failures, what are the primary sources of these failures? In short, people, and more specifically, the moments when people interact with your assets.


Age-Related and Random Failures can be either Sudden or Gradual failures. These four failure classifications will help determine the maintenance strategy.

[Figure: RSmith Blog Charts-02]

By understanding failure, your organization will be able to conduct Risk Assessments and Failure Modes and Effects Analysis (FMEA) to work through the following for each asset:

  1. Functions – primary and secondary
  2. Functional Failures
  3. Failure Modes
  4. Failure Causes
  5. Failure Effects
  6. Identify and prioritize risks based on severity, occurrence, and detection and assign a Risk Priority Number (RPN)
  7. Recommend actions to mitigate the risks identified
  8. Verify the results
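Step 6 above can be sketched in a few lines. The Risk Priority Number is conventionally the product of the severity, occurrence, and detection scores; the 1-10 scale used here is a common convention but is an assumption, as are the `FailureMode` structure, the asset names, and the scores, which are invented for illustration.

```python
from dataclasses import dataclass


@dataclass
class FailureMode:
    asset: str
    description: str
    severity: int    # assumed scale: 1 (negligible) to 10 (catastrophic)
    occurrence: int  # assumed scale: 1 (rare) to 10 (frequent)
    detection: int   # assumed scale: 1 (easily detected) to 10 (undetectable)

    def rpn(self):
        """Risk Priority Number = severity x occurrence x detection."""
        for score in (self.severity, self.occurrence, self.detection):
            if not 1 <= score <= 10:
                raise ValueError("scores must be between 1 and 10")
        return self.severity * self.occurrence * self.detection


def prioritize(modes):
    """Return failure modes ordered from highest to lowest RPN."""
    return sorted(modes, key=lambda m: m.rpn(), reverse=True)


# Hypothetical failure modes and scores.
modes = [
    FailureMode("Pump P-101", "Seal leak", 7, 5, 4),
    FailureMode("Pump P-101", "Bearing seizure", 9, 3, 6),
    FailureMode("Temperature switch", "Fails to trip", 10, 2, 9),
]

for mode in prioritize(modes):
    print(f"{mode.rpn():>4}  {mode.asset}: {mode.description}")
```

In this example the temperature switch ranks first (RPN 180) even though it fails least often, because a hidden failure scores poorly on detection. The ranking then feeds step 7, where mitigation actions are recommended for the highest-risk items first.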

We recommend starting these activities on the top 5-10% of your critical assets.
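One way to pick that starting group, assuming each asset already has a criticality score from a prior ranking exercise, is to sort by score and take the top slice. This is only a sketch under that assumption: the `top_critical` helper, the asset tags, and the scores are hypothetical, and it reads “top 5-10%” as a share of the whole asset list.

```python
import math


def top_critical(scores, fraction=0.10):
    """Return the names of the highest-scoring assets.

    scores: mapping of asset name to criticality score (higher = more critical).
    fraction: share of the asset list to select, e.g. 0.05 to 0.10.
    """
    if not 0 < fraction <= 1:
        raise ValueError("fraction must be greater than 0 and at most 1")
    if not scores:
        return []
    ranked = sorted(scores.items(), key=lambda item: item[1], reverse=True)
    # Round before taking the ceiling so floating-point noise cannot add an
    # extra asset; always select at least one asset.
    count = max(1, math.ceil(round(len(ranked) * fraction, 9)))
    return [name for name, _ in ranked[:count]]


# Invented asset list: 20 assets with criticality scores 1 through 20.
scores = {f"A-{i:02d}": i for i in range(1, 21)}
print(top_critical(scores, fraction=0.10))
```

For a list of 20 assets, a 10% fraction selects the two highest-scoring assets, which become the first candidates for the Risk Assessment and FMEA work.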

An effective maintenance program is a collection of mitigation methods that can only be developed after there is a thorough understanding of failure. Asset and system proactive maintenance strategies must be risk-based and provide a clear line of sight from the failure mode to the maintenance activity. Implementing proactive maintenance strategies will minimize the potential for supply chain disruptions.



About the Author

Robert Smith is a senior Asset Management & Reliability (AM&R) professional with 40+ years of engineering, maintenance, and project management experience, including developing and managing asset management and M&R programs using industry and global best practices and methodologies.