Single point of failure

Where to look and how to find them

Dave Rebbitt

If you’ve ever investigated an incident or led a formal hazard assessment or risk review, you may have heard this term. It comes up often in my conversations, usually in connection with incident investigations.

What is a single point of failure? That’s a point in the task where only one barrier or control remains to prevent an incident. Essentially, if that single thing fails, everything will go awry. At least, that’s my definition.

When we talk about making the workplace safe, the conversation almost always revolves around barriers to loss or similar terms. These hazard controls are in place to ensure a safe workplace. We are all familiar with the hierarchy of controls, and that’s a good place to start when we talk about a single point of failure.

Controls

Each type of control has its own set of problems. Take engineering controls, for example. They are known to work reliably. However, electrical circuit breakers can become “tired,” guardrails that are properly maintained can still fail, and even your ergonomic office chair can wear out. There is a heavy reliance on engineering controls in the workplace and in our daily lives. These things are designed into the world we live in and are usually in the background. Often, they evade regular checks to ensure they are functioning properly, and when they fail, it usually causes some problems, but not a catastrophic outcome. I say “usually” because if something like a shear pin on a winch fails, or does its job and shears through, bad things are going to happen.

The key aspect of engineering controls is that there is usually more than one. Take your vehicle. It has many features to keep you safe. Newer vehicles have lane-departure warning systems, automatic braking systems, antilock braking systems, safety glass, and crumple zones, all designed to protect occupants. In the workplace, you have power tools that are grounded and plugged into an electrical outlet connected to a circuit breaker that will trip in the event of an overcurrent condition. Your power tool is also likely to have a trigger that requires positive pressure to keep the tool running. So when it comes to engineering controls, you don’t usually see a single point of failure, a single point at which a failure leads to a very negative outcome. Usually there are other engineering controls and, well, aren’t there other layers of controls?

Now I know I’m supposed to talk about administrative controls right here, but let’s skip to personal protective equipment for just a moment. Personal protective equipment is the last line of defense. It is the least effective of all the controls. Usually, the personal protective equipment is a single point of failure. You have safety glasses to protect your eyes, gloves to protect your hands and boots to protect your feet. The choices with personal protective equipment are almost binary. Wear it, or don’t wear it. Sure, you can wear it poorly and degrade its effectiveness, but, after all, it is the last line of defense. If your PPE gets tested, it’s a good bet some other things have gone seriously wrong.

Now, let’s discuss administrative controls. These are the paperwork: procedures, forms, and checklists. They are meant to serve as a guide for those doing the work, ensuring they do so in the safest manner possible. By following a procedure, a worker limits their exposure to the hazards associated with the task. Or at least that’s the theory. Procedures tend to be filed away on SharePoint sites or in software applications and are rarely viewed. Even so, they serve as a helpful guide for those observing the work. If we generally agree that people follow procedures, then having a single point of failure may not be such a huge concern. Other administrative controls include warning signs, labels, and flagging, such as flagging around an open excavation with an open-excavation sign. Those work fairly well, but only when people notice them and decide to follow the warnings they provide. They do not work when people can’t read the sign. Sure, you could put a pictogram on the sign and hope that someone will understand that it denotes danger and not cross the flagging. The other common administrative control is training. We provide people with all kinds of training. It could be safety-specific training, such as WHMIS 2015 or confined space entry training. It can also be training on equipment or trade training.

People

The key aspect of administrative controls is that they rely on human interaction to be effective. There’s a heavy reliance on people making appropriate decisions. So, while the impression may be that there are a multitude of administrative controls, coupled with engineering controls and even personal protective equipment, at the center of it all is a person. This is often a single person. A single point of failure.

A single person making a single decision or determination may not necessarily be hazardous. In some settings, work is checked, so a bad decision or error is unlikely to cause a serious issue. However, in dynamic workplaces where many aspects of the work or environment are changing, good decision-making is particularly important.

Although you have built all these barriers to loss, the fulcrum of those controls is typically a single person. That would normally be a worker or supervisor.

Many tasks carry an elevated risk. Working at heights, crane lifts, confined space entry, and lockout are good examples.

In these cases, the single point of failure is often the supervisor. Supervisors come in many varieties. They can be very experienced or newly promoted. They can be highly trained, experienced, or both. They can be dealing with a job that is on schedule or behind schedule.

In high-risk activities, some believe that the worker is the single point of failure. They look to the worker to alert them if there’s a problem. I don’t think that’s really right. The single point of failure is the supervisor. The supervisor is often the one who determines whether it is safe to conduct that work. The supervisor is the one who makes critical determinations about whether equipment is safe to use, whether the readings in a confined space have been taken properly, and whether the sling used for a lift is fit for service.

It isn’t always the supervisor. There can be a worker who makes these determinations. There are a bunch of reasons why people might get things wrong. People are fallible. They are certainly not as reliable as an engineering control, but they are more reliable than poorly worn personal protective equipment.

Are you even looking?

Single points of failure are certainly worth considering when reviewing your critical task list. A critical task is one where, if it’s not done correctly, the outcome could be serious. As in seriously bad. When something bad does happen, we conduct an incident investigation, and many people put together causal chains, examining the contributing or immediate causes as well as the underlying or root causes. Those causal chains can be helpful in identifying single points of failure.

When conducting a risk assessment, we try to identify controls, but we often don’t consider whether those controls work together at the same point in the same task or whether there is a single point of failure. And if that single point of failure is a human being, maybe it’s time to involve another human being. Two heads are obviously better than one. That’s why we do risk assessments as a group.

So, the next time you’re conducting an incident investigation or risk assessment, consider where the single point of failure was. Because that single point of failure may still exist, and it may not have been identified as a cause in your incident. Perhaps next time you’re reviewing those organizational or formal hazard assessments, you might consider identifying single points of failure where a single failure can lead to events that result in an incident.

Whenever a single person has to make a critical decision, they rely on their training, skill, and experience. Not all people and not all things are created equal. It is easy to blame people after an incident for failing to see or consider something. Wouldn’t it be better if we proactively identified the single points of failure and had two supervisors make a determination rather than one? It doesn’t happen very often unless you’re doing a fit-for-duty intervention. Why is that? Because in a fit-for-duty intervention, you need to be sure. So you involve more than one person because a single person could make a mistake. Ideally, that means two people who are equal in terms of responsibility or authority. Groups tend to defer to the person in charge, so without that equality there may still be a single point of failure.

Those single points of failure are out there. They are built into your systems. Finding them can be very easy or very hard, but you can never find what you don’t look for.