What is a single point of failure (SPOF)?

A single point of failure (SPOF) describes a system vulnerability in the form of a single component. If the component fails, the entire system fails. What are the different types of SPOF and how can you minimise the risk of SPOFs happening?

What is a single point of failure?

A single point of failure (SPOF) describes a type of vulnerability within a system. A SPOF exists when the malfunction of a single component causes the failure of the entire system. Several 'failure modes' exist, which can be broadly divided into three categories:

  1. Achilles’ heel or 'weakest link in the chain': The loss of one component leads to a sudden loss of function of the entire system.
  2. Chain reaction or 'domino effect': The failure of one component causes the successive failure of other components leading to the failure of the entire system.
  3. Bottleneck: A component acts as a limiting factor of the overall system. If the limiting component is impaired, the performance of the system is reduced to the capacity of the component.
Note

A single point of failure doesn’t necessarily describe a technical component. One of the most frequent cases is human error.

Where do single points of failure occur most often?

SPOFs are common in complex systems whose components are interconnected across multiple layers. Depending on the structure of the system, the failure of one critical component can bring down the whole system.

The complexity of a multi-layered system can make it difficult to detect SPOFs. There’s no easy way to identify the interactions of individual components, so faults and issues are hard to spot. In principle, every non-redundant component critical for operation should be treated as a single point of failure.
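When the dependencies between components are known, the rule of thumb above can be checked mechanically. The following is a minimal sketch rather than a complete analysis tool: it assumes the system can be modelled as a directed graph, and the topology and component names are hypothetical. It removes each component in turn and tests whether the client can still reach the database.

```python
from collections import deque


def reachable(graph, start, goal, removed=None):
    """Return True if `goal` can be reached from `start` without using `removed`."""
    if start == removed or goal == removed:
        return False
    seen = {start}
    queue = deque([start])
    while queue:
        node = queue.popleft()
        if node == goal:
            return True
        for neighbour in graph.get(node, ()):
            if neighbour != removed and neighbour not in seen:
                seen.add(neighbour)
                queue.append(neighbour)
    return False


def find_spofs(graph, start, goal):
    """List intermediate components whose removal cuts every path from start to goal."""
    nodes = set(graph) | {n for targets in graph.values() for n in targets}
    candidates = sorted(nodes - {start, goal})
    return [n for n in candidates if not reachable(graph, start, goal, removed=n)]


# Hypothetical topology: the switches are duplicated,
# but there is only one firewall and one application server.
topology = {
    "client": ["firewall"],
    "firewall": ["switch-a", "switch-b"],
    "switch-a": ["app-server"],
    "switch-b": ["app-server"],
    "app-server": ["database"],
}

spofs = find_spofs(topology, "client", "database")
print(spofs)
```

In this model the two switches back each other up, so only the firewall and the application server are reported. The end points themselves are excluded from the candidates and would need to be assessed separately.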

Take the human body, for example. We’ve only got one heart and one brain – these critical organs are not designed redundantly. If one of them fails, the entire body fails. The heart and the brain are SPOFs. By contrast, eyes, ears, lungs, and kidneys are duplicated. If necessary, the body compensates for the failure of one and continues operating at reduced efficiency.

In a data centre, all components critical to operation are potential SPOFs. Therefore, servers are usually equipped with redundant connections to the power grid and network. Mass storage is provided redundantly via RAIDs or similar technologies. The aim is to ensure the system continues to operate should a single, critical component fail.

What are some classic SPOF examples?

There are many different types of single points of failure (SPOFs). After all, SPOFs don’t just affect information systems. Let’s take a look at some examples.

Death Star destroyed by single point of failure

In the popular 'Star Wars' movies, a single point of failure leads to the destruction of the dreaded 'Death Star'. A single proton torpedo fired by the protagonist hits a critical spot on the reactor. The explosion causes a catastrophic chain reaction that destroys the entire Death Star.

Suez Canal paralysed by single point of failure

In 2021, container ship 'Ever Given' got stuck in the Suez Canal. The ship ran aground at a critical section of the canal acting as a single waterway. The blockage paralysed shipping traffic along the entire canal. The single point of failure was the non-redundant waterway.

Boeing 737 MAX brought down by SPOF

In 2018 and 2019, two 'Boeing 737 MAX' aircraft crashed, causing the loss of hundreds of lives. The cause of the crashes was a single sensor feeding erroneous data to the automatic flight control system. Acting on that data, the system didn’t perform correctly and ultimately brought down the planes. Several errors came together, but the single point of failure was the sensor.

High-availability systems taken offline by SPOF

Even systems designed for high availability aren’t fully protected from SPOFs. In recent years, major cloud services have repeatedly experienced serious failures. In most cases, the single point of failure was human. The wrong configuration data can quickly paralyse an entire production system, even if its components are designed redundantly.

DNS as single point of failure in computing systems

Your device is connected to Wi-Fi, but the web browser isn’t working. Then the clock starts automatically adjusting the time. Sound familiar? It’s enough to make you tear your hair out, but the answer is simple:

Quote

'It’s always DNS.' – Source: https://talesofatech.com/2017/03/rule-1-its-always-dns/

The catchphrase 'It’s always DNS' sounds like a joke but is a serious description of how error-prone the Domain Name System (DNS) can be. After all, when DNS servers don’t answer, websites and services can fail in a variety of ways. The effect is similar to having your connection to the Internet cut, even though packet traffic between IP addresses still works in this case.

DNS errors usually occur on the user side if the DNS servers stored in the system are not accessible. It’s therefore best practice to store two name server addresses. If the first server is unavailable, the second is used. This creates redundancy and resolves the single point of failure.

Often, both DNS servers belong to the same organisation. If one of them fails, there’s a high probability that the other is also affected. To be safe, you can store the addresses of two name servers from different organisations. A popular combination is 1.1.1.1 from Cloudflare as the primary and 9.9.9.9 from Quad9 as the secondary DNS server.
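The fallback behaviour can be sketched in a few lines. This is an illustration of the principle only: the function name `resolve_with_fallback` and the two stand-in lookup functions are hypothetical, and no real DNS query is sent. In practice, the operating system's resolver handles this for you once two name servers are configured.

```python
def resolve_with_fallback(hostname, resolvers):
    """Try each (name, lookup) pair in order and return the first successful answer."""
    errors = []
    for name, lookup in resolvers:
        try:
            return name, lookup(hostname)
        except OSError as exc:  # TimeoutError and ConnectionError are subclasses
            errors.append(f"{name}: {exc}")
    raise OSError("all resolvers failed: " + "; ".join(errors))


# Stand-ins for real DNS queries (assumption: the primary is unreachable).
def primary_down(hostname):
    raise TimeoutError("no response from 1.1.1.1")


def secondary_ok(hostname):
    return "192.0.2.10"  # address from the documentation range, not a real result


used, address = resolve_with_fallback(
    "example.org",
    [("1.1.1.1", primary_down), ("9.9.9.9", secondary_ok)],
)
print(used, address)
```

Because the second resolver is only consulted after the first one fails, a lookup still succeeds as long as at least one of the two is reachable.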

Java logging library as single point of failure

At the end of 2021, a large number of Java-based web services were affected by the Log4Shell vulnerability. The single point of failure was a Java logging library called Log4j. In the worst case, an attack led to the complete takeover of a vulnerable system.

How to avoid SPOFs?

Generally, prevention is the best strategy to avoid SPOFs. By definition, a single point of failure leads to the loss of function of the entire system. Once that happens, it’s often too late, and limiting the damage may be your only option.

That’s why preventive measures and planning for emergencies are a better strategy. You can act out credible failure scenarios and analyse risks and possible protective measures. Different types of single points of failure can be prevented by certain features in a system:

System feature | Protects against | Description | Example
Redundancy | Achilles’ heel, bottleneck | System can continue to operate without performance degradation in the event of failure | Multiple DNS servers stored in network device
Diversity | Chain reaction | Lowers risk of redundant components being affected by failure | Linux computers not vulnerable to Windows Trojans
Distribution | Chain reaction, Achilles’ heel, bottleneck | Lowers risk of redundant components being affected by failure | Head of state doesn’t travel on the same plane as their deputy
Isolation | Chain reaction | Disrupts domino effect | Fuse protects power grid from overload
Buffer | Bottleneck | Absorbs load peaks occurring before bottlenecks | Queue in front of check-in counter at airport
Graceful degradation | Achilles’ heel, chain reaction | Allows for continued operation of the system without catastrophic result in case individual components fail | When losing one eye, vision is not entirely lost but depth perception is disrupted
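In software, the fuse from the 'Isolation' row is commonly imitated with a pattern known as a circuit breaker. The sketch below is a deliberately simplified version and is not taken from any particular library: the class, the threshold of three failures, and the `broken_service` function are all assumptions for illustration. Production implementations usually also retry after a cool-down period, which is omitted here.

```python
class CircuitBreaker:
    """Stop calling a failing dependency after `max_failures` consecutive errors."""

    def __init__(self, max_failures=3):
        self.max_failures = max_failures
        self.failures = 0

    @property
    def is_open(self):
        # "Open" means the fuse has blown: calls are no longer passed through.
        return self.failures >= self.max_failures

    def call(self, func, *args, fallback=None):
        if self.is_open:
            return fallback  # fail fast without touching the dependency
        try:
            result = func(*args)
        except Exception:
            self.failures += 1
            return fallback
        self.failures = 0  # a success resets the counter
        return result

    def reset(self):
        """Close the breaker again, e.g. after the dependency was repaired."""
        self.failures = 0


calls = []


def broken_service(request):
    """Hypothetical dependency that is currently down."""
    calls.append(request)
    raise ConnectionError("service unavailable")


breaker = CircuitBreaker(max_failures=3)
responses = [breaker.call(broken_service, i, fallback="cached") for i in range(5)]
print(len(calls), responses)
```

After three consecutive failures the breaker stops forwarding requests, so the broken service is not hit by the remaining calls and the failure cannot spread to callers waiting on it.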

Well-prepared, critical systems are subjected to continuous monitoring to detect errors as early as possible and correct them if necessary.

Minimise single points of failure through redundancy

One recommendation to counteract SPOFs is to build redundancies. Several instances of a critical component (e.g., power supply, network connection, DNS server) are operated in parallel. If one fails, the system continues to operate without loss of performance.

Redundancy also prevents many SPOFs on the software side. One example is the popular microservice architecture compared to the software monolith. In a system of microservices, the individual services are decoupled and less complex, making the system more robust against SPOFs. And since microservices are typically launched as containers, it’s easier to build redundancies.

But how exactly does redundancy protect a system? Let’s illustrate using the product rule for independent probabilities, on which the reliability estimate known as 'Lusser’s law' is also based. Here’s a thought experiment:

Assume a system has two independent, parallel connections to a power supply. Let us further assume that the probability of each connection failing within a given period is 1 percent. Then the probability of complete failure of the power link can be calculated as the product of the individual probabilities:

  1. Probability of failure of an instance:

1% = 1 / 100 = 1 / 10 ^ 2 = 0.01

  2. Probability of both instances failing within the same period:

1% * 1% = (1 / 10 ^ 2) ^ 2 = 1 / 10 ^ 4 = 0.0001

As you can see, running two instances doesn’t merely halve the probability of total failure but reduces it by two orders of magnitude. That’s a considerable improvement. With three instances running in parallel, the same calculation gives a probability of one in a million, so a failure of the entire system becomes very unlikely.
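The calculation generalises to any number of instances. The short sketch below assumes, as in the thought example, that all instances fail independently and with the same probability; the function name is chosen for illustration.

```python
def total_failure_probability(p_single, instances):
    """Probability that all independent, identical instances fail in the same period."""
    if not 0.0 <= p_single <= 1.0:
        raise ValueError("p_single must be between 0 and 1")
    if instances < 1:
        raise ValueError("instances must be at least 1")
    return p_single ** instances


# 1 percent failure probability per instance, as in the example above.
for n in (1, 2, 3):
    print(n, total_failure_probability(0.01, n))
```

Each additional instance multiplies the probability of total failure by the single-instance probability, which is why the benefit grows so quickly.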

Unfortunately, redundancy is no panacea. It protects a system from SPOFs only under certain assumptions. Above all, the probability of failure of an instance must be independent of the probability of failure of the redundant instance(s). That’s not the case where a failure is caused by an external event. If a data centre is on fire, redundant components fail together.
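A rough way to see the effect of such a shared event is to add it to the calculation. The sketch below is a simplified model with assumed numbers: `p_common` stands for the probability of an external event (such as a fire) that takes out every instance at once, and the value of 0.1 percent is purely hypothetical.

```python
def failure_with_common_cause(p_single, instances, p_common):
    """Total failure probability when a shared event can disable all instances at once."""
    independent = p_single ** instances  # all instances fail on their own
    # Either the shared event occurs, or it doesn't and all instances fail independently.
    return p_common + (1.0 - p_common) * independent


# Assumed values: 1 percent per instance, 0.1 percent for the shared event.
for n in (1, 2, 3):
    print(n, failure_with_common_cause(0.01, n, 0.001))
```

With these numbers, the shared event dominates from the second instance onwards: adding a third instance barely changes the result, because the total can never drop below `p_common`.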

In addition to redundancy of deployed components, distribution of certain components is critical to mitigate SPOFs. Geographic distribution of data storage and computing infrastructure protects from environmental disasters. Further, it pays to strive for some heterogeneity or diversity of critical system components. Diversity reduces the probability of redundant instances failing.

Let’s illustrate the advantage of diversity using the example of cybersecurity. Imagine a data centre with redundant load balancers of the exact same design. A security vulnerability in one of the load balancers is also present in the redundant instances. In the worst case, an attack will paralyse all instances. By using different models, the overall system stands a better chance of continuing to operate at reduced performance.

Other strategies to minimise SPOF

Redundancy isn’t always sufficient to prevent SPOFs. And some components cannot be designed redundantly. When creating redundancy isn’t an option, other strategies come into play.

The 'defence in depth' approach is well known from cybersecurity. This involves drawing multiple, independent rings of protection around a system. These must be breached one after another to bring about system failure, which lowers the likelihood of the entire system failing because of a single component.

With respect to digital systems, special programming languages with a built-in fault tolerance exist. The best-known example is the 'Erlang' language developed at the end of the 1980s. Together with the associated runtime environment, the language is suitable for creating highly available, fault-tolerant systems.

The global triumph of the World Wide Web and the spread of web development presented a new challenge. Programmers were forced to develop websites that would work on a variety of devices. The basic approach used in this process is known as 'graceful degradation'. If a browser or device doesn’t support a particular technology on a page, the page is rendered as well as possible. This is a 'fail-soft' approach:

System status | Description
go | System operates safely and correctly
fail-operational | System operates fault-tolerantly without performance degradation
fail-soft | System operation ensured, but performance reduced
fail-safe | No operation possible, system security still guaranteed
fail-unsafe | Unpredictable system behaviour
fail-badly | Predictably poor to catastrophic system behaviour
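The first four states in the table can be illustrated with a request handler that degrades step by step. This is a sketch under assumptions: the `handle` function, the `primary` and `standby` callables, and the cached answer are hypothetical stand-ins, not a prescribed design.

```python
def handle(request, primary, standby, cached_answer=None):
    """Return (status, response), degrading step by step instead of crashing."""
    try:
        return "go", primary(request)
    except Exception:
        pass  # primary is down, try the equivalent standby
    try:
        return "fail-operational", standby(request)
    except Exception:
        pass  # no full-quality path is left
    if cached_answer is not None:
        return "fail-soft", cached_answer  # stale but still usable
    return "fail-safe", None  # refuse cleanly rather than guess


def healthy(request):
    return f"fresh result for {request}"


def down(request):
    raise ConnectionError("backend unavailable")


print(handle("report", healthy, healthy))
print(handle("report", down, healthy))
print(handle("report", down, down, cached_answer="yesterday's report"))
print(handle("report", down, down))
```

The handler never raises, so a failing backend lowers the quality of the answer instead of taking the caller down with it.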

How to find a SPOF in your IT?

Don’t wait until the system fails to identify its single points of failure. You’ll want to proceed proactively as part of a risk management strategy. Analyses from reliability engineering, such as fault tree analysis, event tree analysis, and Failure Mode and Effects Analysis (FMEA), are used to identify the most critical sources of failure.

The general approach to identifying single points of failure in enterprise IT is to perform a risk assessment of the three main dimensions:

  • Hardware
  • Software/services/provider
  • Personnel

First, create a SPOF analysis checklist that identifies the general areas for further analysis. Then perform a detailed SPOF analysis of the individual areas, looking for:

  • Unmonitored devices in the network
  • Non-redundant software or hardware systems
  • Staff and providers who cannot be replaced in an emergency
  • Any data not included in backups

For each system component, the negative effect of failure is analysed. Furthermore, the approximate probability of a catastrophic failure is estimated. The results are incorporated into an overarching 'disaster recovery' plan to ensure data centre security.
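One simple way to combine the two estimates is a risk score of probability multiplied by impact. The sketch below assumes this scoring scheme and uses an invented inventory; the component names, probabilities, and the 1 to 5 impact scale are hypothetical and would have to come from your own assessment.

```python
from dataclasses import dataclass


@dataclass
class Component:
    name: str
    failure_probability: float  # estimated chance of failure per year, 0..1
    impact: int                 # 1 (minor) to 5 (catastrophic)
    redundant: bool

    @property
    def risk(self):
        return self.failure_probability * self.impact


def rank_spof_candidates(components):
    """Return the non-redundant components, highest risk first."""
    candidates = [c for c in components if not c.redundant]
    return sorted(candidates, key=lambda c: c.risk, reverse=True)


# Hypothetical inventory covering hardware, software, and personnel.
inventory = [
    Component("core switch", 0.05, 5, redundant=False),
    Component("sole backup administrator", 0.20, 3, redundant=False),
    Component("web server", 0.10, 4, redundant=True),
    Component("licence server", 0.02, 2, redundant=False),
]

ranking = [c.name for c in rank_spof_candidates(inventory)]
print(ranking)
```

Redundant components are filtered out first, so the resulting list shows where an additional instance, a deputy, or a backup would reduce the risk most.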

As a basic measure to avoid SPOFs, redundancy of data storage and computing power should be ensured at three levels:

  • At the server level through redundant hardware components
  • At the system level through clustering, i.e. the use of multiple servers
  • At the data centre level through geographically distributed operating sites