Understanding High Availability: How Does It Work and Why Is It Important?


In today’s fast-paced and increasingly interconnected world, businesses rely upon uninterrupted access to data and services. Achieving high availability in networks is essential for ensuring that customers and employees alike have the resources they need, the moment they need them. High availability in networking provides redundant network components and systems to ensure uninterrupted service in the event of a hardware or software failure.

What is high availability?

High availability (HA) is a system design principle that ensures a system can withstand a certain amount of failure and still maintain acceptable levels of performance. In other words, HA systems are designed to keep running even when parts of them fail. Using redundancy and replication, highly available systems ensure that the failure or disruption in a single component does not impact the availability of the entire service.

High availability clusters

High availability clusters, also referred to as failover clusters, are groups of physical machines (hosts) that work together to keep applications running with minimal downtime. With multiple redundant computers, a cluster can provide uninterrupted service even when individual components or hosts fail. In an HA cluster, all hosts have access to the same shared storage, so when the virtual machines (VMs) on one host fail, they can be brought back up on another host, ensuring continuous uptime.


How does high availability work?

High availability can be implemented in various ways, such as clustering, load balancing, and failover. Clustering involves grouping multiple servers together to provide a single point of access, allowing another system in the group to take over in the event of a failure. Load balancing involves distributing the load across multiple systems to ensure that no one system is overloaded, while failover is a process that allows a system to switch to a backup system if the primary system fails. HA systems are designed based on three key principles: eliminating single points of failure, reliable crossover, and detection of failures.

Eliminating single points of failure: Having a single point of failure is, by definition, the opposite of a highly available setup (for example, running only one copy of your application). If a component serving a specific role in the network chain fails and goes down, you have an outage that impacts users trying to access the system. Eliminating single points of failure means identifying where those situations exist in your architecture and implementing redundancy, and a plan for handling failure, at each of those points in the system.
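As a concrete (and deliberately simplified) illustration of redundancy, here is a minimal Go sketch that removes a single point of failure by placing two redundant application instances behind a small round-robin proxy. The backend addresses are hypothetical placeholders.

```go
// Minimal round-robin load balancing across redundant backends.
// The backend addresses are hypothetical placeholders.
package main

import (
	"log"
	"net/http"
	"net/http/httputil"
	"net/url"
	"sync/atomic"
)

func main() {
	// Two redundant copies of the application; if one fails,
	// the other can still serve traffic.
	backends := []*url.URL{
		mustParse("http://app-instance-a:8080"),
		mustParse("http://app-instance-b:8080"),
	}

	var next uint64
	proxy := &httputil.ReverseProxy{
		Director: func(req *http.Request) {
			// Pick the next backend in round-robin order so that
			// no single instance is overloaded.
			target := backends[atomic.AddUint64(&next, 1)%uint64(len(backends))]
			req.URL.Scheme = target.Scheme
			req.URL.Host = target.Host
		},
	}

	log.Fatal(http.ListenAndServe(":80", proxy))
}

func mustParse(raw string) *url.URL {
	u, err := url.Parse(raw)
	if err != nil {
		log.Fatal(err)
	}
	return u
}
```

In practice, a dedicated load balancer or ingress proxy plays this role rather than hand-rolled code, but the principle is the same: no single instance is the only path to the application.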

Reliable crossover: In the event of a failure, reliable crossover means a seamless cutover from your primary instance or set of instances to your backup. For example, in a scenario where you have a couple of instances of your application running and instance A goes down, you need to automatically handle the cutover to instance B without any downtime or error messages reaching the user.

Detection of failures: Failure detection goes hand in hand with reliable crossover; in fact, it is what drives the crossover mechanism. Failure detection involves setting up health checks on all the copies of your application, allowing you to identify failures in your instances very early on. If, for example, you've got instance A and instance B in your backend load balancing pool, and all of a sudden instance A stops passing the health checks, by quickly detecting that failure you can transparently cut over to routing traffic only to instance B, avoiding downtime.

Note: The above scenario, where you have two instances always running, is an active/active high availability design approach — both instances are always active and handling traffic, and traffic is rerouted from one to the other in case of failure. A different approach would be to have an active/passive HA architecture where you have one active instance (instance A) and one passive or inactive instance (instance B). In this scenario, all traffic is handled by instance A while instance B remains on standby until there is a failure in instance A, in which case all traffic gets rerouted to instance B.
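To make the interplay between failure detection and crossover concrete, below is a minimal Go sketch of an active/passive design: a loop probes the primary instance's health and, when it stops responding, marks the standby as active. The instance addresses and the /health endpoint are illustrative assumptions, not part of any particular product.

```go
// Active/passive failover driven by periodic health checks.
// Addresses and the /health path are illustrative assumptions.
package main

import (
	"log"
	"net/http"
	"sync/atomic"
	"time"
)

var active atomic.Value // address of the instance currently receiving traffic

// healthy reports whether an instance passes its health check.
func healthy(addr string) bool {
	client := http.Client{Timeout: 2 * time.Second}
	resp, err := client.Get(addr + "/health")
	if err != nil {
		return false
	}
	defer resp.Body.Close()
	return resp.StatusCode == http.StatusOK
}

func main() {
	primary := "http://instance-a:8080"
	standby := "http://instance-b:8080"
	active.Store(primary)

	// Check every few seconds so failures are detected early and the
	// cutover happens before users see errors. A proxy in front would
	// route all requests to whatever address active.Load() returns.
	for range time.Tick(5 * time.Second) {
		switch {
		case healthy(primary):
			active.Store(primary) // fail back once the primary recovers
		case healthy(standby):
			active.Store(standby) // crossover: standby becomes active
		default:
			log.Println("both instances failing health checks")
		}
	}
}
```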

The diagram below presents a basic architecture of an HA system.

[Diagram: basic architecture of a high availability system]

High availability vs. fault tolerance

High availability and fault tolerance both refer to designing systems with the possibility of failures in mind, so that they can tolerate those failures and continue operating. Although the two terms are often used interchangeably, there is a nuanced difference between them. Fault tolerance expands on high availability and offers a greater level of protection in case infrastructure components fail. It is essentially a more stringent version of HA, where even less (or no) downtime is expected; however, that usually comes at additional cost due to the increased resilience.

High availability vs. disaster recovery

High availability deals with failures at the component level: if a component fails, the wider system continues to run unaffected. Disaster recovery, on the other hand, deals with wider system failures. For example, disaster recovery principles address the scenario where a local data center disappears in a natural disaster and you need to figure out how to route your applications around it. In a nutshell, high availability makes sure that there are processes and systems in place to deal with failures in individual components of the system, while disaster recovery makes sure such processes are in place to deal with a failure that impacts a larger portion of the entire system.

How can you achieve high availability?

Now that you have a better understanding of what high availability is, the next step is to understand how you can get there. To achieve high availability for your systems, there are a few steps to keep in mind.

Design systems with HA in mind: When you are initially designing a system, identify all the potential single points of failure and plan around them, both from a technical administration standpoint and in the budget for the additional components you will need to eliminate them.

Define success parameters: Evaluate the trade-off between the need for uptime and the cost of running an HA system by deciding what level of downtime is acceptable. It is also important to decide ahead of time on the metrics you will monitor to define the success of your HA initiative.

Continuously test failover systems: Regularly test the cutover mechanisms you have in place and make sure they are working as intended. You don't want the first time you test a failover to be during an actual failover event!

Continuous monitoring: By continuously monitoring usage metrics and the general load on the system, you get a better understanding of the state and needs of your system, and whether you have properly deployed for HA (see the instrumentation sketch after these steps). This understanding can drive decisions on whether you need to scale out to additional instances, deploy more copies of your application, and so on.

Analyze and evaluate data: This goes hand in hand with the previous step and is the direct outcome of continuous monitoring. Make sure you properly analyze the data you collect and take the appropriate actions.
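As a small illustration of the monitoring and failure-detection hooks described in these steps, here is a sketch of an application instance that exposes a request counter (using Go's standard expvar package, which serves metrics at /debug/vars) alongside a /health endpoint. The paths and the metric name are illustrative choices, not a prescribed convention.

```go
// An application instance instrumented for monitoring and health checks.
// The /health path and the metric name are illustrative choices.
package main

import (
	"expvar" // importing expvar registers a /debug/vars metrics handler
	"fmt"
	"log"
	"net/http"
)

var requests = expvar.NewInt("requests_total")

func main() {
	http.HandleFunc("/", func(w http.ResponseWriter, r *http.Request) {
		requests.Add(1) // track load to inform scale-out decisions
		fmt.Fprintln(w, "hello from this instance")
	})

	// Health endpoint consumed by a failure-detection loop like the
	// one sketched earlier.
	http.HandleFunc("/health", func(w http.ResponseWriter, r *http.Request) {
		w.WriteHeader(http.StatusOK)
	})

	log.Fatal(http.ListenAndServe(":8080", nil))
}
```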

Why is high availability important?

So, why would you go through this technical process? If it is not obvious by now, high availability reduces risk across your system. By eliminating single points of failure and implementing reliable crossover and timely failure detection, you drastically reduce the risk of downtime for your application and ensure that your customer-facing services are reliable and maintain a high level of uptime. High availability, for example, is crucial for any e-commerce company that wants its store always up and running, since any downtime has a direct impact on revenue and customer satisfaction.

That being said, it is also important to note that high availability comes with a significant trade-off. HA systems are more complex to create and manage, with many moving pieces to account for. That complexity carries both management and monetary costs. By definition, a high availability setup has more components running than a simple architecture with a single point of failure, which translates to either additional cloud spend, if you're running in the cloud, or paying for more hardware in a data center.

The balance between the benefits of high availability and its cost and complexity is the main driver of most discussions around HA. If there were no trade-off to evaluate, every application on the internet would have 100% uptime and would be fully fault tolerant and highly available. In practice, deciding whether to adopt HA principles becomes an exercise in finding the sweet spot: a sufficient level of uptime and resiliency without unbearable infrastructure and cloud costs.

How can you measure high availability?

Availability is defined as the percentage of time a system is available to its users. It is calculated with a basic formula: the ratio of the time the system is actually up (uptime) to the total amount of time being measured (uptime + downtime).
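As a worked illustration with hypothetical numbers: a system that was up for 8,750 of the 8,760 hours in a year has an availability of 8,750 / 8,760 ≈ 99.89%. The familiar "nines" targets follow from the same ratio: 99.9% availability allows roughly 8.8 hours of downtime per year, while 99.999% ("five nines") allows only about 5.3 minutes.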

There are three key metrics that are commonly used to measure availability effectively.

  • Mean Time Between Failures (MTBF): The average time between the point when a system begins normal operation and its next failure.
  • Mean Time To Repair/Recovery (MTTR): The average time the system is unavailable while the failed component is repaired or returned to service. Together with MTBF, this determines overall availability, as sketched after this list.
  • Mean Time To Detection (MTTD): This metric is, in fact, part of MTTR and refers to the time between the moment the failure occurs and the moment repair operations begin.
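MTBF and MTTR combine directly into the availability figure: a common steady-state formulation is availability = MTBF / (MTBF + MTTR). The sketch below applies it to hypothetical numbers.

```go
// Steady-state availability from MTBF and MTTR; the input figures
// are hypothetical.
package main

import "fmt"

// availability applies the standard formula MTBF / (MTBF + MTTR),
// expressed as a percentage. Both inputs use the same time unit.
func availability(mtbfHours, mttrHours float64) float64 {
	return mtbfHours / (mtbfHours + mttrHours) * 100
}

func main() {
	// A component that fails on average every 1,000 hours and takes
	// 2 hours to repair:
	fmt.Printf("availability: %.3f%%\n", availability(1000, 2))
	// Prints: availability: 99.800%
}
```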

To better understand how availability is measured, check out this detailed paper by AWS.

Achieve high availability with Traefik Enterprise

Ensuring an optimal level of uptime means implementing highly available systems at every point of your networking architecture. Traefik Enterprise provides a distributed architecture with multiple ingress proxies that ensure fault tolerance in handling incoming traffic, as well as a highly available control plane.

The diagram below shows the general architecture of Traefik Enterprise and how it applies HA principles.

Interested in learning more about Traefik Enterprise and how it can help you achieve high availability for all your applications? Don’t hesitate to book a demo or try it out yourself with our 30-day free trial.

