Short note on high availability and fault tolerance & scalability and fault tolerance.

- September 12, 2022

HIGH AVAILABILITY AND FAULT TOLERANCE

An effective IT infrastructure must keep functioning even in the rare event of a network outage, device failure, or power loss. When part of the system fails, one or more of the three major availability techniques will kick in: high availability, fault tolerance, and/or disaster recovery. While each of these infrastructure design solutions contributes to the availability of your key applications and data, they do not fulfill the same goal. Running a High Availability infrastructure does not mean you can skip setting up a disaster recovery site; relying on HA alone risks disaster.

HIGH AVAILABILITY

A High Availability (HA) system is meant to be up and running 99.99 percent of the time, or as close to it as feasible. Typically, this entails creating a failover system capable of handling the same workloads as the original system. In a virtualized environment, HA works by pooling virtual machines and their related resources inside a cluster. When one of the hosts or virtual machines fails, the affected virtual machine is restarted on another host in the cluster. In physical infrastructure, HA is achieved by designing the system with no single point of failure; in other words, redundant components are required for all key power, cooling, computing, network, and storage infrastructure.
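To make the 99.99 percent figure concrete, the short sketch below converts an availability target into the downtime it allows per year, and estimates how redundancy raises availability. The function names are illustrative, and the redundancy formula assumes the redundant components fail independently of each other, which real systems only approximate.

```python
MINUTES_PER_YEAR = 365 * 24 * 60  # 525,600 minutes in a non-leap year


def downtime_minutes_per_year(availability: float) -> float:
    """Yearly downtime (in minutes) allowed by an availability fraction in [0, 1]."""
    if not 0.0 <= availability <= 1.0:
        raise ValueError("availability must be between 0 and 1")
    return (1.0 - availability) * MINUTES_PER_YEAR


def redundant_availability(component_availability: float, copies: int) -> float:
    """Availability of `copies` redundant components, assuming independent failures.

    The group is down only when every copy is down at the same time.
    """
    if copies < 1:
        raise ValueError("copies must be at least 1")
    return 1.0 - (1.0 - component_availability) ** copies


if __name__ == "__main__":
    # 99.99% availability allows roughly 52.56 minutes of downtime per year.
    print(round(downtime_minutes_per_year(0.9999), 2))
    # Two independent 99% components together reach roughly 99.99%.
    print(round(redundant_availability(0.99, 2), 6))
```

Under the independence assumption, pairing two components that are each available 99 percent of the time is enough to reach the 99.99 percent target, which is why HA designs lean so heavily on redundancy.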

One example of a basic HA approach is hosting two identical web servers behind a load balancer that distributes traffic between them, with an extra load balancer on standby. If one of the servers fails, the balancer routes all traffic to the other; if the active balancer itself fails, the standby balancer takes over.
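The routing behaviour in this example can be sketched as a small round-robin balancer that skips servers marked as failed. This is a minimal illustration, not a production design: the class name, the server names, and the manual `mark_down` call are assumptions, and a real balancer would detect failures through periodic health checks.

```python
class LoadBalancer:
    """Round-robin balancer that only routes to servers marked healthy."""

    def __init__(self, servers):
        self._order = list(servers)
        self._healthy = {name: True for name in self._order}
        self._next = 0  # index of the next server to try

    def mark_down(self, name):
        """Record that a server has failed (stand-in for a health check)."""
        self._healthy[name] = False

    def mark_up(self, name):
        """Record that a server has recovered."""
        self._healthy[name] = True

    def route(self):
        """Return the name of the next healthy server, in round-robin order."""
        for _ in range(len(self._order)):
            name = self._order[self._next]
            self._next = (self._next + 1) % len(self._order)
            if self._healthy[name]:
                return name
        raise RuntimeError("no healthy servers available")


if __name__ == "__main__":
    lb = LoadBalancer(["web-1", "web-2"])  # hypothetical server names
    print([lb.route() for _ in range(4)])  # alternates between the two servers
    lb.mark_down("web-1")
    print([lb.route() for _ in range(3)])  # every request now goes to web-2
```

The standby load balancer from the example is not modelled here; it would be a second instance of the same component that takes over the service address when the active one stops responding.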

FAULT TOLERANCE

The ability of a system (computer, network, cloud cluster, etc.) to continue working without interruption when one or more of its components fail is referred to as fault tolerance. The goal of developing a fault-tolerant system is to reduce interruptions caused by a single point of failure, while also assuring the high availability and business continuity of mission-critical applications or systems.

Fault-tolerant systems employ backup components that automatically take the place of failing components, ensuring that there is no interruption in service. These are some examples:

Hardware systems that are supported by the same or analogous hardware systems. A server, for example, can be made fault-tolerant by operating an identical server in parallel, with all processes mirrored to the backup server.

Software systems that are backed up by other instances of software. A database containing customer information, for example, can be continually copied to another system. If the primary database fails, operations can be immediately diverted to the backup database.

Power sources that have been made fault-tolerant through the use of alternate sources. Many firms, for example, have backup generators that can take over if the main power supply fails.

Similarly, any system or component that has a single point of failure can be made fault-tolerant through the use of redundancy.
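The software example above, a database that is continually copied to a backup, can be sketched as a store that writes every change to two replicas and reads from whichever one is still alive. The class names and the `alive` flag are hypothetical, and the sketch leaves out what a real system must also handle, such as resynchronizing a replica that comes back after a failure.

```python
class Replica:
    """One copy of the data, with a flag that simulates a failure."""

    def __init__(self, name):
        self.name = name
        self.data = {}
        self.alive = True


class MirroredStore:
    """Key-value store that mirrors every write to a primary and a backup."""

    def __init__(self):
        self.primary = Replica("primary")
        self.backup = Replica("backup")

    def _live_replicas(self):
        return [r for r in (self.primary, self.backup) if r.alive]

    def put(self, key, value):
        """Write to every replica that is currently alive."""
        live = self._live_replicas()
        if not live:
            raise RuntimeError("all replicas have failed")
        for replica in live:
            replica.data[key] = value

    def get(self, key):
        """Read from the primary, or from the backup if the primary is down."""
        live = self._live_replicas()
        if not live:
            raise RuntimeError("all replicas have failed")
        return live[0].data[key]


if __name__ == "__main__":
    store = MirroredStore()
    store.put("customer:1", "Alice")
    store.primary.alive = False      # simulate a primary failure
    print(store.get("customer:1"))   # still served, now from the backup
```

Because the backup already holds every write, the caller sees no interruption when the primary fails, which is the property that separates fault tolerance from a restart-based recovery.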

SCALABILITY AND FAULT TOLERANCE

Scalability refers to the capacity to scale facilities and services on demand as the user requires. The scaling is effectively unbounded, meaning the upper limit is not known in advance. Cloud middleware is built with scalability in mind across several dimensions, such as performance, size, and load. It manages a large number of resources and users who rely on the cloud to obtain resources that they could not run on-premises without incurring administrative and maintenance costs. These costs are instead borne by whoever creates, operates, and maintains the cloud middleware and provides the service to clients.
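As a rough illustration of scaling on demand with respect to load, the sketch below computes how many identical instances are needed to serve a given request rate. The function name, the per-instance capacity, and the minimum of one instance are assumptions chosen for the example; real cloud middleware uses richer signals and policies.

```python
import math


def instances_needed(load: float, capacity_per_instance: float,
                     min_instances: int = 1) -> int:
    """Smallest number of instances that can carry `load`.

    `load` and `capacity_per_instance` use the same unit, for example
    requests per second. At least `min_instances` are always kept running.
    """
    if load < 0:
        raise ValueError("load must not be negative")
    if capacity_per_instance <= 0:
        raise ValueError("capacity_per_instance must be positive")
    return max(min_instances, math.ceil(load / capacity_per_instance))


if __name__ == "__main__":
    # 950 requests/s with 200 requests/s per instance needs 5 instances.
    print(instances_needed(950, 200))
    # With no load, the minimum of one instance is kept.
    print(instances_needed(0, 200))
```

Keeping the minimum above one instance is a simple way to combine this with fault tolerance, since a single instance is itself a single point of failure.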

In this scenario, the ability to tolerate failure is a standard requirement, and at times it becomes more important than delivering an efficient and optimized system. The general conclusion is that 'it is a difficult challenge for cloud providers to design such highly scalable and fault-tolerant systems that can get maintained while still providing competitive performance.'

VMware vSphere Fault Tolerance (FT) offers continuous availability for applications by creating a live shadow instance of a virtual machine (with up to four virtual CPUs) that mirrors the original. If a hardware failure occurs, vSphere FT automatically initiates a failover to reduce downtime and data loss. Following failover, vSphere FT creates a new Secondary virtual machine to provide ongoing protection for the application.

VMware Fault Tolerance ensures continuous availability for virtual machines by building and maintaining a Secondary VM that is identical to the Primary VM and is constantly available to replace it in the event of a failover. Fault Tolerance can be enabled for most mission-critical virtual machines. The Secondary VM is a duplicate virtual machine that runs in virtual lockstep with the Primary VM. VMware vLockstep records inputs and events on the Primary VM and forwards them to the Secondary VM, which runs on a different host. Using this information, the execution of the Secondary VM is identical to that of the Primary VM. Because the Secondary VM is in virtual lockstep with the Primary VM, it can take over execution at any time without interruption, providing fault tolerance.
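The record-and-replay idea behind lockstep execution can be illustrated with a toy deterministic state machine: every input applied to the primary is forwarded to a secondary, so both hold the same state and the secondary can take over at any point. This is only a conceptual sketch and not VMware's implementation; the class names and the `add`/`mul` events are invented for the example.

```python
class CounterVM:
    """Toy deterministic machine: the same inputs always give the same state."""

    def __init__(self):
        self.state = 0

    def apply(self, event):
        op, value = event
        if op == "add":
            self.state += value
        elif op == "mul":
            self.state *= value
        else:
            raise ValueError(f"unknown operation: {op}")


class LockstepPair:
    """Primary and secondary machines kept in step by replaying inputs."""

    def __init__(self):
        self.primary = CounterVM()
        self.secondary = CounterVM()
        self.primary_alive = True

    def handle(self, event):
        """Apply an input to the primary and forward it to the secondary."""
        if self.primary_alive:
            self.primary.apply(event)
        # The secondary replays every input, so it always mirrors the primary.
        self.secondary.apply(event)

    def fail_primary(self):
        """Simulate a host failure; the secondary becomes the active machine."""
        self.primary_alive = False

    @property
    def state(self):
        active = self.primary if self.primary_alive else self.secondary
        return active.state


if __name__ == "__main__":
    pair = LockstepPair()
    pair.handle(("add", 5))
    pair.handle(("mul", 3))
    pair.fail_primary()
    print(pair.state)  # 15: the secondary already holds the primary's state
```

The sketch stops at the failover itself; creating a replacement secondary afterwards, as described above, is left out.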
