High Availability - Do I need that?
High availability is one of the key issues when it comes to hosting and supporting IT applications. The increasing shift of business-critical systems to the cloud in particular is making high availability an even more important requirement.
But what exactly does “High Availability” mean? Highly available systems stay up and running even if critical components fail. They are particularly resilient, meaning they can handle outages without service interruptions or data loss. High availability refers to the ability to avoid unplanned outages by eliminating every Single Point of Failure (SPOF). A single point of failure is any component whose failure would cause the rest of the system to fail.
High availability is usually measured as a percentage of uptime, and the number of “nines” indicates the degree of high availability. For example, four “nines” represent a system that is available 99.99% of the time, which corresponds to a maximum downtime of about 52.6 minutes over an entire year.
If a system outage is expected to cause significant financial loss or other damage, implementing a highly available system is strongly recommended.
Let’s do the math again:
| Availability | Possible Downtime / Year |
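The downtime for any availability level follows directly from the percentage. As a minimal sketch in Python: the function name and the list of availability levels are chosen here for illustration, and the year is assumed to have 365.25 days on average, which matches the 52.6 minutes quoted above for four nines.

```python
# Assumption: an average year of 365.25 days (accounts for leap years).
MINUTES_PER_YEAR = 365.25 * 24 * 60


def downtime_minutes_per_year(availability_percent: float) -> float:
    """Return the maximum downtime per year, in minutes, for a given availability."""
    if not 0.0 <= availability_percent <= 100.0:
        raise ValueError("availability must be between 0 and 100 percent")
    return (1 - availability_percent / 100) * MINUTES_PER_YEAR


# Illustrative availability levels ("two nines" up to "five nines").
for level in (99.0, 99.9, 99.99, 99.999):
    minutes = downtime_minutes_per_year(level)
    print(f"{level}% -> {minutes:.1f} minutes ({minutes / 60:.2f} hours) per year")
```

Each additional “nine” cuts the permitted downtime by a factor of ten, which is why every further step tends to be considerably harder to achieve.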
When new products are introduced, choosing a software provider and an application design that support high availability is often the biggest challenge. Small companies in particular tend to attach little importance to these aspects, or to extensive 24/7 system support. As a result, the system architecture may turn out to be incompatible with the company’s high availability solutions, which in turn leads to high costs.
We will show you what such a highly available cloud environment can look like and what you should consider before implementation, using our partner AWS as an example.
High Availability in the AWS Cloud
These elements help you implement highly available systems:
Redundancy — ensuring that critical system components have another identical component with the same data, that can take over in case of failure.
Monitoring — identifying problems in production systems that may disrupt or degrade service.
Failover — the ability to switch from an active system component to a redundant component in case of failure, imminent failure, degraded performance or functionality.
Failback — the ability to switch back from a redundant component to the primary active component, when it has recovered from failure.
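How these four elements interact can be illustrated with a small Python sketch. It is a simplified model, not an AWS mechanism: the `Node` and `ActiveStandbyPair` classes are hypothetical names, and the health flag stands in for whatever a real monitoring system would report.

```python
from dataclasses import dataclass


@dataclass
class Node:
    """A system component; `healthy` stands in for a real health check result."""
    name: str
    healthy: bool = True


class ActiveStandbyPair:
    """Redundancy: a primary node plus an identical standby node."""

    def __init__(self, primary: Node, standby: Node) -> None:
        self.primary = primary
        self.standby = standby
        self.active = primary

    def check(self) -> Node:
        """Monitoring step: inspect node health, then fail over or fail back."""
        if self.active is self.primary and not self.primary.healthy:
            # Failover: switch to the standby, but only if it is usable.
            if self.standby.healthy:
                self.active = self.standby
        elif self.active is self.standby and self.primary.healthy:
            # Failback: the primary has recovered, so return to it.
            self.active = self.primary
        return self.active


pair = ActiveStandbyPair(Node("primary"), Node("standby"))
pair.primary.healthy = False
print(pair.check().name)  # the standby takes over
pair.primary.healthy = True
print(pair.check().name)  # traffic returns to the primary
```

In a real environment the health check would run continuously, and failback is often delayed or triggered manually to avoid switching back and forth while the primary is still unstable.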
To ensure the highest availability possible, cloud service providers usually operate a large number of data centers worldwide and assign them to different regions and Availability Zones (AZs). AWS customers currently have access to 55 Availability Zones in 18 geographic regions, with each region consisting of multiple Availability Zones. Each Availability Zone is a physically separate, isolated location that includes at least one data center with independent power supply, cooling and network connection. This structure lets cloud customers control where data and applications are hosted, so you can build the foundation for high-availability services on redundant infrastructure.
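Why spreading a workload across several Availability Zones helps can be estimated with a simple formula: if each copy is available with probability a and failures are independent, at least one of n copies is available with probability 1 - (1 - a)^n. The Python sketch below applies it. The independence assumption is a simplification, and the 99% figure is purely illustrative, not an AWS guarantee.

```python
def combined_availability(single: float, copies: int) -> float:
    """Availability of `copies` redundant components, each with availability
    `single` (as a fraction between 0 and 1).

    Assumption: the components fail independently of each other.
    """
    if not 0.0 <= single <= 1.0:
        raise ValueError("single must be a fraction between 0 and 1")
    if copies < 1:
        raise ValueError("copies must be at least 1")
    # The system is down only if every copy is down at the same time.
    return 1 - (1 - single) ** copies


# Illustrative value: one deployment that is available 99% of the time.
for zones in (1, 2, 3):
    print(f"{zones} zone(s): {combined_availability(0.99, zones):.6f}")
```

In practice failures are never fully independent (a faulty software release affects every zone at once), so the result should be read as an upper bound rather than a prediction.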
Compute, Databases and Storage
AWS helps you achieve high availability for your cloud workloads in different dimensions, e.g.:
- Compute Cloud: Amazon EC2 (Elastic Compute Cloud) and other services that let you provision computing resources provide high availability features such as load balancing, auto-scaling and provisioning across Availability Zones, which are physically separate locations within a region.
- SQL databases: Amazon RDS (Relational Database Service) and other managed SQL databases provide options for automatically deploying databases with a standby replica in a different AZ.
- Storage services: Amazon storage services, such as S3 (Simple Storage Service), EFS (Elastic File System) and EBS (Elastic Block Store), provide built-in high availability options. S3 and EFS automatically store data across different AZs, while EBS snapshots can be used to restore volumes in a different AZ.
However, all other aspects of your infrastructure can also be designed for high availability and adapted to your requirements.
The required level of high availability determines the steps you need to take throughout the entire lifecycle of an application to ensure the corresponding uptime. There are a few guidelines to follow:
- Design the system so that there is no Single Point of Failure (SPOF).
- Use automatic monitoring, fault detection, and failover mechanisms for both stateless and stateful system components.
- Single Points of Failure are typically eliminated with an N+1 or 2N redundancy configuration, where N+1 is achieved by load balancing between active-active nodes and 2N is achieved by a pair of nodes in active-standby configuration.
- AWS provides several methods to achieve High Availability through both approaches, such as a scalable, load-balanced cluster or by deploying an active-standby pair.
- Evaluate and test system availability according to the highest standards.
- Prepare workflows for manual mechanisms to respond to, mitigate, and recover from downtime.
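The difference between the N+1 and 2N configurations mentioned in the guidelines shows up clearly in the number of nodes each one requires. The following Python sketch assumes identical nodes and a known peak load. The function name and the example figures are hypothetical.

```python
import math


def nodes_required(peak_load: float, node_capacity: float, scheme: str) -> int:
    """Return the node count for a redundancy scheme.

    N is the minimum number of identical nodes needed to carry the peak load.
    "N+1" adds one spare node; "2N" duplicates the whole set.
    """
    if peak_load <= 0 or node_capacity <= 0:
        raise ValueError("peak_load and node_capacity must be positive")
    n = math.ceil(peak_load / node_capacity)
    if scheme == "N+1":
        return n + 1
    if scheme == "2N":
        return 2 * n
    raise ValueError("scheme must be 'N+1' or '2N'")


# Hypothetical workload: 250 requests/s peak, 100 requests/s per node (N = 3).
print(nodes_required(250, 100, "N+1"))  # load-balanced cluster with one spare
print(nodes_required(250, 100, "2N"))   # full active-standby duplicate
```

For a workload that fits on a single node (N = 1), both schemes come out at two nodes; they differ only in whether the second node actively serves traffic or waits on standby.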
Because an incident can still occur despite all these precautions, it’s advisable to work out a plan in advance with the steps for resuming operations, a so-called “disaster recovery plan”. This plan should ensure that data and important IT services can be fully restored after a disruption. Disaster recovery can also be purchased from your cloud provider in the form of Disaster Recovery-as-a-Service (DRaaS). AWS groups its disaster recovery strategies into four broad categories (backup and restore, pilot light, warm standby, and multi-site active/active), ranging from low-cost and low-complexity backup creation to more complex strategies with multiple active regions.