Achieving a high-availability cloud architecture requires more than one cloud. From an architecture perspective, there are only three options for mission-critical systems: multi-cloud, hybrid cloud, or hybrid multi-cloud. A single cloud is a single point of failure. To achieve high availability, redundancy is required everywhere so that no single point of failure remains.
While there are many things that result in losing access to the cloud, let me give you three major examples.
First, the cloud can go down. News from the past 12 months is full of headlines on major cloud outages. In many cases, these outages affect more than one availability zone or region. Network architects design a network across two providers, never one. It is what we have done for decades and there is no reason to change it now.
A second reason clouds fail is related to a component called the control plane that is in the orchestration part of the cloud. The control plane makes everything work. If a misconfiguration, bug, or hacking event takes the control plane down, the cloud goes down, which means all of its customers go down.
Hacking is the third major factor that can lead to cloud failure. A massive distributed denial of service (DDoS) attack has the potential to take down the entire cloud. TechRepublic reports that DDoS attacks were at an all-time high in the first three months of 2022, nearly five times higher than they were for the same period in 2021. When an attack is successful, we will suffer if we are not using a multi-cloud or hybrid-cloud strategy.
For a cloud to be considered reliable, it must provide “five nines” availability. This means the cloud is available 99.999 percent of the time, which allows roughly 5 minutes and 15 seconds of downtime per year. In practice, things happen that make this impossible. For example, in 2021 AWS, Google, and Azure all had major outages and did not provide 99.999 percent availability. When these clouds went down, so did the cloud providers' customers.
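The five-nines figure is simple arithmetic, and a short sketch makes the downtime budget for other targets easy to compare. The function names below are illustrative, and a 365-day year is assumed.

```python
def allowed_downtime_seconds(availability_percent: float,
                             days_per_year: float = 365.0) -> float:
    """Return the yearly downtime budget, in seconds, for an availability target."""
    seconds_per_year = days_per_year * 24 * 60 * 60
    return seconds_per_year * (1 - availability_percent / 100)


def format_duration(seconds: float) -> str:
    """Format a number of seconds as hours, minutes, and whole seconds."""
    minutes, secs = divmod(round(seconds), 60)
    hours, minutes = divmod(minutes, 60)
    return f"{hours}h {minutes}m {secs}s"


# Compare three common availability targets.
for target in (99.9, 99.99, 99.999):
    budget = allowed_downtime_seconds(target)
    print(f"{target}% available -> {format_duration(budget)} of downtime per year")
```

At 99.999 percent the budget works out to about 315 seconds, which is the 5 minutes and 15 seconds quoted above.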
There is a simple fix for this problem: don't put all your eggs in one basket. Use multiple clouds so that when one cloud fails, the customer is still up and running. My favorite architecture for organizations with mission-critical needs involves at least one public cloud provider and a data center running an OpenStack or Nutanix private cloud. For even greater availability needs, the solution is a private cloud connected to at least two public clouds. This makes the architecture both hybrid cloud and multi-cloud.
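The benefit of a second cloud can be estimated with a back-of-the-envelope calculation. The sketch below assumes that the environments fail independently and that failover between them is instantaneous. Both assumptions are optimistic, since shared dependencies such as DNS or a common software bug can take down more than one environment at once. The function name is illustrative.

```python
def combined_availability(availabilities):
    """Availability of a system that is up while at least one member is up.

    Assumes independent failures: the system is down only when every
    member is down at the same time.
    """
    probability_all_down = 1.0
    for availability in availabilities:
        probability_all_down *= (1.0 - availability)
    return 1.0 - probability_all_down


# Two environments that each deliver "three nines" (99.9 percent).
single = 0.999
pair = combined_availability([single, single])
print(f"one environment:  {single:.6f}")
print(f"two environments: {pair:.6f}")
```

Under these assumptions, two environments at 99.9 percent each combine to roughly 99.9999 percent, which is why redundancy across clouds can reach targets that no single provider delivers.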
To ensure high availability, redundancy must also be involved in connecting the organization to the cloud. High availability at the customer site requires that there be two routers connecting to each cloud provider, with each router being a high-availability router with multiple control modules and power supplies. The power supplies should be plugged into different circuits with different UPS systems and different generators for backup power. This provides redundancy in the brains of the routers and in the power supply.
Extremely robust cloud security architecture is not for everyone, primarily because it can be expensive and complex. But it is essential for organizations whose mission-critical systems demand both strong security and high availability.
From a strategy perspective, we want to make sure that we are using common security services across both clouds. This allows for a single security policy that we can deploy on everything. When we know that all of our security devices are the same brand from the same manufacturer, we know that we won’t have a problem with vendor interoperability. In addition, if we have bugs, we will have an easier time addressing them.
We are also going to use industry standards for our security rather than relying on the security that comes from the cloud provider. While cloud providers have good security products, security is not their specialty. Cisco, Palo Alto Networks, Fortinet, and Check Point have been making security appliances for decades. For many of these organizations, security is all they do. When we go with an organization that only does security, we will get more robust features than what the cloud provider offers.
When designing the security architecture, we acknowledge that much of what we are dealing with is web apps, which need to be protected from DDoS. The best thing we can do to make our web application scale and give it some protection is to use a content delivery network (CDN). For the most part, CDNs forward only legitimate requests to the servers and block the illegitimate ones. We can use a CDN together with a DDoS protection service, whether it is Cloudflare’s DDoS protection, AWS Shield, or Azure’s DDoS protection.
Next, we need firewalls. We want to use enterprise-grade, next-generation firewalls like a Cisco or Palo Alto firewall so we can have the same feature sets and functionality across the clouds and data center. However, doing that poses a problem. You can't knock on the door at AWS, go to a rack, and screw in your high availability firewalls. So, achieving enterprise-grade, robust security appliances, no matter what cloud we are on, requires having high-performance firewalls and security appliances residing on a virtual machine.
This leads to another problem: virtual machines are not high-availability devices. This means we will need to load balance between them. We will set up two enterprise firewalls on two virtual machines and use a load balancer between them to increase performance and availability.
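The load-balancing idea can be sketched as a round-robin selector that skips any firewall failing its health check. This is a simplified model, not a vendor configuration: the class and attribute names are hypothetical, and health is a flag set by hand rather than the result of a real probe. A production deployment would also need flow affinity so that both directions of a connection cross the same stateful firewall.

```python
from dataclasses import dataclass


@dataclass
class FirewallVM:
    """A virtual firewall instance; `healthy` stands in for a real health probe."""
    name: str
    healthy: bool = True


class RoundRobinBalancer:
    """Rotate across firewalls, skipping any that are marked unhealthy."""

    def __init__(self, backends):
        self.backends = list(backends)
        self._next = 0

    def pick(self) -> FirewallVM:
        # Try each backend at most once, starting from the rotation pointer.
        for _ in range(len(self.backends)):
            backend = self.backends[self._next]
            self._next = (self._next + 1) % len(self.backends)
            if backend.healthy:
                return backend
        raise RuntimeError("no healthy firewall available")


firewalls = [FirewallVM("fw-a"), FirewallVM("fw-b")]
balancer = RoundRobinBalancer(firewalls)
print([balancer.pick().name for _ in range(4)])  # traffic alternates

firewalls[0].healthy = False                      # simulate a VM failure
print([balancer.pick().name for _ in range(4)])  # the survivor takes all traffic
```

With both firewalls healthy, traffic alternates between them, which is the performance gain. When one fails, the survivor carries all traffic, which is the availability gain.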
Finally, we will need something to stop the bad actors who get through the firewall. Typically this will be an intrusion detection and prevention system. Because we are using a next-generation firewall, intrusion detection and prevention are built in. If someone gets past our firewall rules, the next-generation firewall can ideally detect the intrusion, create a new rule or reset the TCP connection, and stop the attack.
At this point, we’ve established a good foundation for a high-security cloud, but we need to think of security like it’s an onion that always has more layers to peel back. There are a host of other components we can add to move from moderate to extreme security.
For example, we can add access control lists to protect our subnets. Whether it is an AWS network access control list or an access control list on a router, we can use that to boost security. With most cloud providers, we also have the ability to protect our virtual machine with a security group, which is basically another firewall.
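To illustrate how a subnet-level access control list makes its decision, here is a minimal sketch modeled on the way AWS network ACLs work: rules are evaluated in ascending rule-number order, the first match wins, and traffic that matches nothing is denied. The rule set, addresses, and names are made up for the example.

```python
import ipaddress
from dataclasses import dataclass


@dataclass(frozen=True)
class AclRule:
    number: int        # lower numbers are evaluated first
    cidr: str          # source network the rule applies to
    port_range: tuple  # (low, high), inclusive destination ports
    allow: bool        # True = allow, False = deny


def evaluate(rules, source_ip: str, dest_port: int) -> bool:
    """Return True if inbound traffic is allowed by the first matching rule."""
    ip = ipaddress.ip_address(source_ip)
    for rule in sorted(rules, key=lambda r: r.number):
        low, high = rule.port_range
        if ip in ipaddress.ip_network(rule.cidr) and low <= dest_port <= high:
            return rule.allow
    return False  # implicit deny when no rule matches


# Hypothetical policy: block one known-bad network, allow HTTPS from anywhere.
EXAMPLE_RULES = [
    AclRule(100, "203.0.113.0/24", (0, 65535), allow=False),
    AclRule(200, "0.0.0.0/0", (443, 443), allow=True),
]

print(evaluate(EXAMPLE_RULES, "203.0.113.7", 443))   # blocked network
print(evaluate(EXAMPLE_RULES, "198.51.100.5", 443))  # HTTPS from elsewhere
print(evaluate(EXAMPLE_RULES, "198.51.100.5", 22))   # SSH matches no rule
```

Because the deny rule has the lower number, it takes effect before the broad allow rule, so ordering is part of the policy.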
We can also take security on the servers to the next level. We definitely need to put anti-malware protection on our servers and patch them against known vulnerabilities, but that's not enough. We also need to disable unnecessary services. We want to harden our systems and lock them down, and we want to encrypt data at rest. AES with 256-bit keys (AES-256) is a strong choice for this. With AWS, this requires that we enable the Key Management Service (KMS). On Azure, it's automatic.
Identity and access management (IAM) determines and keeps track of who a user is and what they are allowed to do. However, creating and managing users, especially at scale, can be complicated. To address this efficiently, most organizations adopt a directory service such as Microsoft Active Directory, AWS AD Connector, or Azure AD.
A strong identity and access management protocol will keep bad actors out while limiting and keeping track of what authorized users can do. This can keep those with system access from deploying things that do not meet security standards.
Unauthorized access can also be achieved when an employee falls victim to a social engineering scheme or a phishing email. For this reason, we need to train the organization's employees. Any effective multi-cloud security architecture strategy will involve training employees on security awareness.
Attacks happen, and the better your ability to monitor them and respond to them, the faster you can mitigate and remediate the attack. I believe the best approach to this is event-driven security, in which a particular type of attack triggers a particular type of response.
For example, when someone makes an object public that is not supposed to be public, we can have a log notification trigger a function that remediates it for us. With AWS, when Amazon CloudWatch sees something that we don't like, it kicks off a Lambda function that remediates it.
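The event-driven remediation described above could look something like the following Lambda handler sketch. It makes several assumptions. The handler name is hypothetical. The event is assumed to have the shape of an S3 API call recorded by CloudTrail and delivered through a rule, with the bucket name under `detail.requestParameters.bucketName`. Turning on the bucket's public access block is only one possible response. The `s3_client` parameter exists only so the function can be exercised without AWS credentials.

```python
def remediate_public_bucket(event, context=None, s3_client=None):
    """Block public access on the bucket named in the incoming event."""
    # Assumed event shape: a CloudTrail-recorded S3 API call.
    bucket = event["detail"]["requestParameters"]["bucketName"]

    if s3_client is None:
        import boto3  # bundled with the AWS Lambda Python runtime
        s3_client = boto3.client("s3")

    # Turn on all four public-access-block settings for the bucket, so
    # public ACLs and public bucket policies no longer grant access.
    s3_client.put_public_access_block(
        Bucket=bucket,
        PublicAccessBlockConfiguration={
            "BlockPublicAcls": True,
            "IgnorePublicAcls": True,
            "BlockPublicPolicy": True,
            "RestrictPublicBuckets": True,
        },
    )
    return {"remediated_bucket": bucket}
```

The design point is that the response is code attached to a specific class of event, so remediation happens in seconds without waiting for a human to read an alert.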
When we add monitoring and auto-remediation to our robust security architecture, we enter the next generation of security, in which the architecture repairs much of the damage caused by breaches on its own.
While the scenarios that can lead to a cloud failure are many and varied, they can all be anticipated and addressed by implementing a dependable multi-cloud architecture with robust security. Remember that a single cloud is a single point of failure, and a single point of failure does not provide the high availability that modern business requires.