Mar 3, 2023
Cloud architects must come to grips with the fact that, despite their best-laid plans, outages occur. Sure, we do everything we can to prevent them and drill on recovery capabilities during blue-sky days, but the unforeseen is always around the corner. Or, as Amazon CTO Werner Vogels more memorably put it, "everything fails all the time."
In fact, one day we may encounter a more pervasive outage than ever before. It’s best to plan accordingly.
In coining the term “Black Swan event,” the mathematician and essayist Nassim Nicholas Taleb defined it as an event that:
- is improbable;
- has an extreme impact; and
- is only conceivable after the fact.
By definition, we can’t imagine the next Black Swan event threatening to disrupt cloud environments. But, given enough time and opportunity, one is almost inevitable.
It's not only The Big One cloud teams must prepare for. They must also account for outages localized to a particular region, service provider, or service offering and have a playbook for less severe disruptions. So, how do organizations capitalize on the benefits of cloud computing – scalability, self-service, speed to market, etc. – while coping with the inevitability of disruption?
By focusing on resilience against the most severe consequences of cloud failure.
Preparation is the most important step in the incident response lifecycle. That includes architecting cloud environments for optimal capacity, availability, performance, and security.
Before explaining how that can be done, let’s briefly examine different types of service disruptions.
Catastrophic failures
Catastrophic failures are protracted and widespread outages affecting a cloud segment. They may result from a Black Swan event (such as an act of war) or from a more mundane but equally disastrous event like an introduced software bug. In fact, as is the human tendency, we often over-index on rare but dramatic risks like a plane crash compared to more likely ones such as car accidents. Similarly, in the cloud, we tend to plan for catastrophe more often than for the more likely case of performance degradation.
Blackouts
Blackouts are complete disruptions in cloud services that may result from a power outage, technical failure, DDoS attack, or human error. In 2017, a regional outage affecting a large cloud services provider (CSP) was caused by a single misentered command. Today, major providers integrate redundancies into their architecture so that no single point of failure can cause a widespread outage, but blackouts still occur. A London-based data center, for instance, suffered a blackout after multiple power sources failed and an on-site generator failed to kick in. This is why it’s important to integrate capabilities like automatic failover to route traffic around a severely impacted data center in the event of an outage.
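To make automatic failover concrete, here is a minimal Python sketch of the idea: walk a preference-ordered list of data centers and send traffic to the first one that reports healthy. The data center names and the `healthy` map are hypothetical stand-ins for whatever health checks an organization actually runs; this is an illustration, not any provider's implementation.

```python
from typing import Mapping, Optional, Sequence


def pick_data_center(
    preferred_order: Sequence[str],
    healthy: Mapping[str, bool],
) -> Optional[str]:
    """Return the first data center, in preference order, that reports healthy."""
    for dc in preferred_order:
        # A data center with no health report is treated as unhealthy.
        if healthy.get(dc, False):
            return dc
    # Nothing is healthy: the caller falls back to its disaster recovery plan.
    return None


# Hypothetical example: the London site is dark, so traffic fails over to Frankfurt.
order = ["lon-1", "fra-1", "ams-1"]
status = {"lon-1": False, "fra-1": True, "ams-1": True}
chosen = pick_data_center(order, status)
```

Treating a missing health report as a failure is a deliberate choice here: during a blackout, silence from a site is usually bad news.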
Brownouts
Brownouts are service degradations resulting from unexpected traffic spikes or sudden, unforeseen connectivity issues. One example was the severing of undersea cables off the coast of Marseille last fall. Typically, these instances are handled by rerouting traffic around the service degradation. Brownouts can be complex to diagnose and manage since they affect customers differently – not all experience the same degree of service impairment. But SD-WAN dynamic path selection allows traffic to be routed around trouble spots when networking teams detect above-average latency.
Ensuring continuity through disruption and disaster
Distributed architecture plays a major role in advancing organizational resilience. When hosting data and applications in the public cloud, it’s important to hedge against risk by relying on multiple regions and availability zones (AZs). For instance, hosting too many applications in the same AZ raises the chances of a localized outage being excessively costly for an organization. Architects must understand where their data resides and leverage failover capabilities in order to protect against regional incidents.
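One way to check for over-concentration is to measure what share of applications sits in each AZ and flag any zone above a chosen limit. The Python sketch below assumes a simple application-to-AZ inventory; the application names, zone labels, and the 50% limit are illustrative assumptions, not recommendations.

```python
from collections import Counter
from typing import Dict, List, Mapping


def az_concentration(placements: Mapping[str, str]) -> Dict[str, float]:
    """Map each AZ to the fraction of applications hosted in it."""
    counts = Counter(placements.values())
    total = sum(counts.values())
    if total == 0:
        return {}
    return {az: n / total for az, n in counts.items()}


def overloaded_azs(placements: Mapping[str, str], max_share: float = 0.5) -> List[str]:
    """Return the AZs hosting more than max_share of all applications."""
    shares = az_concentration(placements)
    return sorted(az for az, share in shares.items() if share > max_share)


# Hypothetical inventory: three of four applications share a single zone.
inventory = {
    "billing": "us-east-1a",
    "search": "us-east-1a",
    "auth": "us-east-1a",
    "reports": "us-east-1b",
}
flagged = overloaded_azs(inventory)
```

A count of applications is a crude proxy. Weighting each application by revenue or criticality would give a better picture of what a localized outage would actually cost.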
Nevertheless, CSP-side disruptions remain a looming possibility. In this case, CXOs must ensure they have a range of solutions at their disposal for supporting business continuity. The goal is continuous availability for mission-critical applications and resilience – including graceful service degradation when an application fails.
At the least severe end of the spectrum, users may experience brownouts more as instability than as outages. This can make the root cause of performance degradations difficult to diagnose. But solutions that proactively query SaaS applications and report on performance feedback can be instrumental in deciding whether poor performance merits addressing.
Resilient cloud architecture should allow for high customer autonomy when addressing brownouts. Dynamic, performance-based solutions can proactively alert network teams when the digital experience score for a SaaS application drops. Based on the results of continuous probing of core business applications – OneDrive, Gmail, Box, etc. – for HTTP latency, advanced performance monitoring can autonomously establish new connectivity tunnels that provide an optimal traffic path.
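The probing logic can be sketched in a few lines of Python. The example keeps a rolling window of HTTP latency samples for one application, flags degradation when the window's median crosses a threshold, and picks the lowest-latency tunnel as the replacement path. The class, the threshold, and the tunnel names are all hypothetical, and the samples are passed in rather than measured, so this is a sketch of the decision logic rather than a working monitor.

```python
from collections import deque
from statistics import median
from typing import Deque, Mapping


class LatencyMonitor:
    """Tracks recent probe latencies for one application."""

    def __init__(self, threshold_ms: float, window: int = 5) -> None:
        self.threshold_ms = threshold_ms
        self.samples: Deque[float] = deque(maxlen=window)

    def record(self, latency_ms: float) -> bool:
        """Store a probe result; return True if the application looks degraded.

        Degraded means the window is full and its median exceeds the threshold,
        so a single slow probe does not trigger a reroute.
        """
        self.samples.append(latency_ms)
        window_full = len(self.samples) == self.samples.maxlen
        return window_full and median(self.samples) > self.threshold_ms


def best_path(path_latency_ms: Mapping[str, float]) -> str:
    """Return the name of the path with the lowest measured latency."""
    return min(path_latency_ms, key=path_latency_ms.get)


# Hypothetical probe results in milliseconds: latency climbs partway through.
monitor = LatencyMonitor(threshold_ms=200, window=3)
alerts = [monitor.record(ms) for ms in [120, 130, 500, 600, 650]]
new_path = best_path({"tunnel-a": 480.0, "tunnel-b": 95.0})
```

Using the median of a window rather than the latest sample is one simple way to avoid flapping between paths on a single outlier.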
Connectivity issues rising to blackout designation should be addressed by even greater traffic routing flexibility. Customer-controlled exclusion capabilities allow IT teams to route traffic around data centers experiencing connectivity issues. These can be customized to automatically restore optimal traffic flows once the outage is resolved. This mix of manual networking flexibility and automated, snap-back functionality advances performance without overburdening staff.
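The combination of manual exclusion and automatic snap-back can be modeled as a small state machine. In the Python sketch below, an operator excludes a data center by hand, and it rejoins the pool on its own after a set number of consecutive healthy checks. The class name, the data center labels, and the restore count are assumptions made for illustration only.

```python
from typing import Dict, List


class ExclusionList:
    """Manually excluded data centers that restore themselves once healthy."""

    def __init__(self, healthy_checks_to_restore: int = 3) -> None:
        self.required = healthy_checks_to_restore
        # Excluded data center -> consecutive healthy checks seen so far.
        self.streak: Dict[str, int] = {}

    def exclude(self, dc: str) -> None:
        """Operator action: stop routing traffic to this data center."""
        self.streak[dc] = 0

    def report_health(self, dc: str, healthy: bool) -> None:
        """Feed in a health check; snap back after enough healthy results in a row."""
        if dc not in self.streak:
            return  # not excluded, nothing to track
        self.streak[dc] = self.streak[dc] + 1 if healthy else 0
        if self.streak[dc] >= self.required:
            del self.streak[dc]  # restore normal routing

    def eligible(self, data_centers: List[str]) -> List[str]:
        """Return the data centers that may currently receive traffic."""
        return [dc for dc in data_centers if dc not in self.streak]


# Hypothetical run: London is excluded, flaps once, then recovers.
sites = ["lon-1", "fra-1"]
exclusions = ExclusionList(healthy_checks_to_restore=2)
exclusions.exclude("lon-1")
during_outage = exclusions.eligible(sites)
for check in [True, False, True, True]:
    exclusions.report_health("lon-1", check)
after_recovery = exclusions.eligible(sites)
```

Resetting the streak on any failed check keeps a flapping site out of rotation until it has been stable for the full count.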
In truly disastrous scenarios, where an organization may have trouble accessing a CSP like Zscaler, there are still ways of bypassing that service while maintaining an acceptable security posture. And here, we come back to planning as the critical element in mitigating the worst effects of a catastrophic outage.
To take Zscaler as an example, disaster recovery allows traffic to bypass the Zscaler cloud and connect to a public service edge in the customer’s local data center or in a public cloud. There, the most recently updated security policies are still applied without disrupting the business. This allows our customers to continue connecting users to applications previously defined as trusted and allowed, regardless of the extent of the outage.
No substitute for preparation
Just because unplanned outages are a near certainty over the long term doesn’t mean they have to massively disrupt business operations. Prioritizing ample data center capacity, smart failover capabilities, flexible traffic routing, and planning for worst-case scenarios can help ensure major and minor outages don’t become extinction-level events.
Netflix became the model of this practice with its commitment to chaos engineering. By intentionally deleting cloud server instances, the company simulated a common problem that could lead to service disruptions for customers. As a result, the streaming giant’s engineering teams could be confident in the availability and performance of its infrastructure, even in the face of cloud-based hiccups.
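The core of such an experiment is easy to sketch, even if the real thing is far more involved. The Python example below removes one instance at random from a copy of a pool and checks whether enough capacity remains. It is a toy simulation with hypothetical instance names, not a representation of Netflix's tooling, and it touches no real infrastructure.

```python
import random
from typing import List


def terminate_random_instance(instances: List[str], rng: random.Random) -> str:
    """Remove one randomly chosen instance from the pool and return its name."""
    victim = rng.choice(instances)
    instances.remove(victim)
    return victim


def survives_instance_loss(
    instances: List[str],
    min_required: int,
    rng: random.Random,
) -> bool:
    """Simulate losing one instance and report whether capacity is still sufficient."""
    pool = list(instances)  # work on a copy so the real inventory is untouched
    terminate_random_instance(pool, rng)
    return len(pool) >= min_required


# Hypothetical pool: three instances, of which the service needs at least two.
fleet = ["i-1", "i-2", "i-3"]
resilient = survives_instance_loss(fleet, min_required=2, rng=random.Random(0))
fragile = survives_instance_loss(fleet, min_required=3, rng=random.Random(0))
```

Passing in a seeded `random.Random` makes a run repeatable, which helps when a failed experiment needs to be replayed.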
Practices like these show that optimal resilience requires configuration and practice – before you’re forced to implement your organization’s disaster recovery plan.