Zscaler Blog
Zscaler Business Continuity in the Wake of Major IT and Security Outages
July 2024 brought a busier few weeks than usual for IT and security professionals, executives, and boards. On July 18, a Microsoft Azure Central US outage impacted dozens of Azure services. The very next day, July 19, a CrowdStrike outage disrupted Windows machines (clients and servers) globally with the notorious blue screen of death (BSOD).
The Zscaler Zero Trust Exchange is a mission-critical service, so following these incidents we received a flurry of calls from concerned customers who wanted to understand whether Zscaler was exposed to a risk similar to CrowdStrike’s. We advised our customers and prospects at length in an oversubscribed series of CXO briefings and a public Preventing Cloud Outages Webinar, and we address the key points and most common questions in this post:
1. “Can this happen to Zscaler?”
2. “Does Zscaler Client Connector run in kernel space?”
3. “How does Zscaler handle upgrades?”
4. “How does Zscaler approach cloud resilience?”
5. “What advice do you have for improving operational resilience?”
What happened in the CrowdStrike and Azure Outages?
In the cloud world, physical failures (which are often much easier to imagine) are responsible for outages less often than code bugs and configuration changes. Indeed, both the July 18 and July 19 outages were caused by changes pushed by the providers. Microsoft pushed an incomplete allow list change that blocked trusted VM hosts from critical access to Azure Storage in the Central US region. In the case of CrowdStrike, a faulty rapid response content update (a behavioral signature) was pushed all at once to 8.5 million endpoints, triggering a memory exception that caused a BSOD.
Can this happen to Zscaler?
No cloud vendor is immune to issues, but Zscaler has multiple safeguards in place that dramatically lower the risk. We do have our own endpoint agent, but its function and the types of updates we push to it are very different from those of endpoint security vendors:
- Zscaler customers initiate their own upgrades. It’s been our philosophy from day one that Zscaler should not force client upgrades or auto-push sensitive updates to the endpoints. The customer is in full control. To test and validate updates in your environment, you can customize your Client Connector App Store update settings to apply updates only to specific user groups and automatically slow roll them over 7 days.
- Most updates are done in the cloud, not on the endpoint agent. Client Connector is a lightweight agent that is responsible for authentication and for intercepting and forwarding traffic to the cloud, where policy is enforced. It runs neither complex security detections nor policy enforcement, whether in kernel space or in user space.
- Zscaler tests updates in our own environment first. We drink our own champagne—Zscaler’s own IT is the first adopter for every new release candidate (even prior to GA).
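The first bullet's idea of limiting an update to specific user groups and slow rolling it over 7 days can be illustrated with a minimal sketch. This is not Zscaler's implementation: the function names, the hash-based bucketing, and the `pilot_groups` parameter are all assumptions made for illustration. Only the 7-day window and the per-group targeting come from the text above.

```python
import hashlib

ROLLOUT_DAYS = 7  # the 7-day window is from the post; the mechanism below is assumed


def rollout_day(device_id: str, release: str, days: int = ROLLOUT_DAYS) -> int:
    """Map a device to a stable day (1..days) on which it becomes eligible.

    Hashing the release together with the device ID gives every device the
    same day each time it is evaluated, and reshuffles devices per release.
    """
    digest = hashlib.sha256(f"{release}:{device_id}".encode()).digest()
    bucket = int.from_bytes(digest[:8], "big") % days
    return bucket + 1


def is_eligible(device_id: str, group: str, release: str,
                pilot_groups: set[str], current_day: int) -> bool:
    """A device updates only if its group is opted in and its day has arrived."""
    if group not in pilot_groups:
        return False
    return rollout_day(device_id, release) <= current_day
```

With this scheme the eligible population can only grow from one day to the next, and devices outside the opted-in groups never receive the update, which leaves time to stop a rollout before it reaches everyone.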
Does Zscaler Client Connector run in kernel space?
Most of the Windows Client Connector processes run in user space. However, certain functionality, such as traffic interception and anti-tampering (e.g., preventing malware from stopping Client Connector), can only operate in a kernel driver architecture.
One of Client Connector’s primary functions is to intercept outbound traffic generated by local applications and redirect the IP packets through a ZIA tunnel or ZPA tunnel to the Zscaler cloud. In ZIA specifically, the drivers for traffic interception in Tunnel mode have to run in the kernel space (other interception methods such as Tunnel with Local Proxy do not require kernel drivers). In ‘ZIA Tunnel Mode - Route Based’, Client Connector runs a Virtual Network Adapter driver and in ‘ZIA Tunnel Mode - Packet-Filter Based,’ it runs a Windows lightweight filter driver (LWF) that forwards packets to a dedicated Client Connector process based on customer-defined filters.
Figure: Zscaler Client Connector in Tunnel Mode with Windows Packet Filter
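To make the idea of forwarding packets "based on customer-defined filters" concrete, here is a conceptual sketch of a first-match filter decision. It is a user-space illustration only: a real lightweight filter driver runs in the Windows kernel and is not written in Python. The `Filter` schema, the `decide` function, and the "tunnel" and "bypass" action names are hypothetical and do not describe Client Connector's actual filter format.

```python
import ipaddress
from dataclasses import dataclass
from typing import List, Optional


@dataclass(frozen=True)
class Filter:
    network: str          # destination CIDR, e.g. "10.0.0.0/8"
    port: Optional[int]   # destination port, or None to match any port
    action: str           # "tunnel" or "bypass" (hypothetical action names)


def decide(dst_ip: str, dst_port: int, filters: List[Filter],
           default: str = "tunnel") -> str:
    """Return the action of the first filter that matches the destination."""
    addr = ipaddress.ip_address(dst_ip)
    for f in filters:
        in_net = addr in ipaddress.ip_network(f.network)
        port_ok = f.port is None or f.port == dst_port
        if in_net and port_ok:
            return f.action
    return default  # no filter matched
```

The point of the sketch is how little logic the interception layer needs: it matches a packet against a short rule list and hands it off, while policy decisions stay in the cloud.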
Several safeguards work together to prevent costly endpoint outages: customer-controlled slow rollouts; client software changes that are infrequent compared with security content updates; a lightweight Client Connector and lightweight traffic interception drivers (all policy is enforced in the cloud, and Client Connector neither runs nor auto-downloads security detections); and rigorous testing.
How does Zscaler handle upgrades?
In addition to the client updates described above, Zscaler also performs two other types of updates: cloud security feed updates and cloud software upgrades. Balancing velocity with risk mitigation is our core design principle for all types of changes.
To keep ahead of adversaries and improve detection efficacy, Zscaler continuously pushes dozens of security feeds, signatures, behavior detections, and ML model updates to protect our end users. Time is of the essence in these updates, but deploying safely is just as crucial. We perform a combination of early soaking and canary builds.
On the cloud software side, we follow a similar ‘rings of protection’ model where updates are slowly rolled out in a canary fashion over the course of several weeks.
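A ring-based canary rollout of the kind described above can be sketched as a loop that deploys to one ring at a time and halts when a health check fails. This is a generic illustration, not Zscaler's tooling: the `Ring` type, the `deploy` and `error_rate` callbacks, and the 1% threshold are assumptions.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class Ring:
    name: str
    nodes: List[str]


def roll_out(rings: List[Ring],
             deploy: Callable[[str], None],
             error_rate: Callable[[Ring], float],
             max_error_rate: float = 0.01) -> List[str]:
    """Deploy ring by ring; stop at the first ring that looks unhealthy.

    Returns the names of the rings that were deployed and passed the gate.
    """
    completed: List[str] = []
    for ring in rings:
        for node in ring.nodes:
            deploy(node)
        # In practice a soak period would elapse here before measuring health.
        if error_rate(ring) > max_error_rate:
            break  # halt: later (larger) rings never receive the change
        completed.append(ring.name)
    return completed
```

Ordering the rings from smallest to largest means a faulty change is caught while its blast radius is still small, which is the opposite of pushing an update to every endpoint at once.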
How does Zscaler approach cloud resilience?
Business and operational continuity is not just a kernel issue. Delivering a mission-critical cloud service demands a holistic approach:
- Running a trusted production cloud architected from the get-go for scale, performance, security and reliability.
- Running a resilient cloud management life cycle. In the famous words of Andy Jassy, “There is no compression algorithm for experience.”
- Providing tools to handle all failure scenarios—blackouts, brownouts, and catastrophic failures.
Learn more about our approach on our resilience webpage and in the preventing cloud outages webinar.
What advice do you have for improving operational resilience?
- For Zscaler deployments, implement our business continuity capabilities and best practices, including disaster recovery (DR) for ZIA and ZPA, and perform Zscaler resilience audits with your Technical Account Managers.
- Consolidate on best-of-breed, integrated leading platforms, but spread critical functions across them so that key services stay available if one platform goes down. Specifically, we recommend not putting all your eggs in one basket when it comes to endpoint, identity, zero trust access, and application providers (including hyperscalers).
- Stagger rollouts of software and security updates to mitigate risks. Soak test updates with a small group of users to ensure that everything works before rolling out to additional groups.
- Review your critical vendors’ architectural and operational resilience.
A note about Zscaler’s customer-controlled business continuity offering
Customers, especially those in regulated industries, often ask us, “We trust Zscaler and we trust your investments in building the most trusted security cloud, but how should we plan for a force majeure event, as unlikely as it may be?” To address this use case, 18 months ago Zscaler launched industry-first customer-controlled disaster recovery (DR) capabilities, which have been widely adopted by our customer base.
We are glad to announce an upcoming evolution to this offering: fully managed and customer self-hosted private business continuity clouds for ZIA and ZPA.
The customer-hosted DR cloud will ensure consistent functionality even during a business continuity event, and the fully managed DR solution will provide the same consistent functionality while also taking away the deployment overhead.
A note about business continuity for endpoint outages
Naturally, this event has led organizations to evaluate ways to boost operational resilience, for Windows endpoints in particular. Adopting Zscaler’s Cloud Browser Isolation (CBI) and Privileged Remote Access (PRA) for secure, agentless BYOD access is a great way to achieve that proactive resilience without compromising security.
In the event of an endpoint outage, users can switch over to their personal (or alternative) devices and maintain access to the most critical SaaS apps, private web-based apps, and RDP, SSH, or VNC systems without compromising on security or risking data leakage.
Reach out to us to learn more about our new Business Continuity offerings and Cloud Browser Isolation for endpoint business continuity! Contact us here or reach out to your account team.