Embrace chaos to improve cloud infrastructure resilience

Netflix is a champion of using chaos engineering to improve the resilience of its cloud infrastructure. That’s how it ensures its customers don’t have their Stranger Things binge-watching sessions interrupted. Netflix is one of a growing number of companies, including Nike, Amazon and Microsoft, that leverage chaos engineering to stress test their cloud infrastructures against a variety of unpredictable cloud events, such as a loss of cloud resources or entire regions.

This has enabled them to create highly resilient infrastructure environments and ensure reliable application delivery. In doing so, they have also created a model any organization can follow to improve the security of its cloud computing services and APIs.

The number one cause of cloud data breaches is infrastructure misconfiguration, whether due to human error, a lack of policy controls in CI/CD pipelines, or bad actors. Modern cloud threats use automation to find and exploit these misconfiguration vulnerabilities before traditional scan and alert tools can identify them. In order to become more proactive and prevent these threats from doing any damage, an organization needs to simulate real-world misconfigurations to identify security gaps before they are exploited.

If you’ve been running in the cloud at scale, you’re familiar with the challenge of constantly monitoring for security risks created by resources without known owners, misconfigurations, and human errors like leaving too much access in place after a maintenance event.

In fact, even if your organization has not yet migrated the majority of its IT systems to the cloud, you’re likely still familiar with these risks, because they make news headlines every time data is leaked due to an S3 misconfiguration. The root cause is typically self-inflicted: not getting all the details of your cloud configuration right.

Sounds chaotic, and it is. Yet, the remedy is to embrace the chaos.

You cannot eliminate chaos. It’s constant and resilience is the only defense. Truly resilient cloud systems recover, no matter the source or nature of any damage. Achieving resiliency requires establishing a known-good system to constantly monitor itself and automatically revert to its known-good state whenever it is damaged.

In the cloud shared responsibility model, the security team’s chief responsibility becomes protecting the service configuration layer. Cloud services talk to each other via APIs, and the newer ones use identity to configure access, as opposed to the older method of restricting access by IP address space. The network perimeter is defined via SDN and security group configurations. Unlike in the data center, configuration changes to your basic security posture are made via API and are subject to frequent change for many reasons. IT’s goal is to establish a configuration of these services that is resilient in the face of unpredictable damage – chaos.

This requires a mechanism to revert damaging changes to your cloud configurations back to the healthy ones. There are several ways to accomplish this. One is to write specific remediation scripts, although that requires you to predict what will go wrong and when, which, as I’ve established, is unrealistic.

The more effective option is to implement self-healing configuration, i.e., capture a known-good baseline and leverage an engine that knows how to revert all mutable changes. Automating the process relieves the security team of the burden of manually monitoring for and remedying any potentially damaging changes to the environment.
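The core of a self-healing engine is simple: snapshot a known-good baseline, diff the live configuration against it, and revert anything that drifted. Here is a minimal sketch in Python, using plain dictionaries to stand in for a cloud service’s configuration; the function names and the example “security group” structure are illustrative, not a real cloud SDK:

```python
import copy

def capture_baseline(config):
    """Snapshot the known-good configuration."""
    return copy.deepcopy(config)

def detect_drift(baseline, current):
    """Return the keys whose values differ from the baseline,
    plus any keys added outside the baseline."""
    changed = {k for k in baseline if current.get(k) != baseline[k]}
    added = {k for k in current if k not in baseline}
    return changed | added

def revert(baseline, current):
    """Restore every drifted key to its baseline value and drop additions."""
    for key in detect_drift(baseline, current):
        if key in baseline:
            current[key] = copy.deepcopy(baseline[key])
        else:
            del current[key]
    return current

# Illustrative "security group" configuration
baseline = capture_baseline({"ingress": [{"port": 443, "cidr": "10.0.0.0/8"}]})
live = copy.deepcopy(baseline)
live["ingress"].append({"port": 22, "cidr": "0.0.0.0/0"})  # damaging change
revert(baseline, live)
assert live == baseline  # the engine restored the known-good state
```

In a real deployment, `detect_drift` would be fed by the cloud provider’s configuration APIs or change events, and `revert` would call the corresponding update APIs, but the control loop is the same.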

Regular testing to determine whether those automations are working should not focus on whether compute resources reappear after deletion, but rather examine what happens if an IAM policy or security group definition is changed. The list doesn’t stop there. You should also test S3 bucket configurations, VPC/network configuration, password policies, and more. Resilient security demands covering every vulnerability an attacker may try to exploit.
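One way to keep that coverage honest is to parameterize the chaos test over resource types: for each one, apply a damaging mutation and verify the remediation engine restores the baseline. The sketch below is hypothetical and mutates in-memory dictionaries; a real test would perform the mutations through the cloud provider’s APIs:

```python
# Illustrative damaging mutations per resource type; real tests would
# make these changes via the cloud provider's APIs instead.
MUTATIONS = {
    "iam_policy":     lambda cfg: cfg.update(effect="Allow", action="*"),
    "security_group": lambda cfg: cfg.update(ingress="0.0.0.0/0"),
    "s3_bucket":      lambda cfg: cfg.update(public_access=True),
}

def chaos_test(resource_type, live_config, baseline, remediate):
    """Apply a damaging mutation, then verify remediation restores the baseline."""
    MUTATIONS[resource_type](live_config)
    assert live_config != baseline, "mutation should create drift"
    remediate(live_config, baseline)
    assert live_config == baseline, f"{resource_type} was not remediated"

def simple_remediate(live_config, baseline):
    """Stand-in for the self-healing engine: restore the baseline wholesale."""
    live_config.clear()
    live_config.update(baseline)

baseline = {"public_access": False, "ingress": "10.0.0.0/8",
            "effect": "Deny", "action": "s3:GetObject"}
for rtype in MUTATIONS:
    chaos_test(rtype, dict(baseline), baseline, simple_remediate)
```

Adding a new resource type to the test suite is then just adding an entry to the mutation table, which keeps coverage growing with your environment.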

The first step to introducing resilience into your organization’s cloud security posture is to start small by focusing on a single resource type like S3 buckets or Security Groups, then move to adopt a more robust tool to cover a larger scope.

You’ll need a method for introducing the chaos. I’m not aware of any good tools that do this out of the box, but it’s easy to get started either manually at the console or via scripts that perform mutations. Once you have those two components, point some chaos at your cloud dev or testing environments. Pay particular attention to the completeness of the resilience (i.e., did all changes get remediated?) and the mean time to remediation (MTTR), which should be measured in mere minutes, not hours or days.

Leveraging cloud security chaos engineering will help ensure your cloud security efforts cover all critical cloud resources, such as network configuration, security group rules, identity and access management, and resource access policies, and enable automatic recovery from all misconfiguration events. It will also enable you to demonstrate full compliance with any relevant laws, industry regulations and frameworks such as GDPR, HIPAA, NIST 800-53, PCI, and the CIS Benchmarks, as well as all internal enterprise security policies.

Chaos is a constant in our world, and resilience is the only defense. That includes the configuration and security of cloud computing services and APIs.