Use Chaos Monkey to push engineers to build resilient cloud services

Netflix’s engineering team is good at sharing the tools it creates and keeping them updated to serve different needs. Chaos Monkey is the latest of these tools to receive a considerable overhaul.

Chaos Monkey

The tool simulates the unexpected disappearance of random servers that run inside a production environment by simply switching them off.

As Lorin Hochstein and Casey Rosenthal from the Chaos Engineering Team at Netflix noted, they are “taking the pain of disappearing servers and bringing that pain forward” to push engineers to think about redundancy and automation and build them into their code.

“We value Chaos Monkey as a highly effective tool for improving the quality of our service,” they added.

The team has released version 2.0 of Chaos Monkey, and it comes with new features, including more choices when scheduling terminations, grouping targets and specifying exceptions. Also, they’ve added cross-account terminations, automatic disabling during an outage, a method to track terminations, and more.
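
To make the ideas of grouping targets and specifying exceptions more concrete, here is a minimal Python sketch. It is not Chaos Monkey's code or configuration format: the function name pick_victims, the instance fields id and cluster, and the policy of choosing at most one instance per group are all assumptions made purely for illustration.

```python
import random
from collections import defaultdict


def pick_victims(instances, group_by="cluster", exceptions=frozenset(), rng=None):
    """Pick one random instance per group, skipping excepted instances.

    Hypothetical helper for illustration only; it does not call any cloud API
    and only returns the ids that *would* be terminated.
    """
    rng = rng or random.Random()
    groups = defaultdict(list)
    for inst in instances:
        # Instances listed as exceptions are never eligible.
        if inst["id"] in exceptions:
            continue
        groups[inst[group_by]].append(inst)
    # One victim per group, chosen uniformly at random among eligible members.
    return {key: rng.choice(members)["id"] for key, members in sorted(groups.items())}


# Made-up inventory: two clusters, with one instance protected by an exception.
instances = [
    {"id": "i-1", "cluster": "web"},
    {"id": "i-2", "cluster": "web"},
    {"id": "i-3", "cluster": "web"},
    {"id": "i-4", "cluster": "api"},
]
victims = pick_victims(instances, exceptions={"i-3"}, rng=random.Random(0))
print(victims)
```

Passing a seeded random.Random makes a run repeatable for testing; a real scheduler would also need to decide when terminations happen, which this sketch leaves out.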

Finally, unlike previous versions, this one can only terminate instances. Users who rely on the wider range of failure modes provided by earlier versions (e.g. burning up CPU or taking disks offline) should consider whether they actually want to upgrade and lose those options.

Chaos Monkey is part of Netflix’s Simian Army, a suite of tools for making sure an organization’s cloud is safe and highly available.

This version of Chaos Monkey is available as a standalone service and, according to its creators, it should work with any backend that Spinnaker (Netflix’s continuous delivery platform) supports: AWS, GCP, Azure, Kubernetes, and Cloud Foundry.
