Applying Chaos to Production – Chaos Engineering 101

Jack Johnson
9 min read · Mar 12, 2021

“What happens when it all fails?”, asked our CTO when we spoke about implementing a new cloud-based infrastructure. The idea that at any point a key component of your infrastructure can fail is among the top concerns of most managers and engineers looking after their systems. The easy answer is “let’s make it resilient”. This approach has been around for a long time and, as the possibilities of computing ever expand, has become the focal point of any cloud architect’s initial scribblings.

With the widespread adoption of cloud computing came the idea that servers and services should be able to heal themselves. Now, with services such as AWS Auto Scaling, Amazon Aurora and AWS Lambda on offer, it’s easier than ever to make your solutions scale on demand and idle when demand is low.

Testing this scaling, and proving that your solutions can perform beyond the threshold of expected traffic is essential, as it ensures that you’re enabling business growth and reacting well to over-subscription.

But now we must ask ourselves, “how do we know this will continue to work during a disaster?”…

How it used to be done

An engineer’s best guess at evaluating a working scalable solution was to turn off a service during testing and check the results against the time of the shutdown. Not only is this time-consuming for the engineer, it also produces inaccurate reporting, causing you to invest precious time and money in a resilience test whose only output is an ‘it seems to work’ attitude.

Enter Netflix…

Netflix’s approach

It’s 2010, four years into providing streaming services, and Netflix has 15 million subscribers. Their growth is unprecedented at this point, and little did they know that only two years later that number would double. Partnering with Amazon, Netflix migrated their on-premises infrastructure to the cloud and soon found that the arsenal of opportunity they now held allowed them to experiment with automation and innovation.

The cloud is all about redundancy and fault-tolerance. Since no single component can guarantee 100% uptime (and even the most expensive hardware eventually fails), we have to design a cloud architecture where individual components can fail without affecting the availability of the entire system.

- John Ciancutti (Vice President of Personalization Technology — Netflix)

The Simian Army

Chaos Monkey was the catalyst for a suite of tools that enabled Netflix to randomly disable their production services and make sure they could survive common failures without customer impact. Not only did this provide the means to ensure their environments were self-healing, it also aided efforts to secure them, teasing out weaknesses and vulnerabilities where an attack might occur during an outage.

An army was born…

Latency Monkey induces artificial delays in our RESTful client-server communication layer to simulate service degradation and measures if upstream services respond appropriately. In addition, by making very large delays, we can simulate a node or even an entire service downtime (and test our ability to survive it) without physically bringing these instances down. This can be particularly useful when testing the fault-tolerance of a new service by simulating the failure of its dependencies, without making these dependencies unavailable to the rest of the system.
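
The delay-injection idea above can be sketched in a few lines of Node.js. This is a minimal illustration of the technique, not Netflix’s implementation: `injectLatency`, `randomDelayMs` and `fetchProfile` are hypothetical names, and the 50–200 ms bounds are arbitrary.

```javascript
// Minimal latency-injection sketch in the spirit of Latency Monkey.
// The wrapper adds a random artificial delay before any async call.

function randomDelayMs(minMs, maxMs) {
  return Math.floor(Math.random() * (maxMs - minMs + 1)) + minMs;
}

function injectLatency(fn, minMs, maxMs) {
  return async (...args) => {
    const delayMs = randomDelayMs(minMs, maxMs);
    await new Promise((resolve) => setTimeout(resolve, delayMs));
    return fn(...args);
  };
}

// Example: wrap a (fake) downstream call with 50-200 ms of extra latency
// to see how its callers cope.
const fetchProfile = async (id) => ({ id, name: 'example' });
const slowFetchProfile = injectLatency(fetchProfile, 50, 200);
```

Pushing the bounds up into the seconds range is what lets you simulate a downed dependency without actually stopping it.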

Conformity Monkey finds instances that don’t adhere to best-practices and shuts them down. For example, we know that if we find instances that don’t belong to an auto-scaling group, that’s trouble waiting to happen. We shut them down to give the service owner the opportunity to re-launch them properly.

Doctor Monkey taps into health checks that run on each instance as well as monitors other external signs of health (e.g. CPU load) to detect unhealthy instances. Once unhealthy instances are detected, they are removed from service and after giving the service owners time to root-cause the problem, are eventually terminated.
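
As a toy illustration of that detection step, here is a pure-logic sketch of my own (not Netflix’s code): `findUnhealthy`, the fleet data and the 90% threshold are all assumptions made for the example.

```javascript
// Toy Doctor Monkey-style check: flag instances whose CPU load has stayed
// above a threshold across every recent sample.

function findUnhealthy(instances, cpuThreshold = 90) {
  return instances
    .filter((i) => i.cpuSamples.every((sample) => sample > cpuThreshold))
    .map((i) => i.instanceId);
}

// Mocked fleet: in practice these samples would come from instance health
// checks or CloudWatch metrics, and flagged instances would be removed from
// service first so owners can root-cause before termination.
const fleet = [
  { instanceId: 'i-0aaa', cpuSamples: [95, 97, 99] },  // persistently hot
  { instanceId: 'i-0bbb', cpuSamples: [40, 55, 93] }   // only a brief spike
];
```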

Janitor Monkey ensures that our cloud environment is running free of clutter and waste. It searches for unused resources and disposes of them.
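
A Janitor-style sweep can be sketched as pure filtering logic. This is an illustrative assumption of mine: `findOrphanedVolumes` is a hypothetical helper, and in a real engine the volume list would come from EC2’s DescribeVolumes API rather than a hard-coded array.

```javascript
// Hypothetical Janitor-style sweep: find unattached EBS volumes older than a
// retention window. The object shapes mirror the fields that
// ec2.describeVolumes() returns.

function findOrphanedVolumes(volumes, maxAgeDays, now = Date.now()) {
  const maxAgeMs = maxAgeDays * 24 * 60 * 60 * 1000;
  return volumes
    .filter((v) => v.State === 'available') // 'available' means unattached
    .filter((v) => now - new Date(v.CreateTime).getTime() > maxAgeMs)
    .map((v) => v.VolumeId);
}

// Mocked volume descriptions for illustration.
const sampleVolumes = [
  { VolumeId: 'vol-old', State: 'available', CreateTime: '2021-01-01T00:00:00Z' },
  { VolumeId: 'vol-new', State: 'available', CreateTime: '2021-03-10T00:00:00Z' },
  { VolumeId: 'vol-used', State: 'in-use', CreateTime: '2020-06-01T00:00:00Z' }
];
```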

Security Monkey is an extension of Conformity Monkey. It finds security violations or vulnerabilities, such as improperly configured AWS security groups, and terminates the offending instances. It also ensures that all our SSL and DRM certificates are valid and are not coming up for renewal.

10–18 Monkey (short for Localization-Internationalization, or l10n-i18n) detects configuration and run time problems in instances serving customers in multiple geographic regions, using different languages and character sets.

Chaos Gorilla is similar to Chaos Monkey, but simulates an outage of an entire Amazon availability zone. We want to verify that our services automatically re-balance to the functional availability zones without user-visible impact or manual intervention.

At present (2021), Netflix don’t offer a streamlined process for deploying Chaos Monkey, only a manual installation utilising Spinnaker. So we built our own.

Approaches

Some of the chaos engines, such as Doctor Monkey, have been absorbed into cloud providers’ standard practice, allowing autoscaling groups to detect unhealthy instances and replace them with new instances from the template image. Other engines, such as Security Monkey and Janitor Monkey, are quite intrusive and, if left unchecked, could destroy your environments.

The best advice I can give to engineers who wish to go down the route of implementing chaos engineering practices is… Design is key.

To make the most of these practices, your solutions need to be air-tight. Architecting solutions around these use cases, utilising services such as AWS Auto Scaling and segregated subnet groups, is key to allowing solutions to self-heal.

Chaos Monkey in practice…

Example Solution Architecture for a LAMP stack in AWS : Jack Johnson

In this example, the solution is built as a LAMP stack, hosted across multiple availability zones, segregated by technology type and set up in autoscaling groups.

Additional Security considerations in this example have been included (AWS WAF, AWS Shield, etc.).

To test this solution against the expected demand of the client, we utilised Apache JMeter to mimic traffic flow, pushing 100,000 ‘users’ through the application over a recurring 10-minute period: logging in, connecting to multiple endpoints and submitting data within the PHP application.

The run was looped over a window of 2 hours, allowing the infrastructure to adapt and scale to a stable state. During this window, we started running our custom Chaos Monkey implementation to randomly turn off EC2 instances in the given autoscaling groups.

The results were promising during the test and, on completion, showed that with 32 EC2s terminated over the period, only a 0.006% failure rate occurred across the 1.2 million ‘users’; meaning that our environment could reliably scale and recover itself at 400% of the client’s requirement.

All we had to do next was handle the 72 failures with clever error handling and retry attempts, then roll Chaos into production (which I’ll cover in another article).

Building our own Chaos Engine

Services such as Gremlin exist to provide chaos engineering out of the box, but being able to fully control the chaos was at the top of our requirements list when planning to build this, along with being able to vary the randomness and fairness of the terminations.

So where did we start?

Our main concern was the compute layer, the element that was the most ‘customised’ in regards to our Apache rulesets and PHP code.

We knew that, since we were using an AWS Network Load Balancer (NLB) and AWS’s Aurora service, the resiliency of the load-balancing and database services was fairly solid.

We decided that building chaos that affected our compute layer would be the best first step to adapting our systems to chaos.

The code

Below is the code for building a Chaos Monkey engine that I’m coining ‘Chaos Chimp’ (for the purposes of not being sued by Netflix).

If there is interest in other engines of the Simian Army, I’ll publish some more code examples in future articles.

Prerequisites

  • Basic experience with AWS (EC2, Lambda, IAM).
  • A simple EC2 autoscaling group.
  • A Node.js 14.x lambda function created with ‘Author from scratch’.
  • Modifications made to the role of the previously created lambda function (adding AmazonEC2FullAccess and AutoScalingFullAccess).
  • A CloudWatch Event Rule that runs the lambda every minute.
AWS CloudWatch Event Rule that runs a Lambda function every minute — Jack Johnson
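
The FullAccess managed policies above are the quick path. Since the engine in this article only calls DescribeAutoScalingGroups and TerminateInstances, a tighter inline policy could be attached to the Lambda role instead; this is a suggested least-privilege alternative, not a requirement:

```json
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Action": [
        "autoscaling:DescribeAutoScalingGroups",
        "ec2:TerminateInstances"
      ],
      "Resource": "*"
    }
  ]
}
```

Scoping `Resource` down to the ARNs or tags of your test autoscaling group would further limit the blast radius.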

Defining variables

We need to define the region and the autoscaling group we want to terminate instances from; it’s best to assign these as variables and change them when needed. Also included is the declaration of AWS’s SDK package:

// Defines run variables
var autoScalingGroupName = "TestAutoScalingGroup";
var awsRegion = "eu-west-1";

// Include packages and set versioning
const AWS = require('aws-sdk');
AWS.config.update({ region: awsRegion });
AWS.config.apiVersions = {
  autoscaling: '2011-01-01',
  ec2: '2016-11-15'
};
const autoscaling = new AWS.AutoScaling();
const ec2 = new AWS.EC2();

Building logic

We are going to run this function every minute, but we don’t want an instance to terminate every time. To handle this, we build some logic into the function to (theoretically) randomise the termination:

// Begins the Lambda run handler
exports.handler = (event, context, callback) => {
  // Setting parameters for randomisation
  var min = 1;
  var max = 6;
  // Generating a random number
  var randomNumber = Math.floor(Math.random() * (max - min + 1)) + min;
  // Run the termination only if the 'dice lands on 5'
  if (randomNumber === 5) {
    // ... run termination logic ...
  } else {
    // ... skip this run ...
  }
};
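
A quick sanity check on that schedule: rolling a fair six-sided die once a minute means roughly ten terminations per hour on average, and an hour with no termination at all is extremely unlikely.

```javascript
// With the handler firing every minute and terminating on a roll of 5,
// each run has a 1-in-6 chance of killing an instance.
const perMinute = 1 / 6;

// Expected terminations across the 60 runs in an hour.
const expectedPerHour = 60 * perMinute; // 10

// Probability that a whole hour passes with no termination: (5/6)^60.
const quietHourProbability = Math.pow(5 / 6, 60); // roughly 1.8e-5
```

Tuning `min`/`max` (or the matching value) is how you vary the fairness and frequency of the chaos.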

Gathering resources

Once initiated, and upon the ‘roll of a 5’, we want to collect information about the autoscaling group and pull together an array of instance candidates for termination:

// Begins the Lambda run handler
exports.handler = (event, context, callback) => {
  // Setting parameters for randomisation
  var min = 1;
  var max = 6;
  // Generating a random number
  var randomNumber = Math.floor(Math.random() * (max - min + 1)) + min;
  // Run the termination only if the 'dice lands on 5'
  if (randomNumber === 5) {
    var message = "(" + randomNumber + ") You've hit on a quincunx, choosing an instance to terminate.";
    console.log(message);
    // Set autoscaling group name from variables
    var autoscalingParams = {
      AutoScalingGroupNames: [
        autoScalingGroupName
      ]
    };

    // Get autoscaling details
    autoscaling.describeAutoScalingGroups(autoscalingParams, function(err, data) {
      if (err) {
        callback(err, null);
      } else {
        var instances = [];
        var chosenInstance = '';
        for (var i = 0; i < data.AutoScalingGroups[0].Instances.length; i++) {
          var obj = data.AutoScalingGroups[0].Instances[i];
          instances.push(obj.InstanceId);
        }
        // ... choose and terminate an instance (next section) ...
      }
    });
  } else {
    var message = "(" + randomNumber + ") You've been lucky, no instance will be terminated.";
    console.log(message);
    callback(null, message);
  }
};

Terminating the instance

Once we have the instances of the autoscaling group together in the instances array, we randomly choose one and run the termination; this should cause the autoscaling group to self-heal and restore service.

// Begins the Lambda run handler
exports.handler = (event, context, callback) => {
  // Setting parameters for randomisation
  var min = 1;
  var max = 6;
  // Generating a random number
  var randomNumber = Math.floor(Math.random() * (max - min + 1)) + min;
  // Run the termination only if the 'dice lands on 5'
  if (randomNumber === 5) {
    var message = "(" + randomNumber + ") You've hit on a quincunx, choosing an instance to terminate.";
    console.log(message);
    // Set autoscaling group name from variables
    var autoscalingParams = {
      AutoScalingGroupNames: [
        autoScalingGroupName
      ]
    };

    // Get autoscaling details
    autoscaling.describeAutoScalingGroups(autoscalingParams, function(err, data) {
      if (err) {
        callback(err, null);
      } else {
        var instances = [];
        var chosenInstance = '';
        for (var i = 0; i < data.AutoScalingGroups[0].Instances.length; i++) {
          var obj = data.AutoScalingGroups[0].Instances[i];
          instances.push(obj.InstanceId);
        }
        // Choose a random instance from the array
        var randomInstance = instances[Math.floor(Math.random() * instances.length)];
        var message = "Instance " + randomInstance + ", you've been chosen!";
        console.log(message);
        chosenInstance = randomInstance;

        const terminateParams = {
          InstanceIds: [
            chosenInstance
          ]
        };
        // Terminate the randomly chosen instance
        ec2.terminateInstances(terminateParams, function(err, data) {
          if (err) {
            callback(err, null);
          } else {
            var message = "The following instance is being terminated: " + randomInstance;
            console.log(message);
            callback(null, message);
          }
        });
      }
    });
  } else {
    var message = "(" + randomNumber + ") You've been lucky, no instance will be terminated.";
    console.log(message);
    callback(null, message);
  }
};

Full code set available on GitHub: https://github.com/jackisaacjohnson/chaos-chimp

Note: This is an updated version of an article that I originally wrote in 2017.

