Chaos testing has two unusual connections to the movie industry. 

First, the practice of chaos testing is the brainchild of none other than the Netflix engineering team. This white-knuckle approach to resilience testing helped them build and operate their massive streaming infrastructure. 

Second, there's Jurassic Park.  

Early in Spielberg's CGI epic, two great minds argue about the correct approach to systems design. John Hammond, the park owner, proudly claims that he anticipated every possible problem and installed safeguards to protect visitors. Dr. Ian Malcolm, an expert in chaos theory, argues that you can't predict every eventuality. No matter how organized you are, no matter how developed your plans, "life finds a way" of causing havoc. 

When the antagonist Nedry shuts down the security system, he triggers a cascading system failure that leads to two hours of dinosaur-related mayhem, proving Dr. Malcolm right: you can't stop chaos. 

And that's the principle of chaos testing. Instead of waiting for the inevitable catastrophe to happen, you create one in a controlled environment, measure the outcomes, and fix them before they become a problem. 

Table of Contents:

  1. What is Chaos Testing?
  2. Chaos Testing Principles
  3. Getting Started with Chaos Testing
  4. How Xplenty Can Help


What is Chaos Testing? 

Chaos testing is a type of resilience testing designed for the cloud computing era. Today's networks are widely distributed and need a high level of fault tolerance. To assess this, you need a new approach to testing. 

In the early part of the last decade, Netflix still used traditional development models, including resilience testing. These tests involved working with a finished product in a test environment, manipulating some of the environment settings, and seeing how the product coped under pressure. 

But this model didn't address some of the problems that emerged when working with the new AWS infrastructure. In the public cloud, services move between nodes – and some nodes drop out unexpectedly. The result was a hit to the customer experience: slow streams and dropped connections. 

Netflix decided to challenge the existing software development model. Instead of seeing failure as an occasional exception, they would assume failure as a rule. Chaos is inevitable, especially in a massive public cloud infrastructure. So, how do you plan around it? 

The Netflix engineering team developed Chaos Monkey, one of the first chaos testing tools. Chaos Monkey creates faults by disabling nodes in the production network – that is, the live network that serves movies and TV to Netflix users. 

In a white paper, Netflix described how their chaos testing process works:

  1. Define the steady-state: Identify the key variables that indicate when the network is functioning normally. This includes environmental variables (such as network performance) and customer metrics (such as site availability or streaming speed).
  2. Create a control group and test group: Make two comparable test groups. This will allow the chaos testers to account for any external factors, such as AWS issues. 
  3. Crash the test group: Chaos Monkey switches off nodes serving the test group, limiting the impact to that group rather than the entire user base. However, the test group does contain live users who are streaming content. 
  4. Compare the test group and control group: If the system is resilient, both groups should remain in the steady state. Any divergence in the key variables points to an underlying resilience issue. 
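To make those four steps concrete, here's a minimal Python sketch of the experiment loop. The metric source, fault injector, baseline, and tolerance are all hypothetical placeholders, not Netflix's actual tooling:

```python
import random
import statistics

# --- Hypothetical stand-ins for real monitoring and fault-injection hooks ---

def stream_starts_per_second(group):
    """Return 60 samples of the steady-state metric (stubbed with fake data)."""
    return [random.gauss(100, 5) for _ in range(60)]

def terminate_random_node(group):
    """Stand-in for Chaos Monkey switching off a node serving this group."""
    print(f"Terminating a random node serving {group}...")

# 1. Define the steady state: stream starts per second within 5% of baseline.
BASELINE = 100.0
TOLERANCE = 0.05

def in_steady_state(samples):
    return abs(statistics.mean(samples) - BASELINE) / BASELINE <= TOLERANCE

# 2. Create a control group and a test group.
control, test = "control-group", "test-group"

# 3. Crash the test group only.
terminate_random_node(test)

# 4. Compare: if the system is resilient, both groups stay in the steady state.
for group in (control, test):
    ok = in_steady_state(stream_starts_per_second(group))
    print(f"{group}: {'steady' if ok else 'possible resilience issue'}")
```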

Chaos testing pushes Netflix's engineering team toward a resilience-first model. Like Dr. Malcolm, they assume that chaos will eventually emerge in any system. Instead of avoiding it, they build systems that can respond and adapt to failure. 

Chaos Testing Principles

Netflix's white paper outlines five key principles of chaos testing: 

1) Build a Hypothesis around Steady-State Behavior

With any test, it's essential to start by defining the metrics. Chaos testing works best when it measures a system's outputs rather than its internal attributes. For example, Netflix focuses on customer-facing metrics like latency and dropped connections.

To identify the most relevant metrics in your chaos tests, start by asking: who feels the impact of a major systems failure? For example, if your data pipeline goes down, it might hinder your analytics and BI tools. This, in turn, might impact the decision-makers within your business. 
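One lightweight way to encode such a hypothesis is as a set of named metrics with acceptable bounds. The metric names and numbers below are illustrative; substitute whatever your own stakeholders care about:

```python
from dataclasses import dataclass

@dataclass
class SteadyStateMetric:
    name: str
    baseline: float
    tolerance: float  # allowed relative deviation from the baseline

    def holds(self, observed: float) -> bool:
        return abs(observed - self.baseline) / self.baseline <= self.tolerance

# Hypothetical steady-state hypothesis for a data pipeline feeding BI tools.
hypothesis = [
    SteadyStateMetric("p95_query_latency_ms", baseline=250.0, tolerance=0.10),
    SteadyStateMetric("pipeline_freshness_min", baseline=5.0, tolerance=0.20),
    SteadyStateMetric("dropped_connections_pct", baseline=0.1, tolerance=0.50),
]

observed = {
    "p95_query_latency_ms": 260.0,
    "pipeline_freshness_min": 5.5,
    "dropped_connections_pct": 0.12,
}

for metric in hypothesis:
    status = "OK" if metric.holds(observed[metric.name]) else "VIOLATED"
    print(f"{metric.name}: {status}")
```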

2) Vary Real-World Events

Chaos is, well, chaotic. The first iteration of the Chaos Monkey tool simulated a specific failure: one node in the network becoming unavailable. Over the years, Netflix has developed the Simian Army, a suite of chaos testing tools that replicate a range of different failures, including a complete regional failure of AWS. 

In any chaos test, it's important to think about all the different things that can go wrong, including the most catastrophic system failures. A natural disaster could take out on-premises systems, while cloud services might go offline in a large-scale DNS attack. How quickly could you recover from events like these? 
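In practice, varying real-world events often means maintaining a catalog of fault injectors and sampling from it, so no single failure mode dominates your tests. The fault names and weights below are hypothetical examples, not Simian Army commands:

```python
import random

# Hypothetical fault catalog, weighted roughly by real-world likelihood.
FAULTS = [
    ("kill_single_node", 0.50),        # one node becomes unavailable
    ("inject_network_latency", 0.25),  # degraded links between services
    ("exhaust_disk_space", 0.15),      # a dependency fills its storage
    ("simulate_region_outage", 0.10),  # an entire cloud region disappears
]

def pick_fault():
    names, weights = zip(*FAULTS)
    return random.choices(names, weights=weights, k=1)[0]

print(f"Next experiment will run: {pick_fault()}")
```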

3) Run Experiments in Production

Testing your software in a dev environment is like testing your dinosaur park without any dinosaurs. It will give you some useful data, but you won't see how your infrastructure performs in a real-world scenario.

If Netflix can run tests in production, so can you. However, it's important to segment your experiments so that you have a control group. A control group helps isolate noise in the test data, such as an unrelated issue with your cloud host or data warehouse.
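A common way to segment production traffic is to hash a stable identifier, such as a user ID, into buckets, keeping the vast majority of users in the control group. A minimal sketch, assuming you can tag each request with a user ID:

```python
import hashlib

TEST_GROUP_PCT = 1  # expose only 1% of users to the experiment

def assign_group(user_id: str) -> str:
    """Deterministically assign a user to the 'test' or 'control' group."""
    bucket = int(hashlib.sha256(user_id.encode()).hexdigest(), 16) % 100
    return "test" if bucket < TEST_GROUP_PCT else "control"

for uid in ("user-1001", "user-1002", "user-1003"):
    print(uid, "->", assign_group(uid))
```

Because the assignment is deterministic, the same users stay in the same group for the duration of the experiment, which keeps the comparison between groups clean.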

4) Automate Experiments to Run Continuously

Netflix recommends a DevOps-style approach to chaos engineering, as manual testing is time-consuming and unsustainable. Many of the Simian Army tools can run automatically on a schedule and issue reports if they detect any issues. 

This approach does require some DevOps practices to be in place. You'll need a team that can act on resilience reports immediately, with the resources to build, test, and deploy fixes as quickly as possible. 
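The automation itself can be as simple as a scheduled job that runs one small experiment, checks the steady state, and raises an alert on failure. A minimal sketch, where run_experiment and report_issue are hypothetical hooks into your own fault-injection and alerting tools:

```python
import time

def run_experiment() -> bool:
    """Inject one fault and return True if the steady state held (stubbed)."""
    return True  # replace with real fault injection and metric checks

def report_issue(message: str) -> None:
    """Hypothetical hook into your alerting or ticketing system."""
    print(f"ALERT: {message}")

# In production this loop would typically be a cron job or CI schedule.
while True:
    if not run_experiment():
        report_issue("Steady state violated during scheduled chaos run")
    time.sleep(3600)  # one experiment per hour
```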

5) Minimize Blast Radius

You get a lot of great data when you discover a resilience issue in your production environment. Unfortunately, it means that you've also probably directly affected some of your users. For instance, if you are watching Netflix when they run an unsuccessful chaos test, your movie might stop streaming.

You can avoid this problem by doing two things: 

  1. Perform tests in a controlled fashion so that you can easily roll back any changes. It's often better to use a test platform like Simian Army than to switch off servers manually. 
  2. Keep a close eye on key metrics during the test. If any customer-facing metric starts to drop, roll back your changes immediately – or configure your test tools to return everything to the previous state automatically. 

Brief, controlled chaos testing should yield sufficient data without impacting the customer experience. 
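Both safeguards can be combined into a watchdog that monitors a customer-facing metric while the experiment runs and triggers a rollback the moment the metric degrades. The monitoring and rollback functions below are hypothetical placeholders for your own tooling:

```python
import time

ABORT_THRESHOLD = 0.95  # abort if availability drops below 95%

def current_availability() -> float:
    """Hypothetical hook into your monitoring system (returns 0.0-1.0)."""
    return 0.99

def rollback_experiment() -> None:
    """Hypothetical hook that restores the previous state."""
    print("Rolling back: restarting terminated nodes, restoring routing.")

def run_with_guardrail(duration_s: int = 300, check_every_s: int = 10) -> str:
    deadline = time.time() + duration_s
    while time.time() < deadline:
        if current_availability() < ABORT_THRESHOLD:
            rollback_experiment()
            return "aborted"
        time.sleep(check_every_s)
    return "completed"

print(run_with_guardrail(duration_s=30))
```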

Getting Started with Chaos Testing

Chaos testing is relatively easy to perform if you're using cloud-based systems. To get started right now, follow these steps: 

1. Speak to all stakeholders: Because you're working with production data, it's essential to talk to anyone who may be impacted by a service loss. This can include internal users, such as analytics experts reliant on fresh data, or customer relations experts who would have to deal with any service outage. 

2. Set up chaos testing tools: The Simian Army suite is available under the Apache 2.0 license, or you can develop an in-house chaos testing tool.

3. Choose a chaos level: You can use testing tools to create different levels of chaos (see the sketch after these steps). These are generally defined as: 

  • Low chaos: Easily recoverable failures
  • Medium chaos: Easily recoverable failures that may cause some availability issues
  • High chaos: Crisis-level failures that cause substantial service unavailability
  • Extreme chaos: Catastrophic failures that may result in data loss

Related Reading: What is Chaos Engineering?

4. Respond to test reports: When you have a failure report, you'll need to design an appropriate solution. This might be a small fix, like adding redundancy somewhere in the network. Alternatively, you may need to consider a substantial change to your architecture. 

5. Deploy and retest: If you're running an automated test schedule, you should ideally have your fix in place before the next test cycle. 
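One way to encode the chaos levels from step 3 is as an enum that gates which faults a test run may inject. The mapping below is an illustrative sketch, not a standard taxonomy:

```python
from enum import Enum

class ChaosLevel(Enum):
    LOW = 1      # easily recoverable failures
    MEDIUM = 2   # recoverable failures with some availability impact
    HIGH = 3     # crisis-level failures, substantial unavailability
    EXTREME = 4  # catastrophic failures that may risk data loss

# Hypothetical faults introduced at each level (lower levels are inherited).
FAULTS_BY_LEVEL = {
    ChaosLevel.LOW: ["kill_single_node"],
    ChaosLevel.MEDIUM: ["inject_network_latency"],
    ChaosLevel.HIGH: ["simulate_region_outage"],
    ChaosLevel.EXTREME: ["corrupt_replica_storage"],
}

def faults_for(level: ChaosLevel) -> list[str]:
    """Return every fault allowed at or below the given chaos level."""
    return [fault
            for lvl, faults in FAULTS_BY_LEVEL.items()
            if lvl.value <= level.value
            for fault in faults]

print(faults_for(ChaosLevel.MEDIUM))  # ['kill_single_node', 'inject_network_latency']
```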

Remember – an error you uncover in testing is an error that would otherwise surface for customers and service users. Chaos testing simply simulates real events that happen all the time. 


How Xplenty Can Help

Jurassic Park really is the story of a chaos test. The pivotal moment comes when one of the engineers, for nefarious reasons, takes a crucial system offline. The result: an unpredictable cascading systems failure.

When you're working with data, a system failure probably won't lead to a T-Rex breaking loose. But system failures can cascade in unpredictable and catastrophic ways, leading to service unavailability or loss of data. 

Xplenty creates a neat, manageable data pipeline between your production databases and your data warehouse. It's reliable, with robust security. A unified approach to data aggregation helps reduce the potential for chaos in your infrastructure. 

If you want to run chaos tests on your data infrastructure, Xplenty is the ideal platform. You have full visibility of data moving through your ETL process so that you can track against steady-state performance with ease. If you'd like to see how Xplenty can help you keep order, book a consultation and schedule a demo today.