As the web has grown increasingly complex alongside technologies like cloud computation, distributed systems, and microservices, system failures are harder to predict. To prevent outages, companies large and small have turned to chaos engineering as a solution.
Chaos engineering lets you predict and identify potential failures by breaking things on purpose. This way, you can find and fix failures before they become outages. Chaos engineering is a growing trend for DevOps and IT teams. Even companies like Netflix and Amazon use these principles in product development.
If you are new to chaos engineering, you’re in the right place. Today, we will introduce its principles in depth and show you how to get started with Kubernetes.
Chaos engineering is the discipline of experimenting on a system to build confidence in the system’s capability to withstand turbulent conditions in production. With chaos engineering, we intentionally try to break our system under certain stresses to determine potential outages, locate weaknesses, and improve resiliency.
Chaos engineering is different from software testing or fault injection. Chaos engineering is used for all sorts of requirements and unpredictable situations, including traffic spikes, race conditions, and more.
With chaos engineering, we are trying to learn how an entire system reacts when an individual component is failing.
For example, chaos engineering can help answer functionality questions like these:
History: Chaos Engineering was first developed at Netflix in 2008 when their subscription streaming service was transitioned to the public cloud. Netflix’s engineers noted that they needed new ways of testing this system for resiliency.
Chaos Monkey was created in 2010 for that purpose. Since then, chaos engineering has grown, and companies like Google, Facebook, Amazon, and Microsoft have implemented similar testing models.
Chaos engineering offers many benefits that other forms of software testing or failure testing cannot. Failure tests can only examine a single condition in a binary breakdown. This doesn’t allow us to test a system under unprecedented or unexpected stresses.
Chaos engineering, on the other hand, can account for complex, diverse, and real-world issues or outages. With chaos engineering, we can fix issues and gain new insights about an application for future improvements.
Chaos experiments help to reduce failures and outages while improving our understanding of our system design. Chaos engineering improves a service’s availability and durability, so customers are less disrupted by outages. Chaos engineering can also help prevent revenue losses and lower maintenance costs at the business level.
Before we start defining and running chaos experiments, we need to pick a tool. Chaos engineering is not yet a segment of the market that is well established and developed. Nevertheless, there are several tools we can pick from.
One of the most notable tools for chaos engineering is Simian Army, developed by Netflix. Simian Army is best suited to cloud services, particularly on AWS. It can generate failures and detect abnormalities. Chaos Monkey, also from Netflix, is a resiliency tool that randomly terminates instances.
PowerfulSeal is a powerful tool for testing Kubernetes clusters, and Litmus can be used for stateful workloads on Kubernetes. Pumba is used with Docker for chaos testing and network emulation. Gremlin offers a Chaos Engineering platform that now supports testing on Kubernetes clusters.
Chaos Dingo is commonly used for Microsoft Azure, and Chaos HTTP Proxy can be used to introduce failures into HTTP requests.
As more teams have conducted experiments over the years, they’ve learned how to apply chaos engineering approaches to their systems most effectively. These best practices have become the core principles of chaos engineering. Let’s discuss the ones that every team should implement in their experiments.
You want to build a hypothesis around a steady-state behavior. Then, you want to perform potentially damaging actions on the network latency, applications, nodes, or any other component of the system.
You want to create violent situations to confirm that your steady-state hypothesis holds. You validate that the system is in a specific state, perform certain actions, and finish with the same validation to confirm that the state did not change.
You want to do chaos engineering based on real-world events. In other words, only replicate events that are likely to happen in your system. This includes an application crash, a network disruption, or a node failure.
You want to run chaos experiments in production, since that is the “real” system. If you perform chaos experiments only during staging or integration, you cannot get a real picture of how the system in production behaves.
You want to automate your experiments to run continuously or be executed as part of continuous delivery pipelines. This could mean every hour, every few hours, every day, every week, or every time some event happens in your system. You also want to run experiments every time you deploy a new release.
You should reduce the blast radius of your experiments. When you start with chaos experiments, you want to start small and build up as you gain confidence in a system. Eventually, you should do experiments across the whole system.
Summary of Principles
- Build a hypothesis around a steady-state
- Simulate real-world events
- Run experiments in production
- Automate experiments and run them continuously
- Minimize blast radius
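The first principle, building a hypothesis around a steady state, can be sketched as a small loop: check the state, inject a failure, then check the state again. The Python below is only an illustration under stated assumptions. The names run_experiment, count_running, and kill_one are hypothetical, the "cluster" is an in-memory dictionary, and the status strings simply mirror the ones that appear in the experiment logs later in this article.

```python
import time
from typing import Callable


def run_experiment(
    probe: Callable[[], int],
    tolerance: int,
    action: Callable[[], None],
    pause_seconds: float = 0.0,
) -> str:
    """Run one chaos experiment and return a status string."""
    # 1. Verify the steady state before touching anything.
    if probe() != tolerance:
        return "failed"
    # 2. Inject the failure.
    action()
    # 3. Optionally give the system time to react.
    time.sleep(pause_seconds)
    # 4. Verify the steady state again.
    if probe() != tolerance:
        return "deviated"
    return "completed"


# Hypothetical in-memory stand-in for a cluster with one running replica.
replicas = {"running": 1}


def count_running() -> int:
    return replicas["running"]


def kill_one() -> None:
    replicas["running"] -= 1


status = run_experiment(count_running, tolerance=1, action=kill_one)
print(status)
```

Because nothing recreates the replica in this toy model, the second check no longer matches the expected value and the experiment reports a deviation. A second run would stop at the first check, since the steady state is already broken.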
Now let’s apply all that theory to a simple real-world example to better understand chaos engineering. We will be using Kubernetes. To begin, we create a Kubernetes cluster. Then, we will deploy our simple application and destroy it. Finally, we will show you how to define steady states, which are crucial for chaos engineering.
Note: If you are new to Kubernetes, we recommend the course A Practical Guide to Kubernetes before continuing with chaos engineering. Or, you can follow along just to get an idea of how basic chaos engineering looks.
First, we need a Kubernetes cluster to destroy. You can choose Minikube, Docker Desktop, AKS, EKS, and GKE. Below, we use Docker Desktop to create a cluster. If you would like to learn how to create a cluster using the other tools, please refer to the course The DevOps Toolkit: Kubernetes Chaos Engineering.
# Source: https://gist.github.com/f753c0093a0893a1459da663949df618

####################
# Create A Cluster #
####################

# Open Docker Preferences, select the Kubernetes tab, and select the "Enable Kubernetes" checkbox

# Open Docker Preferences, select the Resources > Advanced tab, set CPUs to 4, and Memory to 6.0 GiB, and press the "Apply & Restart" button

#######################
# Destroy the cluster #
#######################

# Open Docker Troubleshoot, and select the "Reset Kubernetes cluster" button

# Select *Quit Docker Desktop*
We need to deploy a demo application, which we’ve prepared below. We’re going to clone the repository vfarcic/go-demo-8, created by Viktor Farcic.
git clone https://github.com/vfarcic/go-demo-8.git
Next, we enter the directory where we cloned the repository and make sure it is up to date.
cd go-demo-8
git pull
Now, create a namespace called go-demo-8.
kubectl create namespace go-demo-8
Now, let’s take a quick look at the application we’re going to deploy, located in the k8s/terminate-pods directory, in a file called pod.yaml.
---
apiVersion: v1
kind: Pod
metadata:
  name: go-demo-8
  labels:
    app: go-demo-8
spec:
  containers:
  - name: go-demo-8
    image: vfarcic/go-demo-8:0.0.1
    env:
    - name: DB
      value: go-demo-8-db
    ports:
    - containerPort: 8080
    livenessProbe:
      httpGet:
        path: /
        port: 8080
    readinessProbe:
      httpGet:
        path: /
        port: 8080
    resources:
      limits:
        cpu: 100m
        memory: 50Mi
      requests:
        cpu: 50m
        memory: 20Mi
This app is defined as a single Pod with one container called go-demo-8. The container also defines a livenessProbe and a readinessProbe, both of which send HTTP requests to / on port 8080, along with resource limits and requests.
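To make the two probes a little more concrete, here is a deliberately simplified Python model of how Kubernetes treats an httpGet probe result. It relies on two documented behaviors: a response status from 200 up to (but not including) 400 counts as success, and a failing liveness probe leads to a container restart while a failing readiness probe takes the Pod out of Service traffic. The function names are hypothetical, and the sketch ignores details such as failure thresholds and probe intervals.

```python
def probe_succeeded(status_code: int) -> bool:
    """An httpGet probe passes for any status in the 200-399 range."""
    return 200 <= status_code < 400


def kubelet_reaction(probe_kind: str, status_code: int) -> str:
    """Describe, in words, what a failing probe of each kind leads to."""
    if probe_succeeded(status_code):
        return "no action"
    if probe_kind == "liveness":
        return "restart the container"
    if probe_kind == "readiness":
        return "stop sending traffic to the Pod"
    raise ValueError(f"unknown probe kind: {probe_kind}")


print(kubelet_reaction("liveness", 200))
print(kubelet_reaction("liveness", 500))
print(kubelet_reaction("readiness", 503))
```

Note that neither probe recreates a Pod that has been deleted. They only react to an unhealthy container inside a Pod that still exists.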
Now, we apply that definition to our cluster inside the go-demo-8 Namespace. This will get our application up and running as a Pod.
kubectl --namespace go-demo-8 apply --filename k8s/terminate-pods/pod.yaml
Now it’s time to apply some damage and destroy our application!
To run chaos experiments against our application, we can use Chaos Toolkit. The toolkit does not support Kubernetes out of the box, so we need a plugin for anything beyond its basic features. Let’s install the Kubernetes plugin using pip.
pip install -U chaostoolkit-kubernetes
Note: Explore the Chaos Toolkit plugin using the discover command to see all its features, options, and arguments.
Let’s start destroying stuff. Look at the first definition that we will use, located in the chaos directory, in the file terminate-pod.yaml.
cat chaos/terminate-pod.yaml
This gives us the following output:
version: 1.0.0
title: What happens if we terminate a Pod?
description: If a Pod is terminated, a new one should be created in its places.
tags:
- k8s
- pod
method:
- type: action
  name: terminate-pod
  provider:
    type: python
    module: chaosk8s.pod.actions
    func: terminate_pods
    arguments:
      label_selector: app=go-demo-8
      rand: true
      ns: go-demo-8
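The provider block in this definition names a Python module, a function inside it, and the arguments to pass. Conceptually, that maps onto a dynamic import followed by a function call. The sketch below shows that idea only. It is not Chaos Toolkit's actual implementation, the helper call_python_provider is hypothetical, and it uses json.dumps from the standard library as a stand-in so it runs without the Kubernetes plugin or a cluster.

```python
import importlib
from typing import Any, Dict


def call_python_provider(provider: Dict[str, Any]) -> Any:
    """Import provider['module'] and call provider['func'] with the arguments."""
    if provider.get("type") != "python":
        raise ValueError("this sketch only handles python providers")
    module = importlib.import_module(provider["module"])
    func = getattr(module, provider["func"])
    return func(**provider.get("arguments", {}))


# Stand-in provider: same shape as the one in terminate-pod.yaml, but it
# points at a standard-library function instead of chaosk8s.pod.actions.
demo_provider = {
    "type": "python",
    "module": "json",
    "func": "dumps",
    "arguments": {"obj": {"b": 1, "a": 2}, "sort_keys": True},
}

result = call_python_provider(demo_provider)
print(result)
```

Read the same way, the definition above asks for terminate_pods from chaosk8s.pod.actions to be called with label_selector, rand, and ns as keyword arguments.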
Now that we have seen the definition, let’s run terminate-pod.yaml.
chaos run chaos/terminate-pod.yaml
The output is as follows:
[... INFO] Validating the experiment's syntax
[... INFO] Experiment looks valid
[... INFO] Running experiment: What happens if we terminate a Pod?
[... INFO] No steady state hypothesis defined. That's ok, just exploring.
[... INFO] Action: terminate-pod
[... INFO] No steady state hypothesis defined. That's ok, just exploring.
[... INFO] Let's rollback...
[... INFO] No declared rollbacks, let's move on.
[... INFO] Experiment ended with status: completed
After the initial validation, it ran the experiment called What happens if we terminate a Pod? and found that there is no steady state hypothesis defined. Judging by the output, there is one action: terminate-pod.
Next, it went back to the steady state hypothesis and determined that there is none. Then, it tried to roll back and found that no rollbacks were declared. All we have done so far is execute an action to terminate a Pod. We can see the result in the last line: Experiment ended with status: completed.
Now, let’s output the exit code of the previous command.
echo $?
If we get 0, this means success in Linux. Any other value means failure. Exit codes are how a command tells the system whether it succeeded or failed.
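Exit codes matter most once experiments are automated, because a pipeline can use them to decide whether to continue. As a rough sketch, the Python below runs a command and reads its exit code. The helper run_and_report is hypothetical, and the two commands are stand-ins that simply exit with 0 and 1. In a real pipeline, the command would be the chaos run invocation shown earlier.

```python
import subprocess
import sys
from typing import Sequence


def run_and_report(command: Sequence[str]) -> int:
    """Run a command and return its exit code (0 means success)."""
    completed = subprocess.run(list(command))
    return completed.returncode


# Stand-in commands: one that succeeds and one that fails.
ok = run_and_report([sys.executable, "-c", "raise SystemExit(0)"])
bad = run_and_report([sys.executable, "-c", "raise SystemExit(1)"])

print("first command succeeded:", ok == 0)
print("second command succeeded:", bad == 0)
```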
Now, let’s take a look at the Pods in our Namespace.
kubectl --namespace go-demo-8 get pods
The output states that no resources were found in go-demo-8 namespace.
We deployed the single Pod and ran an experiment that destroyed it. We did not do any validations. We executed a single action to terminate a Pod, which was successful.
Above, all we did was destroy a Pod. The goal of chaos engineering, however, is to find weak points in our clusters. So, we normally start by defining a steady state that we test before and after an experiment.
If the state is the same before and after, we can conclude that our cluster is fault-tolerant for that case. In the case of Chaos Toolkit, we accomplish this by defining a steady state hypothesis.
We’re going to look at a definition that specifies the state that will be validated before and after an action.
cat chaos/terminate-pod-ssh.yaml
The part that is new compared to the previous definition is shown below, with each added line prefixed by >:
> steady-state-hypothesis:
>   title: Pod exists
>   probes:
>   - name: pod-exists
>     type: probe
>     tolerance: 1
>     provider:
>       type: python
>       func: count_pods
>       module: chaosk8s.pod.probes
>       arguments:
>         label_selector: app=go-demo-8
>         ns: go-demo-8
The new section is steady-state-hypothesis. Its pod-exists probe counts the Pods in the go-demo-8 Namespace that carry the label app=go-demo-8 and expects the result to match the tolerance of 1.
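To illustrate what that probe checks, here is a small Python model of counting Pods by label and comparing the count to a tolerance. It is a sketch, not the plugin's code. The functions parse_selector, count_matching, and within_tolerance are hypothetical, the Pods are plain dictionaries instead of objects from the Kubernetes API, and only simple key=value selectors and a single numeric tolerance are handled.

```python
from typing import Dict, List


def parse_selector(label_selector: str) -> Dict[str, str]:
    """Parse an equality-based selector such as 'app=go-demo-8,tier=web'."""
    pairs = (item.split("=", 1) for item in label_selector.split(",") if item)
    return {key.strip(): value.strip() for key, value in pairs}


def count_matching(pods: List[dict], label_selector: str, ns: str) -> int:
    """Count Pods in the namespace whose labels satisfy the selector."""
    wanted = parse_selector(label_selector)
    return sum(
        1
        for pod in pods
        if pod["ns"] == ns
        and all(pod["labels"].get(k) == v for k, v in wanted.items())
    )


def within_tolerance(value: int, tolerance: int) -> bool:
    """With a single number as tolerance, the probe value must equal it."""
    return value == tolerance


# Hypothetical cluster content: only the first Pod matches both filters.
pods = [
    {"ns": "go-demo-8", "labels": {"app": "go-demo-8"}},
    {"ns": "go-demo-8", "labels": {"app": "go-demo-8-db"}},
    {"ns": "default", "labels": {"app": "go-demo-8"}},
]

observed = count_matching(pods, "app=go-demo-8", "go-demo-8")
print(observed, within_tolerance(observed, 1))
```

With an empty Namespace, the count would be 0 and the tolerance check would fail, which is the situation the next run illustrates.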
Let’s run a chaos experiment to see a proper result.
chaos run chaos/terminate-pod-ssh.yaml
We get the following:
[... INFO] Validating the experiment's syntax
[... INFO] Experiment looks valid
[... INFO] Running experiment: What happens if we terminate a Pod?
[... INFO] Steady state hypothesis: Pod exists
[... INFO] Probe: pod-exists
[... CRITICAL] Steady state probe 'pod-exists' is not in the given tolerance so failing this experiment
[... INFO] Let's rollback...
[... INFO] No declared rollbacks, let's move on.
[... INFO] Experiment ended with status: failed
There is a critical issue here: Steady state probe 'pod-exists' is not in the given tolerance. The probe failed before any action was executed because the Pod we destroyed earlier no longer exists. So, our experiment failed and confirmed that the initial state doesn’t match what we want.
So, let’s apply the terminate-pods/pod.yaml definition to recreate the Pod. Then, we can see what happens when we re-run the experiment with the steady-state-hypothesis.
kubectl --namespace go-demo-8 apply --filename k8s/terminate-pods/pod.yaml
With our Pod back, we can re-run the experiment.
chaos run chaos/terminate-pod-ssh.yaml
The output is as follows:
[... INFO] Validating the experiment's syntax
[... INFO] Experiment looks valid
[... INFO] Running experiment: What happens if we terminate a Pod?
[... INFO] Steady state hypothesis: Pod exists
[... INFO] Probe: pod-exists
[... INFO] Steady state hypothesis is met!
[... INFO] Action: terminate-pod
[... INFO] Steady state hypothesis: Pod exists
[... INFO] Probe: pod-exists
[... INFO] Steady state hypothesis is met!
[... INFO] Let's rollback...
[... INFO] No declared rollbacks, let's move on.
[... INFO] Experiment ended with status: completed
Now, we see that the probe pod-exists confirmed a correct state and the action terminate-pod was executed. We can also see that the steady state was re-evaluated. The Pod existed before the action, and the Pod existed after the action. But how can the Pod exist if we destroyed it?
The experiment didn’t fail because the probe ran immediately after the action. Kubernetes did not have enough time to remove the Pod entirely. So, we need to add a pause to make the experiment more useful. Let’s look at a definition that does that.
cat chaos/terminate-pod-pause.yaml
It gives us the following output:
>   pauses:
>     after: 10
We see here that we added a pauses section to the action that terminates the Pod. Now, when we execute the action to terminate the Pod, the system will wait 10 seconds before validating our state.
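The effect of the pause can be illustrated with a toy model in which a terminated Pod stays visible for a short while before it disappears. Everything in this Python sketch is an assumption made for illustration: FakePod and probe_after_action are hypothetical names, the clock is simulated instead of real, and the 5-second removal delay is an arbitrary value, not something measured on a cluster.

```python
class FakePod:
    """A Pod that remains visible for removal_delay seconds after termination."""

    def __init__(self, removal_delay: float) -> None:
        self.removal_delay = removal_delay
        self.terminated_at = None

    def terminate(self, now: float) -> None:
        self.terminated_at = now

    def exists(self, now: float) -> bool:
        if self.terminated_at is None:
            return True
        return now - self.terminated_at < self.removal_delay


def probe_after_action(pause: float, removal_delay: float = 5.0) -> int:
    """Terminate the Pod, wait for `pause` simulated seconds, then count Pods."""
    clock = 0.0
    pod = FakePod(removal_delay)
    pod.terminate(clock)
    clock += pause  # simulated waiting, no real sleep
    return 1 if pod.exists(clock) else 0


print("Pods seen with no pause:", probe_after_action(pause=0))
print("Pods seen after a 10s pause:", probe_after_action(pause=10))
```

Without a pause, the probe still sees the Pod that is on its way out, so the hypothesis appears to hold. With a pause longer than the removal delay, the probe sees the real outcome.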
Let’s see what we get if we execute this experiment with our pause.
chaos run chaos/terminate-pod-pause.yaml
It gives us the following output:
[... INFO] Validating the experiment's syntax
[... INFO] Experiment looks valid
[... INFO] Running experiment: What happens if we terminate a Pod?
[... INFO] Steady state hypothesis: Pod exists
[... INFO] Probe: pod-exists
[... INFO] Steady state hypothesis is met!
[... INFO] Action: terminate-pod
[... INFO] Pausing after activity for 10s...
[... INFO] Steady state hypothesis: Pod exists
[... INFO] Probe: pod-exists
[... CRITICAL] Steady state probe 'pod-exists' is not in the given tolerance so failing this experiment
[... INFO] Let's rollback...
[... INFO] No declared rollbacks, let's move on.
[... INFO] Experiment ended with status: deviated
[... INFO] The steady-state has deviated, a weakness may have been discovered
This time, the probe failed and said that steady state probe 'pod-exists' is not in the given tolerance so failing this experiment. Now, we gave Kubernetes enough time to remove the Pod, and then we validated whether the Pod is still there.
The system came back to us saying that the Pod is not present. We can output the exit code of the last command to see that it did indeed fail.
Awesome! We’ve run a chaos experiment against our application with a steady-state hypothesis and learned the basics of chaos engineering. Next, we would want to fix the weakness that the experiment uncovered to make the application fault-tolerant.
From there, we can run many more destructive experiments against our application, such as destroying a network, draining nodes, and testing availability.
To learn how to implement more chaos experiments, Educative’s course The DevOps Toolkit: Kubernetes Chaos Engineering is the best next step. You’ll be introduced to the different types of experiments you can run in chaos engineering. Towards the end of the course, you will learn how to run experiments in a Kubernetes cluster. By the end, you’ll be a confident chaos engineer.
Happy learning!