Until Next Time
This lesson recaps everything we covered in this course.
That’s it, the course is finished.
I might extend it over time. However, at this moment, if you went through all the exercises and you did the homework, you hopefully learned what chaos engineering is and gained some experience using it. You have seen some of the benefits, the upsides and the downsides, and the traps behind it. I hope that you found it useful.
Now, let’s go through our checklist one more time.
What we covered in this course
At the very beginning of the course, we defined a list of the things we’d try to accomplish. Let’s see whether we fulfilled them. As a refresher, the list of tasks is as follows.
Terminate instance of an app
The first item was to terminate an instance of an app. We went beyond that. We were terminating not only instances of an application, but also of its dependencies. We even created experiments that terminate random instances of random apps.
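As a reminder of what such an experiment can look like, here is a minimal sketch that builds a Chaos Toolkit experiment definition as a Python dictionary and prints it as JSON. It relies on the `terminate_pods` action and the `pods_in_phase` probe from the Chaos Toolkit Kubernetes plugin. The label selector `app=my-app` and the namespace `my-namespace` are hypothetical placeholders, not values from the course.

```python
import json


def terminate_pod_experiment(label_selector: str, namespace: str) -> dict:
    """Build an experiment that terminates one random pod matching the selector."""
    # Probe used as the steady-state hypothesis: matching pods must be Running.
    pods_running = {
        "name": "pods-are-running",
        "type": "probe",
        "tolerance": True,
        "provider": {
            "type": "python",
            "module": "chaosk8s.pod.probes",
            "func": "pods_in_phase",
            "arguments": {
                "label_selector": label_selector,
                "phase": "Running",
                "ns": namespace,
            },
        },
    }
    # Action that deletes a single, randomly chosen pod.
    terminate = {
        "name": "terminate-one-pod",
        "type": "action",
        "provider": {
            "type": "python",
            "module": "chaosk8s.pod.actions",
            "func": "terminate_pods",
            "arguments": {
                "label_selector": label_selector,
                "rand": True,
                "ns": namespace,
            },
        },
        # Give the cluster a moment to react before re-checking the hypothesis.
        "pauses": {"after": 10},
    }
    return {
        "version": "1.0.0",
        "title": "What happens if we terminate an instance of the application?",
        "description": "The application should keep running after losing a pod.",
        "steady-state-hypothesis": {
            "title": "The application is running",
            "probes": [pods_running],
        },
        "method": [terminate],
        "rollbacks": [],
    }


if __name__ == "__main__":
    # Hypothetical selector and namespace; replace them with your own.
    print(json.dumps(terminate_pod_experiment("app=my-app", "my-namespace"), indent=2))
```

Saving the printed JSON to a file and passing it to `chaos run` would execute the experiment, assuming the Kubernetes plugin is installed and your cluster credentials are available.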
Partially terminate network
The next item was to partially terminate a network. We used Istio VirtualServices to define network routes. It was the right choice, since Istio provides the means to manipulate the behavior of networking, and chaos experiments can leverage that. As a result, we ran experiments that terminate some of the network requests. That gave us insight into the problems networking might cause to our applications. We did not stop at discovering network-related issues. We also fixed the few problems we uncovered.
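To make the idea concrete, the following sketch generates an Istio VirtualService that aborts a percentage of HTTP requests. It uses the `fault.abort` fields from the Istio VirtualService API. The resource name, host, and namespace are hypothetical, and the function is only an illustration, not the exact definition used in the course.

```python
import json


def abort_fault_virtual_service(
    name: str, host: str, namespace: str, percent: float, http_status: int = 500
) -> dict:
    """Build a VirtualService that aborts `percent` percent of HTTP requests."""
    if not 0 <= percent <= 100:
        raise ValueError("percent must be between 0 and 100")
    return {
        "apiVersion": "networking.istio.io/v1alpha3",
        "kind": "VirtualService",
        "metadata": {"name": name, "namespace": namespace},
        "spec": {
            "hosts": [host],
            "http": [
                {
                    # Respond with the given status code instead of forwarding
                    # the request, for the given share of the traffic.
                    "fault": {
                        "abort": {
                            "httpStatus": http_status,
                            "percentage": {"value": percent},
                        }
                    },
                    # Requests that are not aborted go to the usual destination.
                    "route": [{"destination": {"host": host}}],
                }
            ],
        },
    }


if __name__ == "__main__":
    # Hypothetical names; abort half of the requests with a 500 response.
    print(json.dumps(abort_fault_virtual_service("my-app", "my-app", "my-namespace", 50), indent=2))
```

Because `kubectl apply -f` accepts JSON as well as YAML, the printed output can be applied directly, assuming Istio is installed in the cluster.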
Increase latency
We also decided that we’ll explore what happens when we increase latency. Since, by that time, we were already committed to Istio, we used it for the experiments related to latency as well. We saw that the demo application was not prepared to deal with delayed responses, so we modified the definition of a few resources. As a result, the demo application became tolerant to increased latency.
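A latency experiment can be sketched in much the same way. The helper below builds a single HTTP route entry with Istio's `fault.delay` fields, which hold back a percentage of requests for a fixed duration. The host name and the values are hypothetical, and the entry is meant to be placed in the `http` list of a VirtualService.

```python
import json


def delay_fault_route(host: str, percent: float, fixed_delay: str) -> dict:
    """Build an HTTP route that delays `percent` percent of requests."""
    if not 0 <= percent <= 100:
        raise ValueError("percent must be between 0 and 100")
    return {
        "fault": {
            "delay": {
                # Istio expects a duration string such as "2s" or "500ms".
                "fixedDelay": fixed_delay,
                "percentage": {"value": percent},
            }
        },
        "route": [{"destination": {"host": host}}],
    }


if __name__ == "__main__":
    # Hypothetical values: delay half of the requests by two seconds.
    print(json.dumps(delay_fault_route("my-app", 50, "2s"), indent=2))
```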
Simulate Denial of Service (DoS) attacks
We also simulated Denial of Service (DoS) attacks. Just as with the other network-related experiments, we used Istio for that as well.
Drain a node
Then we moved on to nodes. We saw how we could drain them, and we observed the adverse effects that draining might cause.
Delete a node
As if draining wasn’t enough, we also deleted a node. That, like most other experiments, provided some lessons that helped us improve our cluster and the applications running in it.
Create reports and send notifications
Since destruction is not a goal in itself, we dived into the generation of reports and notifications. We used Slack, and you should be able to extend that knowledge to send notifications somewhere else as well.
Run the experiments inside a Kubernetes cluster
We learned how to run the experiments inside a Kubernetes cluster. We defined Jobs for one-shot experiments that could be hooked into our continuous delivery pipelines. We also saw how to create CronJobs that run our experiments periodically.
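The relationship between the two can be sketched as follows: the same Job specification is used once for a one-shot run and is then wrapped in a CronJob for periodic runs. The image name, experiment path, and schedule are hypothetical, and a real setup would also need a ServiceAccount with sufficient permissions, which is omitted here for brevity.

```python
import json


def chaos_job_spec(image: str, experiment_path: str) -> dict:
    """Job spec that runs a single Chaos Toolkit experiment and then exits."""
    return {
        # Do not retry: a failed experiment is a finding, not a transient error.
        "backoffLimit": 0,
        "template": {
            "spec": {
                "restartPolicy": "Never",
                "containers": [
                    {
                        "name": "chaos",
                        "image": image,
                        "command": ["chaos", "run", experiment_path],
                    }
                ],
            }
        },
    }


def one_shot_job(name: str, image: str, experiment_path: str) -> dict:
    """Job for a one-shot execution, for example from a delivery pipeline."""
    return {
        "apiVersion": "batch/v1",
        "kind": "Job",
        "metadata": {"name": name},
        "spec": chaos_job_spec(image, experiment_path),
    }


def periodic_cronjob(name: str, schedule: str, image: str, experiment_path: str) -> dict:
    """CronJob that repeats the same experiment on a cron schedule."""
    return {
        "apiVersion": "batch/v1",
        "kind": "CronJob",
        "metadata": {"name": name},
        "spec": {
            "schedule": schedule,
            # Skip a run if the previous one has not finished yet.
            "concurrencyPolicy": "Forbid",
            "jobTemplate": {"spec": chaos_job_spec(image, experiment_path)},
        },
    }


if __name__ == "__main__":
    # Hypothetical image and experiment path.
    image, path = "example/chaostoolkit:latest", "/experiments/health.yaml"
    print(json.dumps(one_shot_job("chaos-health", image, path), indent=2))
    print(json.dumps(periodic_cronjob("chaos-health-periodic", "*/5 * * * *", image, path), indent=2))
```

Keeping the Job spec in one function mirrors the advice to validate an experiment as a one-shot execution first and only then schedule the same thing to run repeatedly.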
We tried quite a few other things that I probably forgot to mention.
Things we did not do
It’s just as important to mention things that we did not do. We didn’t go beyond Kubernetes. I tried to avoid that because there are too many permutations that would need to be taken into account. You might be running your cluster in AWS, GCP, Azure, VMware on-prem, or somewhere else. Each infrastructure vendor is different, and each could fit into a course of its own. So we stayed away from experiments for specific hosting and infrastructure vendors. On the bright side, you probably saw in Chaos Toolkit documentation that there are infrastructure-specific plugins that we can use. If that’s not enough, you can always go beyond what plugins do by running processes (commands, scripts). Even better, you can contribute to the project by extending existing plugins or creating new ones.
Final thoughts
What you should not do, after all this, is assume that the examples that we explored should be used as they are. They shouldn’t. The exercises were aimed at teaching you how to think and how to go beyond the obvious cases. Ultimately, you should define experiments in your own way, and adapt the lessons learned to your specific needs. Your system is not the same as mine or anyone else’s. While we all do some of the things the same, many are particular to an organization. So, please don’t run the experiments as they are. Tailor them to your own needs.
You should always start small. Don’t do everything you can think of right away. Start with basics, gain confidence in what you’re doing, and then increase the scope. Don’t go crazy from day one. Don’t use CronJobs right away. Run the experiments as one-shot executions first. Make sure that they work, and then create CronJobs that will do the same things repeatedly. Confidence is the key. Make sure that everybody in your organization is aware of what you do. Make sure that your colleagues understand the outcomes of those experiments. The goal of chaos experiments is not for you to have fun. They are meant to help learn something and to improve our systems. To improve the system, you need to propagate the findings of the experiments throughout the whole organization.
How to reach out
That’s about it. Thank you so much. I hope you found the course useful.
Reach out to me if you need anything. Contact me on Slack, Twitter, email, or send courier pigeons. I’ll do my best to make myself available for any questions, doubts, or issues, and to help you out.
With that, I bid you farewell. It was exhilarating for us (Darin and me) to create this course. See you on a different course or a book. You might see me at a conference or a workshop. Most of the things I do are public. My job is to help people improve, and I hope that I accomplished that with this course.