The Simian Army

Learn about Chaos Monkey, chaos engineering, robustness, and opt-in and opt-out environments.

Chaos Monkey

Probably the best-known example of chaos engineering is Netflix’s Chaos Monkey. Every once in a while, the monkey wakes up, picks an autoscaling cluster, and kills one of its instances. The cluster should recover automatically. If it doesn’t, there’s a problem, and the team that owns the service has to fix it.
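The loop described above can be illustrated with a small simulation. This is only a sketch under stated assumptions, not Netflix's implementation: the `Cluster` class, its `heal` method (a stand-in for an autoscaler), and the `chaos_monkey` function are hypothetical names invented for this example.

```python
import random
from dataclasses import dataclass, field


@dataclass
class Cluster:
    """A toy autoscaling cluster (hypothetical; not a real cloud API)."""
    name: str
    desired: int  # number of instances the autoscaler tries to maintain
    instances: list = field(default_factory=list)
    _next_id: int = 0

    def __post_init__(self):
        # Start the cluster at its desired size.
        self.heal()

    def heal(self):
        """Autoscaler stand-in: launch instances until the desired count is met."""
        while len(self.instances) < self.desired:
            self.instances.append(f"{self.name}-{self._next_id}")
            self._next_id += 1

    def healthy(self):
        return len(self.instances) == self.desired


def chaos_monkey(clusters, rng):
    """Pick one cluster at random and kill one of its instances."""
    cluster = rng.choice(clusters)
    victim = rng.choice(cluster.instances)
    cluster.instances.remove(victim)
    return cluster, victim


if __name__ == "__main__":
    rng = random.Random()
    clusters = [Cluster("api", 3), Cluster("billing", 2)]

    cluster, victim = chaos_monkey(clusters, rng)
    print(f"killed {victim}; healthy={cluster.healthy()}")

    # In a real system, recovery happens on its own. Here we trigger it by hand.
    cluster.heal()
    print(f"after recovery: healthy={cluster.healthy()}")
```

In this sketch, recovery is a single method call. In production, the interesting question is whether that recovery actually happens without anyone stepping in, which is exactly what the monkey is there to test.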

The Chaos Monkey tool was born during Netflix’s migration to Amazon Web Services (AWS) cloud infrastructure and a microservice architecture. As services proliferated, engineers found that a growing number of components could jeopardize availability. Unless they found a way to make the whole service immune to component failures, they would be doomed. So every cluster needed to autoscale and recover from failure of ...