System Failure, Not Human Error

Explore the deeper causes of system failures by analyzing incidents where human error is a symptom, not the root cause. This lesson helps you understand how control plane tools and processes can fail humans, how repeated playbook use may hide risks, and why learning from both failures and near misses matters for building more resilient distributed systems.

Amazon outage

Amazon clearly states that:

"[a]n authorized S3 team member using an established playbook executed a command which was intended to remove a small number of servers for one of the S3 subsystems that is used by the S3 billing process. Unfortunately, one of the inputs to the command was entered incorrectly and a larger set of servers was removed than intended."

Parsing that a little, we can see that someone mistyped a command. First and foremost, whoever that was has our deepest sympathies. I ...