System Failure, Not Human Error

Learn about the Amazon outage, the consequences of human error, and how to interpret anomalies.

We'll cover the following...

Amazon outage
Human error
Observing anomalies

Amazon outage

Amazon clearly states that:

"[a]n authorized S3 team member using an established playbook executed a command which was intended to remove a small number of servers for one of the S3 subsystems that is used by the S3 billing process. Unfortunately, one of the inputs to the command was entered incorrectly and a larger set of servers was removed than intended."

Parsing that just a little bit, we can understand that someone mistyped a command. First and foremost, whoever that was has our deepest ...
