AWS Kinesis Outage Affecting Many Organizations

Discover how a simple capacity addition triggered the massive AWS Kinesis outage and caused cascading failures across dependent services. Learn critical lessons about operational readiness, rigorous testing, and dependency management in complex distributed system design.

Amazon Kinesis ingests and processes real-time streaming data. Its front end handles authentication and throttling, and routes workloads to backend clusters using sharding. On November 25, 2020, a disruption in the US-EAST-1 region caused widespread outages across internet services and third-party applications.
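To make the routing step concrete, here is a minimal sketch of shard-based routing. It is not Kinesis's internal implementation. It assumes, as the public Kinesis documentation describes for partition keys, that a key is hashed with MD5 into a 128-bit integer and that each shard owns a contiguous range of that hash space. The shard map, the cluster names, and the function names are hypothetical and exist only for illustration.

```python
import hashlib
from bisect import bisect_right

# Hypothetical shard map: (first hash key of the range, backend cluster name).
# The 128-bit hash space is split into four equal ranges for illustration only.
HASH_SPACE = 2 ** 128
SHARD_MAP = [
    (0, "cluster-a"),
    (HASH_SPACE // 4, "cluster-b"),
    (HASH_SPACE // 2, "cluster-c"),
    (3 * HASH_SPACE // 4, "cluster-d"),
]
SHARD_STARTS = [start for start, _ in SHARD_MAP]


def hash_key(partition_key: str) -> int:
    """Map a partition key to a 128-bit integer using MD5."""
    digest = hashlib.md5(partition_key.encode("utf-8")).digest()
    return int.from_bytes(digest, "big")


def cluster_for_hash(value: int) -> str:
    """Return the cluster whose hash range contains the given value."""
    # bisect_right finds the first range start greater than value,
    # so the owning range is the one just before it.
    index = bisect_right(SHARD_STARTS, value) - 1
    return SHARD_MAP[index][1]


def route(partition_key: str) -> str:
    """Route a record to a backend cluster based on its partition key."""
    return cluster_for_hash(hash_key(partition_key))
```

A front-end server in this sketch can only route requests if it holds a usable copy of the shard map, which is why a failure to build that cached map matters so much in the events that follow.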

Sequence of events

  • Trigger: Amazon added capacity to the front-end server fleet between 2:44 a.m. and 3:47 a.m. PST.

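  • Root cause: Each front-end server creates an operating system thread for every other server in the fleet, so the new capacity pushed every server in the fleet past the maximum thread count allowed by the operating system configuration.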

  • System failure: Exceeding the thread limit caused cache construction to fail. Front-end servers ended up with useless ...