AWS Kinesis Outage Affecting Many Organizations
Learn about the causes of a major AWS Kinesis outage and how the failure was analyzed.
Amazon Kinesis lets users aggregate, process, and analyze real-time streaming data to get timely insights and react quickly to new information. It can continuously capture gigabytes of data per second from hundreds of thousands of sources. The Kinesis service’s front end handles authentication and throttling, and distributes workloads to its back-end “workhorse” cluster through a database mechanism called sharding. On November 25th, 2020, the Amazon Kinesis service was disrupted in the Northern Virginia (US-EAST-1) region, affecting thousands of third-party services. The failure was significant enough to take down a large portion of Internet services.
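To make the front end's routing role concrete, here is a minimal sketch of how a shard-map could send a request to a back-end. It is not Amazon's implementation: the names `build_shard_map` and `route`, the equal-sized hash ranges, and the back-end labels are all assumptions made for illustration. The only detail borrowed from Kinesis is the idea of hashing a partition key with MD5 into a 128-bit key space.

```python
import hashlib
from bisect import bisect_right

HASH_SPACE = 2 ** 128  # MD5 digests are 128 bits wide


def build_shard_map(backends):
    """Split the hash space into equal contiguous ranges, one per back-end.

    Hypothetical helper: returns (starts, backends), where starts[i] is the
    first hash value owned by backends[i].
    """
    step = HASH_SPACE // len(backends)
    starts = [i * step for i in range(len(backends))]
    return starts, list(backends)


def route(shard_map, partition_key):
    """Return the back-end that owns the hash of partition_key."""
    starts, backends = shard_map
    digest = hashlib.md5(partition_key.encode("utf-8")).digest()
    hashed = int.from_bytes(digest, "big")
    # Find the last range whose start is <= hashed.
    return backends[bisect_right(starts, hashed) - 1]


shard_map = build_shard_map(["backend-a", "backend-b", "backend-c"])
target = route(shard_map, "sensor-17")  # always the same back-end for this key
```

A front-end server that cannot build a complete map of this kind has no way to pick the right back-end, which is why a damaged shard-map stops request routing.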
Sequence of events
- According to Amazon, the event was triggered by the addition of a small amount of capacity to the Kinesis front-end fleet of servers, carried out from 2:44 AM PST to 3:47 AM PST.
- Further investigation showed that the new capacity caused every server in the fleet to exceed the maximum number of threads allowed by an operating system configuration.
- Because the thread limit was exceeded, cache construction failed to complete, and front-end servers ended up with useless shard-maps that left them unable to route requests to back-end clusters. ...
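The sequence above can be illustrated with a small model of why a modest capacity increase pushed every server over the limit at the same time. According to Amazon's post-event summary, each front-end server creates operating system threads for each of the other servers in the front-end fleet, so the thread count on every server grows with fleet size. The function names and all numbers below (`thread_limit`, `base_threads`, fleet sizes) are hypothetical and chosen only to show the arithmetic.

```python
def threads_per_server(fleet_size, base_threads):
    """Threads one front-end server needs: a fixed baseline plus one per peer."""
    return base_threads + (fleet_size - 1)


def exceeds_limit(fleet_size, base_threads, thread_limit):
    """True if a server in a fleet of this size would pass the OS thread limit."""
    return threads_per_server(fleet_size, base_threads) > thread_limit


def max_safe_fleet_size(base_threads, thread_limit):
    """Largest fleet size at which each server still fits under the limit."""
    return thread_limit - base_threads + 1


# Hypothetical configuration, not Amazon's real values.
THREAD_LIMIT = 10_000
BASE_THREADS = 200

before = exceeds_limit(9_801, BASE_THREADS, THREAD_LIMIT)  # fleet before scaling
after = exceeds_limit(9_802, BASE_THREADS, THREAD_LIMIT)   # one server added
```

Because the count depends only on fleet size, it is the same on every server. Under this model, crossing the threshold affects the whole fleet at once rather than just the newly added machines.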