Causes of the outage
Although Amazon has not disclosed all the technical details, a CNBC report, read alongside industry knowledge and engineering principles, points to the following key factors behind the failure:
The first and most important factor was a higher-than-expected spike in traffic, which exceeded Amazon’s capacity despite its preparation. This is evident from the steps Amazon took, which included falling back to a simpler frontend page and blocking international traffic.
Although Amazon used a robust microservices architecture, dependencies between services might have caused cascading failures. Issues were also reported for other Amazon services, such as Twitch, Alexa, and Prime Now.
While Amazon uses autoscaling to handle surges in traffic, there were delays in scaling up resources due to configuration issues with Amazon’s systems and services. This forced Amazon to manually deploy servers to meet the traffic load, which cost time and money.
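To make the cost of slow scale-up concrete, here is a minimal simulation sketch of a reactive autoscaler whose new servers only come online after a delay. It is not Amazon's autoscaling logic: the function name, the per-server throughput, the traffic numbers, and the delay values are all hypothetical, chosen only to illustrate the effect.

```python
from dataclasses import dataclass


@dataclass
class TickResult:
    tick: int
    demand: int      # requests arriving in this tick
    capacity: int    # requests the fleet can serve in this tick
    dropped: int     # requests that could not be served


def simulate_autoscaling(demand_per_tick, initial_servers,
                         rps_per_server, provision_delay_ticks):
    """Simulate a scale-up-only autoscaler (hypothetical model).

    Servers ordered at tick t start serving at tick t + provision_delay_ticks.
    """
    if provision_delay_ticks < 1:
        raise ValueError("provision_delay_ticks must be at least 1")

    servers = initial_servers
    pending = {}  # tick when servers come online -> number of servers
    results = []

    for tick, demand in enumerate(demand_per_tick):
        # Servers ordered earlier become available now.
        servers += pending.pop(tick, 0)

        capacity = servers * rps_per_server
        dropped = max(0, demand - capacity)
        results.append(TickResult(tick, demand, capacity, dropped))

        # Reactive policy: order enough servers to cover current demand,
        # counting servers that are already on their way.
        needed = -(-demand // rps_per_server)  # ceiling division
        shortfall = needed - servers - sum(pending.values())
        if shortfall > 0:
            ready_at = tick + provision_delay_ticks
            pending[ready_at] = pending.get(ready_at, 0) + shortfall

    return results


# Hypothetical traffic: steady load, then a sudden fivefold spike.
traffic = [100, 100, 500, 500, 500, 500]
for delay in (1, 2):
    run = simulate_autoscaling(traffic, initial_servers=1,
                               rps_per_server=100,
                               provision_delay_ticks=delay)
    print(f"delay={delay} ticks, dropped={sum(r.dropped for r in run)}")
```

In this toy model, every extra tick of provisioning delay during the spike is a tick in which the whole shortfall is dropped, which is why slow or misconfigured scale-up hurts most exactly when traffic peaks.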
According to the CNBC report, the traffic surge was one of the main factors Amazon couldn’t handle, and it made the service unavailable. Sable, Amazon’s storage and computation service, received approximately 63.5 million requests per second.
In this blog, I’ll explore the technical intricacies of scaling for traffic during high-demand periods, using Amazon as our case study. We'll break down the strategies and approaches that ensure a smooth service, no matter how many people search for products, click "add to cart", or place orders. Whether you’re a tech enthusiast, an e-commerce entrepreneur, or just someone who wants to understand what happens behind the scenes during these mega-sale events, this journey into Amazon’s world of scalability will be enlightening.
Let’s start with why availability is so closely tied to scalability.
Importance of scalability and availability
Scalability is the secret sauce allowing companies like Amazon to handle massive traffic spikes without breaking a sweat. It’s about having the infrastructure to flexibly expand and contract to meet demand. The 2018 Prime Day outage wasn’t just a hiccup; it was a wake-up call that even the best in the business can stumble. But many people don’t realize that ensuring scalability isn’t just about handling spikes in traffic; it’s intrinsically linked to availability.
Let’s look at how Amazon processes a user’s request. A single API request, such as purchasing an item or performing a search, fans out to multiple backend services like inventory, payment, shipping, index, and recommendation. Each service must handle part of the request to fulfill it successfully. When scaling, all these backend services need to be scaled together to manage increased load effectively, as seen in the purchasing and search examples below:
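The fan-out can also be sketched in a few lines of Python. This is an illustrative model, not Amazon's actual request path: the grouping of services into purchase and search requests, the `call_service` and `handle_request` helpers, and the availability figure are assumptions made for the example. The last function additionally assumes that services fail independently, which real systems often do not.

```python
import asyncio

# Assumed grouping of the backend services named in the text.
PURCHASE_SERVICES = ["inventory", "payment", "shipping"]
SEARCH_SERVICES = ["index", "recommendation"]


async def call_service(name, healthy):
    """Stand-in for a network call to one backend service (hypothetical)."""
    await asyncio.sleep(0)  # yield control, as a real I/O call would
    if name not in healthy:
        raise RuntimeError(f"{name} unavailable")
    return f"{name}: ok"


async def handle_request(services, healthy):
    """Fan one API request out to every backend service it depends on.

    The request succeeds only if all of the downstream calls succeed.
    """
    results = await asyncio.gather(
        *(call_service(s, healthy) for s in services),
        return_exceptions=True,
    )
    ok = not any(isinstance(r, Exception) for r in results)
    return ok, results


def end_to_end_availability(per_service_availability, n_services):
    """Availability of a request that needs all n services to respond,
    assuming independent failures."""
    return per_service_availability ** n_services


all_up = {"inventory", "payment", "shipping", "index", "recommendation"}
print(asyncio.run(handle_request(PURCHASE_SERVICES, all_up))[0])
print(asyncio.run(handle_request(PURCHASE_SERVICES, all_up - {"payment"}))[0])
print(end_to_end_availability(0.999, len(PURCHASE_SERVICES)))
```

The takeaway is that a single overloaded dependency is enough to fail the whole request, so scaling only some of the services behind an API leaves the request as a whole no more available than its weakest dependency.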