
Amazon system design case study: How Amazon scales for Prime Day

Fahim ul Haq
14 min read
Contents
Amazon Prime Day outage
Causes of the outage
Importance of scalability and availability
What are the optimal 9s of availability?
Isn’t Amazon Web Services (AWS) good enough?
An overview of Amazon’s e-commerce infrastructure
Techniques to ensure scalability and availability
Key takeaways from the Amazon outage
Conclusion

Imagine this: You’ve been eagerly counting down the days until Amazon Prime Day, waiting to catch that perfect deal. Finally, the moment arrives. You rush to the site... only to face error messages and pages that won’t load. Frustrating, right?

This was the reality for millions of shoppers during Amazon’s 2018 Prime Day outage. Let’s unpack what actually happened from a system design perspective.

Amazon Prime Day outage

Amazon experienced a significant outage during its Prime Day event in 2018. The outage lasted for several hours and had a noticeable impact on the shopping experience of millions of customers and, of course, on the business itself.

Impact: The main Amazon website and mobile app were affected, with users reporting issues accessing product pages, completing purchases, and loading the home page. According to the report, Amazon lost over $100 million in sales during the downtime — an estimated $1.2 million per minute.

Amazon's main page on Prime Day 2018

Causes of the outage

While Amazon has not disclosed all the technical details, we can analyze the causes using industry knowledge and engineering principles, based on a CNBC report that points to the following key factors:

  1. The first and most important factor was a higher-than-expected spike in traffic, which exceeded Amazon’s capacity despite its preparation. This is evident from the steps Amazon took, which included falling back to a simpler frontend page and blocking international traffic.

  2. Though Amazon uses a robust microservices architecture, it likely faced issues where service dependencies caused cascading failures. Similar problems were reported for other Amazon services, such as Twitch, Alexa, and Prime Now.

  3. While Amazon uses autoscaling to handle surges in traffic, there were delays in scaling up resources due to configuration issues with Amazon’s systems and services. This forced Amazon to manually deploy servers to meet the traffic load, which cost time and money.

The CNBC report indicates that the surge in traffic was one of the main factors that made the service unavailable: the system received approximately 63.5 million requests per second on its storage and computational service (Sable).

In this blog, I’ll explore the technical intricacies of scaling for traffic during high-demand periods, using Amazon as our case study. We'll break down the strategies and approaches that ensure a smooth service, no matter how many people search for products, click "add to cart", or place orders. Whether you’re a tech enthusiast, an e-commerce entrepreneur, or just someone who wants to understand what happens behind the scenes during these mega-sale events, this journey into Amazon’s world of scalability will be enlightening.

Let’s explore why availability and scalability are bound together in such systems.

Importance of scalability and availability

Scalability is the secret sauce allowing companies like Amazon to handle massive traffic spikes without breaking a sweat. It’s about having the infrastructure to flexibly expand and contract to meet demand. The 2018 Prime Day outage wasn’t just a hiccup; it was a wake-up call that even the best in the business can stumble. But many people don’t realize that ensuring scalability isn’t just about handling spikes in traffic; it’s intrinsically linked to availability.

Let’s look at how Amazon processes a user’s request. A single API request, such as purchasing an item or performing a search, fans out to multiple backend services like inventory, payment, shipping, index, and recommendation. Each service must handle part of the request to fulfill it successfully. When scaling, all these backend services need to be scaled together to manage increased load effectively, as seen in the purchasing and search examples below:

A depiction of fanning out requests to multiple subservices
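To make the fan-out pattern concrete, here is a minimal sketch in Python with hypothetical service stubs (not Amazon’s actual APIs). It shows a purchase request that calls several backend services concurrently and succeeds only if all of them succeed, which is why all of them must scale together:

```python
import asyncio

# Hypothetical backend calls; each stands in for a network request to a
# separate microservice (inventory, payment, shipping). Latencies are simulated.
async def reserve_inventory(item_id: str) -> bool:
    await asyncio.sleep(0.05)
    return True

async def charge_payment(user_id: str, amount_cents: int) -> bool:
    await asyncio.sleep(0.08)
    return True

async def schedule_shipping(item_id: str, address: str) -> bool:
    await asyncio.sleep(0.06)
    return True

async def purchase(user_id: str, item_id: str, address: str) -> bool:
    # One incoming API request fans out to several services; every one of
    # them must succeed (and therefore must be scaled) for the request to succeed.
    results = await asyncio.gather(
        reserve_inventory(item_id),
        charge_payment(user_id, amount_cents=4999),
        schedule_shipping(item_id, address),
    )
    return all(results)

print(asyncio.run(purchase("cust-1", "item-42", "410 Terry Ave N")))
```

If any one of the fanned-out services is under-provisioned, the whole request fails or slows down, no matter how well the others are scaled.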

Availability, keeping the service up and running, is the backbone of any online business. A few minutes of downtime can translate into millions of dollars in lost sales, not to mention a damaged reputation. Availability is essential to a service’s success and to a seamless user experience: it means users can access the service and perform the intended operations whenever they want. In service level agreements (SLAs), availability is defined as a percentage of uptime, usually expressed in 9s, such as 99.999% uptime.

When we scale a system to handle a surge in demand, it is also crucial to ensure service availability. No matter how much we scale, if we don’t have replicas or failover mechanisms to ensure availability, the service will likely fail, and vice versa.

What are the optimal 9s of availability?

When deploying services, an uptime of 99.999% (five 9s) is often treated as the gold standard. It allows for only 0.001% downtime. Let’s look at some calculations to see what that entails:

A downtime of 0.001% equals less than 6 minutes per year: the service will be down for only about 5.26 minutes in an entire year. That doesn’t sound like much, but for a service like Amazon, a minute of downtime can cost around $1.2 million in sales, especially on events like Prime Day.

Let’s estimate the revenue of Amazon Prime Day for 2024 by looking at their sales history. Take a look at the graph below:

The expected revenue of Amazon in 2024 (in green).

Note: The expected revenue of Amazon Prime Day 2024 is based on the average annual revenue difference ($1.43 billion) from the above data.

Based on the expected revenue of 2024, Amazon will make approximately $5 million per minute in sales. Now, let’s see how the 9s of availability affect the downtime and, hence, the loss in revenue of Amazon Prime Day:

| Nines (9s) of availability | Availability | Downtime per year | Downtime per month | Downtime per day | Revenue loss per day |
|---|---|---|---|---|---|
| 1 nine | 90% | 36.5 days | 72 hours | 2.4 hours | $720 million |
| 2 nines | 99.0% | 3.65 days | 7.20 hours | 14.4 minutes | $72 million |
| 3 nines | 99.9% | 8.76 hours | 43.8 minutes | 1.46 minutes | $7.3 million |
| 4 nines | 99.99% | 52.56 minutes | 4.32 minutes | 8.64 seconds | $0.72 million |
| 5 nines | 99.999% | 5.26 minutes | 25.9 seconds | 0.86 seconds | $0.072 million |

The back-of-the-envelope calculations above estimate the loss if Amazon faces downtime on the next Prime Day (2024).
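If you want to reproduce these figures, here is a quick back-of-the-envelope sketch in Python. It assumes the roughly $5 million per minute Prime Day sales estimate from above; small rounding differences from the table are expected.

```python
# Back-of-the-envelope: downtime and revenue loss per level of availability.
# Assumes ~$5 million per minute in Prime Day sales, as estimated above.
MINUTES_PER_YEAR = 365 * 24 * 60
REVENUE_PER_MINUTE = 5_000_000  # USD (assumed)

for nines in range(1, 6):
    availability = 1 - 10 ** (-nines)                         # e.g., 3 nines -> 0.999
    downtime_fraction = 1 - availability
    downtime_per_year = downtime_fraction * MINUTES_PER_YEAR  # minutes
    downtime_per_day = downtime_fraction * 24 * 60            # minutes
    loss_per_day = downtime_per_day * REVENUE_PER_MINUTE
    print(f"{nines} nine(s) ({availability:.3%}): "
          f"{downtime_per_year:,.2f} min/year down, "
          f"~${loss_per_day / 1e6:.2f}M lost per Prime Day")
```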

The problem with achieving five or six nines of availability is that it costs a lot of money and effort, and for most services the returns don’t justify the investment. The good news is that you don’t need five or six nines of availability every day of the year; it’s high-stakes days like Amazon Prime Day that demand them.


Isn’t Amazon Web Services (AWS) good enough?

We know AWS (Amazon’s cloud platform) is a powerhouse of tools and services designed to keep applications up and running. It provides everything from autoscaling to Elastic Load Balancing (ELB) to absorb the kind of surges Amazon saw on Prime Day. Yet despite these tools, Amazon still faced significant downtime.

I had a similar thought after the outage, but when I focused on the real causes, I realized that the culprit often isn’t the tools themselves but how they are configured and utilized. The real challenge lies in employing these tools correctly and efficiently. It’s a reminder that even with AWS’s capabilities, meticulous attention to detail in setup and management is crucial. Engineers need to ensure that every aspect of the infrastructure is finely tuned and that there’s a robust plan for handling unexpected issues.

Let’s understand the design of the Amazon system by focusing on availability and scalability.

An overview of Amazon’s e-commerce infrastructure

Amazon started as an online bookstore and is now one of the biggest e-commerce platforms, selling a vast range of products. Let’s see how the Amazon e-commerce platform works for selling and purchasing goods.

The Amazon system needs to be highly scalable and available to handle millions of transactions per day during events such as Amazon Prime Day. Yes, you guessed it, the system should be performant and consistent as well, but for now we’ll treat those as secondary metrics; the primary metrics here are scalability and availability. Let’s walk through the Amazon backend infrastructure shown in the following high-level design:

A high-level design of Amazon’s e-commerce system

Initially, users visit the homepage and search for an item of interest. The search service uses Elasticsearch for several key advantages, such as performance, scalability, and full-text search features. Behind the homepage sits a recommendation service that suggests products based on each user’s purchase and search history; new users are shown trending products instead. Once users select items and add them to the shopping cart, the cart service is triggered at the backend and stores the items in the relevant databases.
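To illustrate what such a search call might look like, here is a minimal sketch using the official Elasticsearch Python client; the index name, field names, and boosting are hypothetical placeholders, not Amazon’s actual search schema:

```python
from elasticsearch import Elasticsearch

# Hypothetical product index and cluster endpoint, assumed for illustration.
es = Elasticsearch("http://localhost:9200")

response = es.search(
    index="products",
    query={
        "multi_match": {
            "query": "wireless headphones",
            "fields": ["title^3", "description", "brand"],  # boost title matches
        }
    },
    size=10,
)

for hit in response["hits"]["hits"]:
    print(hit["_score"], hit["_source"]["title"])
```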

The order service handles incoming orders and stores detailed information about each order, including customer and product data, in the databases. It is also responsible for capturing the delivery address and related information. When the user pays for the items, the purchase order is invoked, which interacts with the payment gateway, which in turn charges the due amount to the customer’s credit card or account. A pub-sub system decouples the various services and lets them communicate asynchronously.
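The pub-sub layer can be sketched with any message broker. Below is a minimal example using Amazon SNS via boto3, where the topic ARN and event schema are hypothetical placeholders rather than Amazon’s internal setup:

```python
import json
import boto3

# Publish an order event to a pub-sub topic so downstream services can react
# asynchronously. Topic ARN and message shape are assumed for illustration.
sns = boto3.client("sns", region_name="us-east-1")

order_event = {
    "order_id": "ord-12345",
    "customer_id": "cust-6789",
    "total_cents": 4999,
    "status": "PAYMENT_CONFIRMED",
}

sns.publish(
    TopicArn="arn:aws:sns:us-east-1:123456789012:order-events",  # hypothetical
    Message=json.dumps(order_event),
    MessageAttributes={
        "event_type": {"DataType": "String", "StringValue": "order.confirmed"}
    },
)
# Shipping, notification, and analytics services subscribe to the topic and
# process the event on their own schedule, so the order service never blocks on them.
```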

Techniques to ensure scalability and availability

Let’s look at some techniques that can be adopted to ensure the scalability and availability of the system. The most prominent techniques are:

  • Database replication, distributed caches, and backups: Database replication techniques ensure data availability and significantly enhance scalability. Using the right techniques to store and retrieve data is essential to maintaining seamless service continuity. This includes replicating data across multiple databases and regions, partitioning data according to application needs, and maintaining different storage solutions for different data types. Techniques like distributed caching enhance performance by storing frequently accessed data closer to where it is needed, efficiently managing a large volume of incoming requests. Additionally, regular backups are crucial for disaster recovery. While these strategies require cost and effort, they are essential for keeping the system reliable and ready for unexpected events (a minimal cache-aside sketch follows this list).

  • Load balancing and autoscaling groups: This is a no-brainer. Implementing load balancing involves distributing incoming traffic among targets like containers, servers, and data centers. This prevents overwhelming a single server and allows another to step in if needed. Similarly, autoscaling groups let a system adjust active instances based on demand, leading to enhanced availability, performance, and user experience.

  • Redundancy of services: By duplicating critical services across multiple zones and regions, a system can ensure that even if a service in one region fails, another can take over seamlessly. This redundancy minimizes downtime and latency, handles large requests, and maintains the platform’s reliability, providing a consistent user experience.

  • Content Delivery Networks (CDNs): Utilize CDNs to deliver content to users with low latency and high transfer speeds. By caching content at edge locations worldwide, they ensure that users receive data from the nearest server, reducing load times and improving the overall user experience. It also reduces the burden on the origin servers. CDNs are essential for efficiently handling large traffic and improving availability.

  • Monitoring and auto-recovery mechanisms: Effective monitoring ensures a system’s health and performance. Utilizing tools for real-time monitoring of system metrics (such as response time, resource utilization, access logs, temperature, security events, etc.) and setting up automated alerts helps engineering teams promptly address potential issues. Implementing auto-recovery processes, such as restarting failed instances or triggering automatic failovers, is essential for maintaining service continuity and minimizing downtime. These techniques are vital for proactive system management and ensuring reliable operation under varying conditions.

  • Testing: Thorough load tests simulating peak traffic conditions are crucial. These tests should account for the worst-case scenarios, ensuring all systems can handle extreme loads.
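As referenced in the first bullet above, here is a minimal cache-aside sketch using Redis via the redis-py client; the key scheme, TTL, and the placeholder database call are assumptions for illustration, not Amazon’s actual caching setup:

```python
import json
import redis  # assumes the redis-py package and a reachable Redis instance

cache = redis.Redis(host="localhost", port=6379, decode_responses=True)
PRODUCT_TTL_SECONDS = 300  # assumed freshness window for product data

def fetch_product_from_db(product_id: str) -> dict:
    # Placeholder for the real database query.
    return {"id": product_id, "title": "Wireless Headphones", "price_cents": 4999}

def get_product(product_id: str) -> dict:
    key = f"product:{product_id}"
    cached = cache.get(key)
    if cached is not None:                          # cache hit: skip the database
        return json.loads(cached)
    product = fetch_product_from_db(product_id)     # cache miss: go to the database
    cache.set(key, json.dumps(product), ex=PRODUCT_TTL_SECONDS)
    return product
```

Because hot product pages are served from the cache, the databases see far fewer reads during a traffic spike, which is exactly what you want on an event like Prime Day.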

You might think you can get by without the techniques above, and maybe you can—until that one bad day strikes. Being prepared can make all the difference. Trust me, you’ll be grateful you invested in these strategies when that day comes.

In reality, ensuring a seamless experience, even on big days, requires a lot of engineering brilliance and commitment to consistent improvement. That’s only the first half of the problem. What happens if there is a failure? That’s the second half! You need standby engineering teams on high alert with contingency plan(s) to ensure you recover quickly and successfully.

Let’s see what techniques Amazon uses to ensure scalability and availability in the following table:

Strategies for Scalability and Availability

Amazon’s Techniques and Services



Database replication and backups

  • Amazon RDS for Multi-AZ (multi-availability zone) deployments
  • Amazon DynamoDB for fully managed multi-region, multi-active data replication
  • Amazon Aurora for synchronous replication within a region (high availability) and asynchronous replication for cross-region data redundancy
  • Amazon DocumentDB for replication across multiple availability zones

Distributed cache

  • Amazon ElastiCache is the distributed cache solution used for high performance, scalability, and availability



Load balancing and autoscaling

  • Elastic Load Balancing (ELB) for distributing different types of traffic across various services
  • Amazon EC2 Auto Scaling for automatically adjusting the number of EC2 instances
  • Amazon ECS and EKS autoscaling for containerized applications

Content Delivery Networks (CDNs)

  • Amazon CloudFront for placing the content near users



Monitoring and auto-recovery mechanisms

  • Amazon CloudWatch for monitoring and observing the services
  • Amazon EC2 Auto Recovery enables automatic recovery of EC2 instances
  • AWS Systems Manager provides a unified user interface for managing AWS resources
  • AWS Health provides personalized alerts and recovery guidance in the event of failure

Testing

  • AWS GameDay simulates failures to prepare for unforeseen situations
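To make the autoscaling row in the table above concrete, here is a minimal sketch of a target-tracking scaling policy using boto3; the Auto Scaling group name, target value, and region are illustrative assumptions, not Amazon’s production configuration:

```python
import boto3

# Attach a target-tracking policy to an existing Auto Scaling group so the
# fleet grows and shrinks to keep average CPU utilization near 50%.
autoscaling = boto3.client("autoscaling", region_name="us-east-1")

autoscaling.put_scaling_policy(
    AutoScalingGroupName="prime-day-web-fleet",   # hypothetical group name
    PolicyName="keep-cpu-near-50-percent",
    PolicyType="TargetTrackingScaling",
    TargetTrackingConfiguration={
        "PredefinedMetricSpecification": {
            "PredefinedMetricType": "ASGAverageCPUUtilization"
        },
        "TargetValue": 50.0,  # scale out/in to hold average CPU around 50%
    },
)
```

The point of target tracking is that scaling reacts to load automatically; the 2018 outage shows what happens when that reaction is delayed or misconfigured and servers have to be added by hand.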

Key takeaways from the Amazon outage

Amazon’s 2018 Prime Day outage highlights several important lessons for managing high-demand periods for engineering teams and tech enthusiasts:

  • Building a perfect system is impossible, but there is nothing wrong with striving for one. How do you get close? Make your design as simple as possible and evolve it slowly to avoid needless complexity.

  • Don’t rush to blame the configuration management or load testing teams; systems inevitably fail, even at tech giants like Amazon, Meta, and Google. In fact, if you visit independent services like Downdetector, you may find your favorite application struggling to provide service in some part of the world right now. What matters is preparing for such incidents: the mitigation techniques you employ before they happen, the contingency plan you follow after they occur, and the lessons you take from each failure.

  • Faulty capacity estimation is one of the main reasons why systems fail. You design systems and prepare for the traffic you estimate. A faulty estimation leads to a faulty design. Therefore, always base your math on meaningful assumptions and intelligent guesses to make informed design decisions.

How do you estimate resources, and what kinds of resources do you typically estimate? Those are good questions, but for another day. If you are curious, you can find answers in the back-of-the-envelope calculation section of the Grokking Modern System Design course.

Conclusion

A system like Amazon needs to analyze user behavior and usage patterns and predict traffic spikes. This can be achieved via advanced analytics, where metrics such as peak viewing times, popular content, and regional user distribution are continuously monitored. Furthermore, traffic patterns can be predicted through big data analytics, preparing the system for potential surges. At the same time, monitoring systems provide insight into current usage and enable better resource management.

To handle peak loads effectively, it’s essential to introduce autoscaling systems that adjust the number of servers according to demand. AWS provides autoscaling groups that automatically add or remove instances based on fluctuations in traffic. Combining predictive analytics with reactive scaling and load balancing enables the system to manage peak loads without sacrificing performance.

To keep the learning going, I recommend this popular system design course:

Start Grokking Modern System Design Today

Grokking the Modern System Design Interview

System Design interviews are now part of every Engineering and Product Management Interview. Interviewers want candidates to exhibit their technical knowledge of core building blocks and the rationale of their design approach. This course presents carefully selected system design problems with detailed solutions that will enable you to handle complex scalability scenarios during an interview or designing new products. You will start with learning a bottom-up approach to designing scalable systems. First, you’ll learn about the building blocks of modern systems, with each component being a completely scalable application in itself. You'll then explore the RESHADED framework for architecting web-scale applications by determining requirements, constraints, and assumptions before diving into a step-by-step design process. Finally, you'll design several popular services by using these modular building blocks in unique combinations, and learn how to evaluate your design.

26 hrs · Intermediate · 5 Playgrounds · 18 Quizzes