When I first started working on Meta’s distributed data store, I quickly realized that the biggest challenge wasn’t managing the vast amount of data—it was the relentless battle against latency. The struggle to reduce latency became a central focus of our work, pushing us to be more creative and think strategically about the system’s architecture—we’ll talk more about it later.
In System Design, latency is the delay between a request and a response from the system.
As software engineers, we all know that latency can make or break user experience—every millisecond matters. Keeping it low is an ongoing effort that requires constant innovation and optimization. Understanding how to achieve low latency in System Design can be your secret to stealing the show during your System Design Interview, especially at FAANG.
Let’s explore some of the best practices for achieving low latency in systems, drawing from both personal experiences and industry-proven techniques.
We usually measure it in units of time, such as milliseconds (ms), and have the following main types to consider:
Network latency: The time a data packet takes to travel in a network from source to destination. It includes transmission, propagation, node processing, and queueing delays.
Application latency: The delay at the application server, including processing time, querying the database, time to perform any other computational tasks, etc.
Read or write latency: The delay in reading or writing data from or to storage devices such as disk or memory. It includes data seek time and transfer time.
When we talk about latency, we’re focusing on the following questions:
How quickly is the system acting on requests and sending responses back?
The answer is as quickly as possible.
How does the increasing number of requests affect the latency of the system?
The answer is that, ideally, it shouldn’t: latency should stay stable even as the number of requests grows.
We have to ensure that latency remains as low as possible to keep bounce rates down.
First, let’s give an overview from users’ perspectives: Does latency matter to users?
Google found through an experiment that a 200 ms delay in returning search results led to 0.22% fewer searches in the first three weeks and 0.36% fewer in the second three weeks. A 400 ms delay reduced searches by 0.44% in the first three weeks and 0.76% in the second three weeks. A 500 ms delay dropped traffic by 20%.
Similar findings have been reported across the industry: even small increases in latency measurably reduce user engagement.
Even after resolving the latency issues, winning back users’ trust and engagement takes time and effort. For that, we, as software engineers, should focus on low-latency systems during the design phase based on some threshold values for different applications.
The faster a web page loads, the higher the chances of conversion. This means a user is highly likely to perform the intended operation if a targeted page loads quickly.
A load time of 2.4 seconds yields a 1.9% conversion rate.
A load time of 3.3 seconds yields a 1.5% conversion rate.
A load time of 4.2 seconds yields a <1% conversion rate.
A load time of 5.7+ seconds yields a <0.6% conversion rate.
According to studies, 47% of customers expect a page load time of less than 2 seconds. So, the conversion rate is the first threshold to consider.
The appropriate response time depends on specific use cases. In general, a system is efficient if its average response time is between 0.1 and 1 second. Also, on average, 100 ms of response time is effective for real-time applications such as gaming, chatting, live-streaming, etc.
Note: Educative's course, Grokking the API Design Interview, can help you understand how to achieve these optimal numbers in back-of-the-envelope calculations for latency.
Reduced website or application traffic due to high latency is a significant problem. As software engineers, we know latency isn’t just about speed—it’s about delivering a smooth, responsive experience. To tackle this problem head-on, we need to apply some best practices that can help us minimize latency and keep our systems running quickly and efficiently.
Let’s dive into the key principles of low latency systems to keep in mind while coming up with techniques to lower latency.
The following are the key principles of low-latency System Design:
Minimizing data processing time: This involves optimizing the processing of data within a system to achieve faster computation and response times. Techniques to achieve minimum data processing time include efficient algorithms, parallel processing, and optimizing data storage and retrieval mechanisms.
Reducing network round-trip times: This focuses on decreasing the time it takes for data to travel between different points in a network. Strategies include minimizing the number of network interactions/hops, using efficient communication protocols, and leveraging event-driven architecture to reduce latency.
Efficient resource management: This entails effectively allocating and utilizing computational resources such as CPU, memory, and storage within a system. Techniques include load balancing, partitioning data across nodes (sharding), and implementing caching strategies to optimize performance and lower the latency.
By now, we know that latency matters, and as software engineers, we should focus on solutions that help us achieve low latency. By understanding and applying techniques for these principles, we can significantly improve system performance. So, let’s explore strategies to help us achieve lower latency and more responsive applications.
Ready? Let’s get into it!
Achieving low latency in System Design requires careful consideration of the system architecture. Imagine you’re building a pizza shop in a city with two architectural choices.
The first one is like operating from a single, centralized location. It’s straightforward to manage, but as demand grows, deliveries to customers across the city get slower and slower. That’s the monolithic approach.
The second option is to set up distributed pizza shop branches across the city, each operating independently and delivering to nearby customers. That’s the microservices approach.
We need to make a critical decision between monolithic and microservice architecture. Monolithic applications tend to have higher latency due to the strong interdependencies of components. Microservices tend to have lower latency because they are designed with modular, independent components that can scale and respond more efficiently to requests.
In today’s software landscape, purely monolithic architecture struggles to meet the scalability and latency demands of complex applications, and many teams are moving away from it.
For a deep dive into microservices, check out the article Why Use Microservices? The author explains the architectural insights and provides a comparative analysis.
Event-driven architecture is an approach that centers data flow around events and enables applications to act or react to those events. A client receives data whenever an event is triggered in the application, without having to request it from the server. This approach helps eliminate the constant request-response polling between client and server.
With this pattern, data travels one way whenever an event occurs, which cuts the round-trip time in half and lowers latency. Moreover, asynchronous communication is another factor that helps reduce latency. It benefits applications needing real-time updates or processing, such as trading platforms, gaming servers, real-time analytics systems, etc.
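As a minimal sketch of this idea, assuming a single process and an in-memory queue (a real system would use a message broker or WebSockets), here is an event bus in Python's asyncio where a handler reacts to events as they are published rather than polling for them:

```python
import asyncio

# A minimal event bus: producers publish events, a subscriber reacts when
# events arrive, with no request/response round trip in between.
class EventBus:
    def __init__(self):
        self.queue = asyncio.Queue()

    async def publish(self, event: dict):
        await self.queue.put(event)

    async def subscribe(self, handler):
        while True:
            event = await self.queue.get()
            await handler(event)

async def notify_client(event: dict):
    # Illustrative handler: push the update to a connected client.
    print(f"pushing update to client: {event}")

async def main():
    bus = EventBus()
    consumer = asyncio.create_task(bus.subscribe(notify_client))
    # The producer emits events as they happen; nothing polls for them.
    await bus.publish({"type": "order_created", "id": 42})
    await bus.publish({"type": "order_shipped", "id": 42})
    await asyncio.sleep(0.1)  # give the consumer time to drain the queue
    consumer.cancel()

asyncio.run(main())
```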
Efficient database management and data access are critical for low latency:
We must choose the right database according to the nature of the data to be processed. SQL databases are great choices for structured data and complex queries such as customer information, orders, and product details. NoSQL databases are suitable for faster performance and flexible data models, such as social media posts, user comments, and real-time analytics.
After choosing the right database, we should optimize data retrieval. First, we can index our data and optimize queries to reduce execution time. Second, we can shard and replicate the database, allowing the system to scale and quickly retrieve data.
Sharding: Splitting data across multiple databases or servers to distribute the load reduces latency by allowing parallel data access and processing.
Replication: Creating copies of data across multiple servers to ensure high availability and faster access, thereby reducing latency by serving queries from the closest or least loaded replica.
We can also use in-memory databases like Redis or Memcached as a distributed cache to store frequently accessed data and reduce disk access times.
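The sketch below combines these ideas. It assumes the redis-py client, hypothetical shard hostnames, and a placeholder query function, and shows a read path that checks a distributed cache first and falls back to the owning shard only on a miss:

```python
import hashlib
import json
import redis  # assumes the redis-py client and a running Redis instance

cache = redis.Redis(host="localhost", port=6379)
SHARDS = ["db-shard-0.internal", "db-shard-1.internal", "db-shard-2.internal"]  # hypothetical

def shard_for(key: str) -> str:
    """Stable hash-based routing: the same key always maps to the same shard."""
    digest = hashlib.md5(key.encode()).hexdigest()
    return SHARDS[int(digest, 16) % len(SHARDS)]

def query_shard(shard: str, user_id: int) -> dict:
    # Placeholder for the real (slower) database query against `shard`.
    return {"id": user_id, "name": "Alice", "shard": shard}

def get_user(user_id: int) -> dict:
    key = f"user:{user_id}"
    cached = cache.get(key)
    if cached is not None:                       # cache hit: no database round trip
        return json.loads(cached)
    user = query_shard(shard_for(key), user_id)  # cache miss: read the owning shard
    cache.setex(key, 300, json.dumps(user))      # keep it warm for 5 minutes
    return user
```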
Network design also plays a vital role in minimizing network latency:
Minimizing network hops: The fewer stops (network hops) data has to make, the faster it gets to where it needs to go. Think of it like a direct flight vs. a flight with multiple layovers. This is possible by using techniques such as the following:
Content delivery network (CDN): CDNs bring content closer to users, which speeds up data delivery. For example, a user in New York accessing a website hosted in California would experience lower latency if the content is served from a CDN node in New York.
Geographical distribution: Place CDN servers in different regions so that content is always served from a nearby location. For example, Netflix distributes videos to different geographical locations to minimize latency.
Edge caching: Store frequently accessed content at the edge of the network, right where users are.
Dynamic content acceleration: Use CDNs that speed up the delivery of dynamic (changing) content by optimizing how quickly data is fetched and delivered.
Load balancer: A load balancer is a component that distributes the load across available servers to avoid overloading any single server. When the load is balanced, a server is always ready to handle a new request quickly, reducing the time requests wait to be processed and hence lowering latency. We can opt for the following to balance the load:
Round-robin load balancing: It rotates requests among servers in a fixed order. For example, with four servers, the first request goes to server A, the second to server B, and so on; after server D, the cycle starts again at A.
Application load balancers: These are advanced load balancers that can make intelligent routing decisions based on request content and server load. For example, the AWS Elastic Load Balancer (ELB) can distribute incoming requests to the least busy servers.
Geographic load balancing: It directs users to the server closest to them to cut down on data travel time. For example, a user in Europe is directed to a European server instead of one in the northern US.
Least connections algorithm: This algorithm directs traffic to the server with the fewest active connections or requests. Session persistence load balancing is also useful; it keeps a user’s session on the same server by routing all of that user’s requests there. A minimal sketch of the round-robin and least connections strategies follows this list.
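Here is that sketch. The server names and counters are illustrative, and a real load balancer would also track health checks and decrement counts when requests complete:

```python
import itertools

SERVERS = ["server-a", "server-b", "server-c", "server-d"]  # hypothetical backends

# Round robin: hand requests to servers in a fixed rotation.
_rotation = itertools.cycle(SERVERS)

def round_robin() -> str:
    return next(_rotation)

# Least connections: track active requests and pick the least busy server.
active_connections = {server: 0 for server in SERVERS}

def least_connections() -> str:
    server = min(active_connections, key=active_connections.get)
    active_connections[server] += 1   # caller decrements when the request finishes
    return server

for _ in range(4):
    print(round_robin(), least_connections())
```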
As we can see, everything happens through communication, and if the protocols we use are inefficient, we won’t achieve lower latency. The following table shows some key protocols and how they fit different applications to reduce latency:
| Protocol | Description | Example use case |
| --- | --- | --- |
| HTTP/2 | Multiplexes multiple requests and responses over a single connection and compresses headers, reducing connection overhead | E-commerce websites fetching images, product details, stylesheets, etc., simultaneously |
| User Datagram Protocol (UDP) | Connectionless protocol with no handshake or retransmission, trading guaranteed delivery for minimal per-packet overhead | Multiplayer games like Fortnite use UDP for real-time player interactions |
| Quick UDP Internet Connection (QUIC) | UDP-based transport with built-in encryption and faster connection setup than TCP with TLS | Video streaming apps leverage QUIC to establish a connection quickly and start playing video sooner |
| WebSockets | Maintains a persistent, full-duplex connection so the client and server can push messages to each other without repeated HTTP requests | Applications like WhatsApp and Slack use WebSockets to enable instant message delivery |
| Message Queuing Telemetry Transport (MQTT) | Lightweight publish/subscribe protocol designed for constrained devices and unreliable networks | Automotive companies like Tesla use MQTT to collect and transmit vehicle data in real time |
| gRPC Remote Procedure Call | High-performance RPC framework that runs over HTTP/2 and serializes messages with Protocol Buffers | Companies like Netflix use gRPC for fast, efficient communication between microservices |
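To make the UDP row concrete, here is a minimal sketch using Python's standard socket module. A datagram goes out with no connection handshake, which is what keeps per-message overhead low; the port and payload are illustrative:

```python
import socket

# Server: receive datagrams with no connection setup or teardown.
def run_server(port: int = 9999):
    sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    sock.bind(("127.0.0.1", port))
    data, addr = sock.recvfrom(1024)   # blocks until one datagram arrives
    print(f"received {data!r} from {addr}")
    sock.close()

# Client: fire a single datagram; no handshake, no delivery guarantee.
def send_update(port: int = 9999):
    sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    sock.sendto(b"player_position:12,34", ("127.0.0.1", port))
    sock.close()
```

Run run_server() in one process and send_update() in another; unlike TCP, there is no guarantee the datagram arrives, which is the trade-off such games accept for speed.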
The next best practice to lower latency is to optimize the code. We can opt for the following for code optimization:
We must use efficient algorithms to minimize complexity and execution time. For example, choosing quick sort over bubble sort for sorting operations reduces the average time complexity from O(n²) to O(n log n).
Code optimization also means choosing the right data structures. For example, hash tables for fast lookups and balanced binary search trees for efficient insertion and deletion can shave time off hot code paths.
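As a small illustration (the sizes and values are arbitrary), a hash-based set answers membership queries far faster than a list:

```python
import timeit

items_list = list(range(100_000))
items_set = set(items_list)

# Membership checks are O(n) on a list but O(1) on average for a set.
list_time = timeit.timeit(lambda: 99_999 in items_list, number=1_000)
set_time = timeit.timeit(lambda: 99_999 in items_set, number=1_000)
print(f"list: {list_time:.4f}s  set: {set_time:.6f}s")
```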
We must focus on reducing I/O operations, as they are much slower than memory access. We can batch multiple database queries into one or use in-memory databases like Redis. We can also opt for asynchronous processing of I/O tasks to avoid blocking the main execution thread. For example, using async/await lets the application keep serving other work while it waits on I/O.
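Here is a minimal asyncio sketch, with simulated waits standing in for real network or database calls, showing how independent I/O operations can overlap instead of running one after another:

```python
import asyncio

async def fetch(endpoint: str) -> str:
    await asyncio.sleep(0.2)        # stands in for a network or disk wait
    return f"response from {endpoint}"

async def main():
    # The three waits overlap, so total time is ~0.2s instead of ~0.6s.
    results = await asyncio.gather(
        fetch("/users/42"),
        fetch("/orders/42"),
        fetch("/recommendations/42"),
    )
    print(results)

asyncio.run(main())
```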
We should leverage parallel processing or multi-threading to distribute workloads across multiple CPU cores. In Python, libraries like concurrent.futures or multiprocessing can help run CPU-intensive tasks in parallel.
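A minimal sketch with concurrent.futures, assuming the workload can be split into independent chunks (the function and inputs are illustrative):

```python
from concurrent.futures import ProcessPoolExecutor

def heavy_computation(n: int) -> int:
    # Stand-in for a CPU-bound task (e.g., image transcoding, scoring).
    return sum(i * i for i in range(n))

if __name__ == "__main__":
    inputs = [2_000_000, 2_000_000, 2_000_000, 2_000_000]
    # Each task runs in its own process, so the work spreads across CPU cores.
    with ProcessPoolExecutor() as pool:
        results = list(pool.map(heavy_computation, inputs))
    print(results)
```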
We should remove unnecessary code, improve code that is a performance bottleneck, and optimize hot paths to improve execution time. We can shrink large assets such as JavaScript and CSS files, and we can also opt for minification and compression so less data travels over the network.
Last but not least, we can use profiling tools to identify and target performance bottlenecks in our code. Tools such as the GNU profiler (gprof), the Linux perf profiler, and the Visual Studio profiler can help you analyze metrics like CPU time, memory usage, and thread and database activity, allowing you to optimize the critical paths.
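As one concrete option among these, Python's built-in cProfile and pstats modules can highlight where time is spent (the profiled function is illustrative):

```python
import cProfile
import pstats

def slow_endpoint():
    total = 0
    for i in range(1_000_000):      # illustrative hot loop
        total += i * i
    return total

profiler = cProfile.Profile()
profiler.enable()
slow_endpoint()
profiler.disable()

# Print the functions that consumed the most cumulative time.
stats = pstats.Stats(profiler).sort_stats("cumulative")
stats.print_stats(5)
```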
Along with the other practices to lower latency, we should also focus on efficient hardware and infrastructure for our systems, discussed as follows:
Selecting hardware components that are optimized for speed can significantly reduce latency. For example, choosing a solid-state drive (SSD) over a hard disk drive (HDD) cuts storage access times considerably.
Using established cloud infrastructure specialized for low latency can be an optimal choice. AWS, Google Cloud, and Azure provide services like direct interconnects, edge computing, and regional data centers designed to minimize latency. For example, AWS Global Accelerator routes traffic to the optimal endpoint based on latency, ensuring faster responses.
Remember: Implementing low-latency techniques is essential, but actively monitoring your system is even more crucial. By setting up real-time monitoring and alerting, and regularly testing load and performance, you can quickly identify and address issues, ensuring your system remains efficient and responsive.
Another important method to optimize latency is to use caching at different layers. Caching stores frequently accessed data in the cache memory, reducing the time it takes to access data compared to fetching it from the database.
The illustration below depicts how a simple cache operates:
Serving data through a cache is only effective if we know when to remove or update entries. Cache eviction policies handle this, such as least recently used (LRU), least frequently used (LFU), first in, first out (FIFO), and time-to-live (TTL) expiration.
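A minimal LRU eviction sketch using Python's OrderedDict (the capacity and keys are illustrative); in practice you would typically rely on the policy built into your cache, such as Redis's maxmemory-policy:

```python
from collections import OrderedDict

class LRUCache:
    """Evicts the least recently used entry once capacity is reached."""

    def __init__(self, capacity: int = 3):
        self.capacity = capacity
        self.data = OrderedDict()

    def get(self, key):
        if key not in self.data:
            return None
        self.data.move_to_end(key)          # mark as most recently used
        return self.data[key]

    def put(self, key, value):
        if key in self.data:
            self.data.move_to_end(key)
        self.data[key] = value
        if len(self.data) > self.capacity:
            self.data.popitem(last=False)   # evict the least recently used entry

cache = LRUCache(capacity=2)
cache.put("a", 1)
cache.put("b", 2)
cache.get("a")
cache.put("c", 3)          # "b" is evicted because it was used least recently
print(list(cache.data))    # ['a', 'c']
```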
Another aspect that matters while serving data through a cache is consistency. The data might have been updated in the backend while we keep serving clients outdated data from the cache. So it is crucial to keep the cache synchronized by updating or invalidating entries immediately when the underlying data changes.
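Continuing the hypothetical Redis-backed cache from earlier, a simple way to avoid serving stale data is to invalidate the cached entry whenever the underlying record is written:

```python
import redis  # assumes the redis-py client and a running Redis instance

cache = redis.Redis(host="localhost", port=6379)

def save_user_to_db(user: dict) -> None:
    # Placeholder for the real database write.
    pass

def update_user(user: dict) -> None:
    save_user_to_db(user)                 # write the source of truth first
    cache.delete(f"user:{user['id']}")    # then drop the stale cached copy
    # The next read misses the cache and repopulates it with fresh data.
```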
During my time at Meta, I worked with a team on a distributed storage system with a strong emphasis on low latency.
Challenge: At the time, Meta (then Facebook) faced the huge challenge of efficiently storing and retrieving massive amounts of data. Traditional databases weren’t cutting it for its scale and latency requirements.
Solution: We proposed a geographically distributed data store designed to provide efficient and timely access to the complex social graph for Facebook’s extensive user base.
Facebook’s distributed storage system, including projects like TAO (The Associations and Objects) and Scuba (Facebook’s real-time analytics data store), is designed to handle massive amounts of data efficiently.
TAO provides a geographically distributed data store optimized for the social graph, ensuring low-latency reads and writes. Scuba, on the other hand, enables real-time ad-hoc analysis of large datasets for monitoring and troubleshooting. These systems utilize replication, sharding, and caching to ensure data availability, consistency, and quick access, supporting Facebook’s large-scale and dynamic data needs.
We used aggressive (distributed) caching strategies to reduce trips to the database and serve data from locations closer to users. The concept of objects (e.g., users, posts, pages) and associations (the relationships between them, such as the friendships we all have on Facebook) makes it quick to answer queries about user relationships. We used a specialized query engine optimized for both interactive and batch processing, which allows analysts and engineers at Facebook to quickly run queries that retrieve insights from the stored data.
With distributed storage, Facebook can handle billions of read and write operations every second, ensuring users have a smooth, low-latency experience, even during peak times.
Lowering latency in a system isn’t just about one big fix—it’s about a combination of smart choices.
As software engineers, we should follow three key principles during design and development to achieve lower latency from the start: minimizing data processing time, reducing network round-trip times, and managing resources efficiently. Techniques built around these principles can help us hit the latency targets and thresholds we discussed earlier. Following these best practices will create faster, more responsive applications that keep your users happy and engaged.
Pop quiz: Imagine you’re designing a real-time online gaming platform where players worldwide compete in fast-paced games. Even a few milliseconds of delay can decide whether a player wins or loses.
Given the critical need for low latency, which five best practices would you prioritize to ensure a smooth and responsive gaming experience?
Think of a solution, then head to our free lesson on the multi-player game to see how we designed for low latency.
As a reminder, Educative has various courses that help you practice and understand designing for low latency. I've added a few below that you may find interesting.
Grokking the Modern System Design Interview
System Design interviews are now part of every Engineering and Product Management Interview. Interviewers want candidates to exhibit their technical knowledge of core building blocks and the rationale of their design approach. This course presents carefully selected system design problems with detailed solutions that will enable you to handle complex scalability scenarios during an interview or designing new products. You will start with learning a bottom-up approach to designing scalable systems. First, you’ll learn about the building blocks of modern systems, with each component being a completely scalable application in itself. You'll then explore the RESHADED framework for architecting web-scale applications by determining requirements, constraints, and assumptions before diving into a step-by-step design process. Finally, you'll design several popular services by using these modular building blocks in unique combinations, and learn how to evaluate your design.
Another tip: AI mock interviews on Educative are a great way to prepare for curveballs in the interviews by simulating real-world design problems, curated by ex-MAANG employees.
Happy learning!