When I first started working on Meta’s distributed data store, I quickly realized that the biggest challenge wasn’t managing the vast amount of data—it was the relentless battle against latency. The struggle to reduce latency became a central focus of our work, pushing us to be more creative and think strategically about the system’s architecture—we’ll talk more about it later.
In System Design, latency is the delay between a request and a response from the system.
As software engineers, we all know that latency can make or break user experience—every millisecond matters. Keeping it low is an ongoing effort that requires constant innovation and optimization. Understanding how to achieve low latency in System Design can be your secret to stealing the show during your System Design Interview, especially at FAANG.
Let’s explore some of the best practices for achieving low latency in systems, drawing from both personal experiences and industry-proven techniques.
We usually measure it in units of time, such as milliseconds (ms), and have the following main types to consider:
Network latency: The time a data packet takes to travel in a network from source to destination. It includes transmission, propagation, node processing, and queueing delays.
Application latency: The delay at the application server, including processing time, querying the database, time to perform any other computational tasks, etc.
Read or write latency: The delay in reading or writing data from or to storage devices such as disk or memory. It includes data seek time and transfer time.
When we talk about latency, we’re focusing on the following questions:
How quickly is the system acting on requests and sending responses back?
The answer is as quickly as possible.
How does the increasing number of requests affect the latency of the system?
The answer is that, ideally, it shouldn’t: latency should stay stable even as the number of requests grows.
We have to ensure that latency remains as low as possible to keep bounce rates down.
First, let’s give an overview from users’ perspectives: Does latency matter to users?
Google found through an experiment that a 200 ms delay in returning search results led to 0.22% fewer searches in the first three weeks and 0.36% fewer in the second three weeks. A 400 ms delay reduced searches by 0.44% in the first three weeks and 0.76% in the second three weeks. A 500 ms delay dropped traffic by 20%.
Similar findings have been reported across the industry: even small increases in latency measurably reduce user engagement.
Even after resolving the latency issues, winning back users’ trust and engagement takes time and effort. For that, we, as software engineers, should focus on low-latency systems during the design phase based on some threshold values for different applications.
The faster a web page loads, the higher the chances of conversion. This means a user is highly likely to perform the intended operation if a targeted page loads quickly.
A load time of 2.4 seconds yields a 1.9% conversion rate.
A load time of 3.3 seconds yields a 1.5% conversion rate.
A load time of 4.2 seconds yields a <1% conversion rate.
A load time of 5.7+ seconds yields a <0.6% conversion rate.
According to studies, 47% of customers expect a page load time of less than 2 seconds. So, the conversion rate is the first threshold to consider.
The appropriate response time depends on specific use cases. In general, a system is efficient if its average response time is between 0.1 and 1 second. Also, on average, 100 ms of response time is effective for real-time applications such as gaming, chatting, live-streaming, etc.
Note: Educative's course, Grokking the API Design Interview, can help you understand how to achieve these optimal numbers in back-of-the-envelope calculations for latency.
Reduced website or application traffic due to high latency is a significant problem. As software engineers, we know latency isn’t just about speed—it’s about delivering a smooth, responsive experience. To tackle this problem head-on, we need to apply some best practices that can help us minimize latency and keep our systems running quickly and efficiently.
Let’s dive into the key principles of low latency systems to keep in mind while coming up with techniques to lower latency.
The following are the key principles of low-latency System Design:
Minimizing data processing time: This involves optimizing the processing of data within a system to achieve faster computation and response times. Techniques to achieve minimum data processing time include efficient algorithms, parallel processing, and optimizing data storage and retrieval mechanisms.
Reducing network round-trip times: This focuses on decreasing the time it takes for data to travel between different points in a network. Strategies include minimizing the number of network interactions/hops, using efficient communication protocols, and leveraging event-driven architecture to reduce latency.
Efficient resource management: This entails effectively allocating and utilizing computational resources such as CPU, memory, and storage within a system. Techniques include load balancing, partitioning data across nodes (sharding), and implementing caching strategies to optimize performance and lower the latency.
By now, we know that latency matters, and as software engineers, we should focus on solutions that help us achieve low latency. By understanding and applying techniques for these principles, we can significantly improve system performance. So, let’s explore strategies to help us achieve lower latency and more responsive applications.
Ready? Let’s get into it!
Achieving low latency in System Design requires careful consideration of the system architecture. Imagine you’re building a pizza shop in a city with two architectural choices.
The first one is like operating from a single, centralized location. It’s straightforward to manage, but as demand grows, deliveries to customers across the city get slower and slower. That’s the monolithic approach.
The second option is to set up distributed pizza shop branches across the city, each operating independently and delivering to nearby customers. That’s the microservices approach.
We need to make a critical decision between monolithic and microservice architecture. Monolithic applications tend to have higher latency due to the strong interdependencies of components. Microservices tend to have lower latency because they are designed with modular, independent components that can scale and respond more efficiently to requests.
In today’s software landscape, purely monolithic architecture struggles to meet the scalability and latency demands of complex applications, and many teams are moving away from it.
For a deep dive into microservices, check out the article Why Use Microservices? The author explains the architectural insights and provides a comparative analysis.
Event-driven architecture is an approach that centers data flow around events and enables applications to act or react to those events. A client receives data whenever an event is triggered in the application, without having to request it from the server. This approach helps eliminate the constant request-response polling between client and server.
With this pattern, data travels one way whenever an event occurs, which cuts the round-trip time in half and lowers latency. Moreover, asynchronous communication is another factor that helps reduce latency. It benefits applications needing real-time updates or processing, such as trading platforms, gaming servers, real-time analytics systems, etc.
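As a minimal sketch of this idea, assuming a single process and an in-memory queue (a real system would use a message broker or WebSockets), here is an event bus in Python's asyncio where a handler reacts to events as they are published rather than polling for them:

```python
import asyncio

# A minimal event bus: producers publish events, a subscriber reacts when
# events arrive, with no request/response round trip in between.
class EventBus:
    def __init__(self):
        self.queue = asyncio.Queue()

    async def publish(self, event: dict):
        await self.queue.put(event)

    async def subscribe(self, handler):
        while True:
            event = await self.queue.get()
            await handler(event)

async def notify_client(event: dict):
    # Illustrative handler: push the update to a connected client.
    print(f"pushing update to client: {event}")

async def main():
    bus = EventBus()
    consumer = asyncio.create_task(bus.subscribe(notify_client))
    # The producer emits events as they happen; nothing polls for them.
    await bus.publish({"type": "order_created", "id": 42})
    await bus.publish({"type": "order_shipped", "id": 42})
    await asyncio.sleep(0.1)  # give the consumer time to drain the queue
    consumer.cancel()

asyncio.run(main())
```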
Efficient database management and data access are critical for low latency:
We must choose the right database according to the nature of the data to be processed. SQL databases are great choices for structured data and complex queries such as customer information, orders, and product details. NoSQL databases are suitable for faster performance and flexible data models, such as social media posts, user comments, and real-time analytics.
After choosing the right database, we should optimize data retrieval. First, we can index our data and optimize queries to reduce execution time. Second, we can shard and replicate the database, allowing the system to scale and quickly retrieve data.
Sharding: Splitting data across multiple databases or servers to distribute the load reduces latency by allowing parallel data access and processing.
Replication: Creating copies of data across multiple servers to ensure high availability and faster access, thereby reducing latency by serving queries from the closest or least loaded replica.
We can also use in-memory databases like Redis or Memcached as a distributed cache to store frequently accessed data and reduce disk access times.
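The sketch below combines these ideas. It assumes the redis-py client, hypothetical shard hostnames, and a placeholder query function, and shows a read path that checks a distributed cache first and falls back to the owning shard only on a miss:

```python
import hashlib
import json
import redis  # assumes the redis-py client and a running Redis instance

cache = redis.Redis(host="localhost", port=6379)
SHARDS = ["db-shard-0.internal", "db-shard-1.internal", "db-shard-2.internal"]  # hypothetical

def shard_for(key: str) -> str:
    """Stable hash-based routing: the same key always maps to the same shard."""
    digest = hashlib.md5(key.encode()).hexdigest()
    return SHARDS[int(digest, 16) % len(SHARDS)]

def query_shard(shard: str, user_id: int) -> dict:
    # Placeholder for the real (slower) database query against `shard`.
    return {"id": user_id, "name": "Alice", "shard": shard}

def get_user(user_id: int) -> dict:
    key = f"user:{user_id}"
    cached = cache.get(key)
    if cached is not None:                       # cache hit: no database round trip
        return json.loads(cached)
    user = query_shard(shard_for(key), user_id)  # cache miss: read the owning shard
    cache.setex(key, 300, json.dumps(user))      # keep it warm for 5 minutes
    return user
```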
Network design also plays a vital role in minimizing network latency:
Minimizing network hops: The fewer stops (network hops) data has to make, the faster it gets to where it needs to go. Think of it like a direct flight vs. a flight with multiple layovers. This is possible by using techniques such as the following:
Content delivery network (CDN): CDNs bring content closer to users, which speeds up data delivery. For example, a user in New York accessing a website hosted in California would experience lower latency if the content is served from a CDN node in New York.
Geographical distribution: Place CDN servers in different regions so that content is always served from a nearby location. For example, Netflix distributes videos to different geographical locations to minimize latency.
Edge caching: Store frequently accessed content at the edge of the network, right where users are.
Dynamic content acceleration: Use CDNs that speed up the delivery of dynamic (changing) content by optimizing how quickly data is fetched and delivered.
Load balancer: A load balancer is a component that distributes the load across available servers to avoid overloading any single server. When the load is balanced, a server is always ready to handle a new request quickly, reducing the time requests wait to be processed and hence lowering latency. We can opt for the following to balance the load:
Round-robin load balancing: It rotates requests among servers in a fixed order. For example, with four servers, the first request goes to server A, the second to server B, and so on; after server D, the cycle starts again at A.
Application load balancers: These are advanced load balancers that can make intelligent routing decisions based on request content and server load. For example, the AWS Elastic Load Balancer (ELB) can distribute incoming requests to the least busy servers.
Geographic load balancing: It directs users to the server closest to them to cut down on data travel time. For example, a user in Europe is directed to a European server instead of one in the northern US.
Least connections algorithm: This algorithm directs traffic to the server with the fewest active connections or requests. Session persistence load balancing is also useful; it keeps a user’s session on the same server by routing all of that user’s requests there. A minimal sketch of the round-robin and least connections strategies follows this list.
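Here is that sketch. The server names and counters are illustrative, and a real load balancer would also track health checks and decrement counts when requests complete:

```python
import itertools

SERVERS = ["server-a", "server-b", "server-c", "server-d"]  # hypothetical backends

# Round robin: hand requests to servers in a fixed rotation.
_rotation = itertools.cycle(SERVERS)

def round_robin() -> str:
    return next(_rotation)

# Least connections: track active requests and pick the least busy server.
active_connections = {server: 0 for server in SERVERS}

def least_connections() -> str:
    server = min(active_connections, key=active_connections.get)
    active_connections[server] += 1   # caller decrements when the request finishes
    return server

for _ in range(4):
    print(round_robin(), least_connections())
```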
As we can see, everything happens through communication, and if the protocols we use are inefficient, we won’t achieve lower latency. The following table shows some key protocols and how they fit different applications to reduce latency:
| Protocol | Description | Example use case |
| --- | --- | --- |
| HTTP/2 | Multiplexes multiple requests and responses over a single connection and compresses headers, reducing connection overhead | E-commerce websites fetching images, product details, stylesheets, etc., simultaneously |
| User Datagram Protocol (UDP) | Connectionless protocol with no handshake or retransmission, trading guaranteed delivery for minimal per-packet overhead | Multiplayer games like Fortnite use UDP for real-time player interactions |
| Quick UDP Internet Connection (QUIC) | UDP-based transport with built-in encryption and faster connection setup than TCP with TLS | Video streaming apps leverage QUIC to establish a connection quickly and start playing video sooner |
| WebSockets | Maintains a persistent, full-duplex connection so the client and server can push messages to each other without repeated HTTP requests | Applications like WhatsApp and Slack use WebSockets to enable instant message delivery |
| Message Queuing Telemetry Transport (MQTT) | Lightweight publish/subscribe protocol designed for constrained devices and unreliable networks | Automotive companies like Tesla use MQTT to collect and transmit vehicle data in real time |
| gRPC Remote Procedure Call | High-performance RPC framework that runs over HTTP/2 and serializes messages with Protocol Buffers | Companies like Netflix use gRPC for fast, efficient communication between microservices |
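To make the UDP row concrete, here is a minimal sketch using Python's standard socket module. A datagram goes out with no connection handshake, which is what keeps per-message overhead low; the port and payload are illustrative:

```python
import socket

# Server: receive datagrams with no connection setup or teardown.
def run_server(port: int = 9999):
    sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    sock.bind(("127.0.0.1", port))
    data, addr = sock.recvfrom(1024)   # blocks until one datagram arrives
    print(f"received {data!r} from {addr}")
    sock.close()

# Client: fire a single datagram; no handshake, no delivery guarantee.
def send_update(port: int = 9999):
    sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    sock.sendto(b"player_position:12,34", ("127.0.0.1", port))
    sock.close()
```

Run run_server() in one process and send_update() in another; unlike TCP, there is no guarantee the datagram arrives, which is the trade-off such games accept for speed.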
The next best practice to lower latency is to optimize the code. We can opt for the following for code optimization:
We must use efficient algorithms to minimize complexity and execution time. For example, choosing quick sort over bubble sort for sorting operations reduces the average time complexity from O(n²) to O(n log n).
Code optimization also means choosing the right data structures. For example, hash tables for fast lookups and balanced binary search trees for efficient insertion and deletion can shave time off hot code paths.
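As a small illustration (the sizes and values are arbitrary), a hash-based set answers membership queries far faster than a list:

```python
import timeit

items_list = list(range(100_000))
items_set = set(items_list)

# Membership checks are O(n) on a list but O(1) on average for a set.
list_time = timeit.timeit(lambda: 99_999 in items_list, number=1_000)
set_time = timeit.timeit(lambda: 99_999 in items_set, number=1_000)
print(f"list: {list_time:.4f}s  set: {set_time:.6f}s")
```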
We must focus on reducing I/O operations, as they are much slower than memory access. We can batch multiple database queries into one or use in-memory databases like Redis. We can also opt for asynchronous processing of I/O tasks to avoid blocking the main execution thread. For example, using async/await lets the application keep serving other work while it waits on I/O.
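Here is a minimal asyncio sketch, with simulated waits standing in for real network or database calls, showing how independent I/O operations can overlap instead of running one after another:

```python
import asyncio

async def fetch(endpoint: str) -> str:
    await asyncio.sleep(0.2)        # stands in for a network or disk wait
    return f"response from {endpoint}"

async def main():
    # The three waits overlap, so total time is ~0.2s instead of ~0.6s.
    results = await asyncio.gather(
        fetch("/users/42"),
        fetch("/orders/42"),
        fetch("/recommendations/42"),
    )
    print(results)

asyncio.run(main())
```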
We should leverage parallel processing or multi-threading to distribute workloads across multiple CPU cores. In Python, libraries like concurrent.futures or multiprocessing can help run CPU-intensive tasks in parallel.
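A minimal sketch with concurrent.futures, assuming the workload can be split into independent chunks (the function and inputs are illustrative):

```python
from concurrent.futures import ProcessPoolExecutor

def heavy_computation(n: int) -> int:
    # Stand-in for a CPU-bound task (e.g., image transcoding, scoring).
    return sum(i * i for i in range(n))

if __name__ == "__main__":
    inputs = [2_000_000, 2_000_000, 2_000_000, 2_000_000]
    # Each task runs in its own process, so the work spreads across CPU cores.
    with ProcessPoolExecutor() as pool:
        results = list(pool.map(heavy_computation, inputs))
    print(results)
```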
We should remove unnecessary code, improve code that is a performance bottleneck, and optimize hot paths to improve execution time. We can shrink large assets such as JavaScript and CSS files, and we can also opt for minification and compression so less data travels over the network.
Last but not least, we can use profiling tools to identify and target performance bottlenecks in our code. Tools such as the GNU profiler (gprof), the Linux perf profiler, and the Visual Studio profiler can help you analyze metrics like CPU time, memory usage, and thread and database activity, allowing you to optimize the critical paths.
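As one concrete option among these, Python's built-in cProfile and pstats modules can highlight where time is spent (the profiled function is illustrative):

```python
import cProfile
import pstats

def slow_endpoint():
    total = 0
    for i in range(1_000_000):      # illustrative hot loop
        total += i * i
    return total

profiler = cProfile.Profile()
profiler.enable()
slow_endpoint()
profiler.disable()

# Print the functions that consumed the most cumulative time.
stats = pstats.Stats(profiler).sort_stats("cumulative")
stats.print_stats(5)
```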
Along with the other practices to lower latency, we should also focus on efficient hardware and infrastructure for our systems, discussed as follows:
Selecting hardware components that are optimized for speed can significantly reduce latency. For example, choosing a solid-state drive (SSD) over a hard disk drive (HDD) cuts storage access times considerably.
Using established cloud infrastructure specialized for low latency can be an optimal choice. AWS, Google Cloud, and Azure provide services like direct interconnects, edge computing, and regional data centers designed to minimize latency. For example, AWS Global Accelerator routes traffic to the optimal endpoint based on latency, ensuring faster responses.
Remember: Implementing low-latency techniques is essential, but actively monitoring your system is even more crucial. By setting up real-time monitoring and alerting, and regularly testing load and performance, you can quickly identify and address issues, ensuring your system remains efficient and responsive.
Another important method to optimize latency is to use caching at different layers. Caching stores frequently accessed data in the cache memory, reducing the time it takes to access data compared to fetching it from the database.
The illustration below depicts how a simple cache operates:
Serving data through a cache is only effective if we know when to remove or update entries. Cache eviction policies handle this, such as least recently used (LRU), least frequently used (LFU), first in, first out (FIFO), and time-to-live (TTL) expiration.
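A minimal LRU eviction sketch using Python's OrderedDict (the capacity and keys are illustrative); in practice you would typically rely on the policy built into your cache, such as Redis's maxmemory-policy:

```python
from collections import OrderedDict

class LRUCache:
    """Evicts the least recently used entry once capacity is reached."""

    def __init__(self, capacity: int = 3):
        self.capacity = capacity
        self.data = OrderedDict()

    def get(self, key):
        if key not in self.data:
            return None
        self.data.move_to_end(key)          # mark as most recently used
        return self.data[key]

    def put(self, key, value):
        if key in self.data:
            self.data.move_to_end(key)
        self.data[key] = value
        if len(self.data) > self.capacity:
            self.data.popitem(last=False)   # evict the least recently used entry

cache = LRUCache(capacity=2)
cache.put("a", 1)
cache.put("b", 2)
cache.get("a")
cache.put("c", 3)          # "b" is evicted because it was used least recently
print(list(cache.data))    # ['a', 'c']
```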
Another aspect that matters while serving data through a cache is consistency. The data might have been updated in the backend while we keep serving clients outdated data from the cache. So it is crucial to keep the cache synchronized by updating or invalidating entries immediately when the underlying data changes.
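Continuing the hypothetical Redis-backed cache from earlier, a simple way to avoid serving stale data is to invalidate the cached entry whenever the underlying record is written:

```python
import redis  # assumes the redis-py client and a running Redis instance

cache = redis.Redis(host="localhost", port=6379)

def save_user_to_db(user: dict) -> None:
    # Placeholder for the real database write.
    pass

def update_user(user: dict) -> None:
    save_user_to_db(user)                 # write the source of truth first
    cache.delete(f"user:{user['id']}")    # then drop the stale cached copy
    # The next read misses the cache and repopulates it with fresh data.
```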
During my time at Meta, I worked with a team on a distributed storage system with a strong emphasis on low latency.
Challenge: At the time, Meta (then Facebook) faced the huge challenge of efficiently storing and retrieving massive amounts of data. Traditional databases weren’t cutting it for its scale and latency requirements.
Solution: We proposed a geographically distributed data store designed to provide efficient and timely access to the complex social graph for Facebook’s extensive user base.
Facebook’s distributed storage system, including projects like TAO (The Associations and Objects) and Scuba (Facebook’s real-time analytics data store), is designed to handle massive amounts of data efficiently.
TAO provides a geographically distributed data store optimized for the social graph, ensuring low-latency reads and writes. Scuba, on the other hand, enables real-time ad-hoc analysis of large datasets for monitoring and troubleshooting. These systems utilize replication, sharding, and caching to ensure data availability, consistency, and quick access, supporting Facebook’s large-scale and dynamic data needs.
We used aggressive (distributed) caching strategies to reduce trips to the database and serve data from locations closer to users. The concept of objects (e.g., users, posts, pages) and associations (the relationships between them, such as the friendships we all have on Facebook) makes it quick to answer queries about user relationships. We used a specialized query engine optimized for both interactive and batch processing, which allows analysts and engineers at Facebook to quickly run queries that retrieve insights from the stored data.
With distributed storage, Facebook can handle billions of read and write operations every second, ensuring users have a smooth, low-latency experience, even during peak times.
Lowering latency in a system isn’t just about one big fix—it’s about a combination of smart choices.
As software engineers, we should follow three key principles during design and development to achieve lower latency from the start: minimizing data processing time, reducing network round-trip times, and managing resources efficiently. Techniques built around these principles can help us hit the latency targets and thresholds we discussed earlier. Following these best practices will create faster, more responsive applications that keep your users happy and engaged.
Pop quiz: Imagine you’re designing a real-time online gaming platform where players worldwide compete in fast-paced games. Even a few milliseconds of delay can decide whether a player wins or loses.
Given the critical need for low latency, which five best practices would you prioritize to ensure a smooth and responsive gaming experience?
Think of a solution, then head to our free lesson on the multi-player game to see how we designed for low latency.
As a reminder, Educative has various courses that help you practice and understand designing for low latency. I've added a few below that you may find interesting.
Grokking the Modern System Design Interview
System Design interviews are now part of every Engineering and Product Management Interview. Interviewers want candidates to exhibit their technical knowledge of core building blocks and the rationale of their design approach. This course presents carefully selected system design problems with detailed solutions that will enable you to handle complex scalability scenarios during an interview or designing new products. You will start with learning a bottom-up approach to designing scalable systems. First, you’ll learn about the building blocks of modern systems, with each component being a completely scalable application in itself. You'll then explore the RESHADED framework for architecting web-scale applications by determining requirements, constraints, and assumptions before diving into a step-by-step design process. Finally, you'll design several popular services by using these modular building blocks in unique combinations, and learn how to evaluate your design.
Another tip: AI mock interviews on Educative are a great way to prepare for curveballs in the interviews by simulating real-world design problems, curated by ex-MAANG employees.
Happy learning!