System Design has been gaining popularity ever since the term “distributed computing” was coined. It has become standard practice to have it as an interview phase for software engineering/development roles. So, what is System Design? Put simply, it is the process of understanding a system’s requirements and creating an infrastructure to satisfy them.
System Design interviews provide a comprehensive evaluation of candidates’ abilities. By tackling open-ended problems, candidates demonstrate their technical knowledge, creativity, and communication skills—a highly sought-after combination in 2024’s tech landscape.
Want to dive into the exciting world of System Design?
System Design interviews are now part of every Engineering and Product Management interview. Interviewers want candidates to exhibit their technical knowledge of core building blocks and the rationale behind their design approach. This course presents carefully selected system design problems with detailed solutions that will enable you to handle complex scalability scenarios during an interview or when designing new products. You will start by learning a bottom-up approach to designing scalable systems. First, you'll learn about the building blocks of modern systems, with each component being a completely scalable application in itself. You'll then explore the RESHADED framework for architecting web-scale applications by determining requirements, constraints, and assumptions before diving into a step-by-step design process. Finally, you'll design several popular services by using these modular building blocks in unique combinations, and learn how to evaluate your design.
In this primer, we’ll unravel the mysteries of System Design, explore its foundational principles and the interview process, and equip you with the knowledge and confidence to grok your next System Design interview.
Of course, there are hundreds of terms and concepts in System Design. However, based on my experience as both a candidate and an interviewer in several System Design interviews, the following topics are considered the highest priority.
Operating systems (OS) are the backbone of modern computing, but their role extends far beyond simply running applications. For System Designers, a deep understanding of OS internals is crucial. This includes knowledge of process management, memory allocation, concurrency models, and file systems. Familiarity with different OS architectures (e.g., monolithic, microkernel) and their performance implications can also be invaluable in making informed design decisions.
The ultimate guide to Operating Systems
When it comes to operating systems, there are three main concepts: virtualization, concurrency, and persistence. These concepts lay the foundation for understanding how an operating system works. In this extensive course, you'll cover each of them in its entirety. You'll start with the basics of CPU and memory virtualization, such as CPU scheduling, process virtualization, and API virtualization. You will then move on to concurrency concepts, where you'll focus heavily on locks, semaphores, and how to triage concurrency bugs like deadlocks. Towards the end, you'll get plenty of hands-on practice with persistence via I/O devices and file systems. By the time you're done, you'll have mastered everything there is to know about operating systems.
A deep understanding of operating system (OS) concepts is non-negotiable in System Design. From scheduling and resource allocation to process management and concurrency models, these foundational principles empower you to craft robust, efficient, and scalable systems. You’ll gain a competitive edge in System Design interviews by mastering how your design components interact within an OS environment and how user requests are handled. Let’s dive into the essential OS knowledge you need to elevate your System Design skills.
Concurrency, the simultaneous execution of multiple tasks, is a fundamental challenge and opportunity in System Design.
“Concurrency is a way to structure a software system by decomposing it into components that can be executed independently.” — Rob Pike, co-creator of the Go programming language.
Modern computing environments, especially distributed systems like web farms and cloud infrastructure, rely heavily on concurrency to achieve scalability and performance. However, this concurrency necessitates robust thread synchronization and coordination mechanisms to ensure data consistency, prevent race conditions, and avoid deadlocks. The most common synchronization primitives include the following:
Locks: These are the most fundamental synchronization primitives. Various types of locks exist, including mutexes (mutual exclusion locks), read-write locks, and spinlocks, each with different performance characteristics and use cases.
Semaphores: These are generalized locks that allow a limited number of threads to access a resource concurrently. Semaphores are often used to control access to a pool of resources or to implement producer-consumer scenarios.
Condition variables: These enable threads to wait for a specific condition to become true before proceeding. Condition variables are typically used with locks to create more complex synchronization patterns.
Barriers: These are synchronization points where threads wait until all threads in a group have reached the barrier before continuing. Barriers are useful for coordinating the execution of parallel tasks that must be completed in phases.
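To make these primitives concrete, here is a minimal Python sketch using the standard threading module (condition variables are omitted for brevity). It is illustrative only: the worker function, NUM_WORKERS, and the two-slot semaphore are assumptions chosen for the example, not part of any particular system.

```python
import threading

NUM_WORKERS = 4                       # illustrative number of worker threads
pool = threading.Semaphore(2)         # semaphore: at most 2 workers use the "resource" at once
results_lock = threading.Lock()       # mutex guarding the shared results list
phase_barrier = threading.Barrier(NUM_WORKERS)  # all workers sync here before phase 2
results = []

def worker(worker_id: int) -> None:
    with pool:                        # semaphore: limited concurrent access
        with results_lock:            # lock: mutually exclusive update of shared state
            results.append(f"worker {worker_id} finished phase 1")
    phase_barrier.wait()              # barrier: wait until every worker reaches this point
    # phase 2 work would go here

threads = [threading.Thread(target=worker, args=(i,)) for i in range(NUM_WORKERS)]
for t in threads:
    t.start()
for t in threads:
    t.join()
print(results)
```

Running it prints four phase-1 results, and the barrier guarantees that no thread starts phase 2 until all of them have completed phase 1.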
However, there is one common issue with concurrency: synchronization.
When two independent processes run in parallel, synchronization is not a problem (they do not share resources). However, when dealing with dependent processes, synchronizing them is the key to successful concurrency.
Synchronization is the cornerstone of concurrent programming. It’s the art and science of coordinating access to shared resources among multiple processes. Just as traffic lights manage the flow of vehicles, synchronization mechanisms like mutexes (mutual exclusion locks), semaphores, condition variables, and monitors, each with unique strengths and use cases, regulate access to shared data. Choosing the right mechanism ensures correctness and performance in concurrent systems.
After conducting numerous technical interviews, I observed that operating systems is an area where many candidates struggle.
Distributed systems introduce additional challenges due to the lack of shared memory and the potential for network delays and failures. Several algorithms have been developed to address these challenges:
Distributed locks: These implement mutual exclusion in a distributed environment. Common approaches include centralized coordinator-based locks, token-based locks, and quorum-based locks.
Lamport’s logical clocks: These provide a way to order events in a distributed system without relying on physical clocks. Logical clocks are used to ensure causal ordering and to detect inconsistencies.
Vector clocks: These are an extension of Lamport’s logical clocks that provide a more accurate representation of causality in distributed systems. Vector clocks are used in applications like conflict resolution in distributed databases and collaborative editing tools.
Paxos and Raft: These are consensus algorithms that enable a group of distributed processes to agree on a single value even in the presence of failures. Paxos and Raft are used in distributed databases, file systems, and other systems requiring strong consistency guarantees.
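To give a flavor of how logical time works, here is a minimal sketch of a Lamport clock in Python. The class name, the two-process example, and the printed values are illustrative; real systems attach these timestamps to every message.

```python
class LamportClock:
    """Minimal Lamport logical clock: orders events without physical time."""

    def __init__(self) -> None:
        self.time = 0

    def local_event(self) -> int:
        self.time += 1                # rule 1: tick on every local event
        return self.time

    def send(self) -> int:
        self.time += 1                # tick, then attach the timestamp to the message
        return self.time

    def receive(self, message_time: int) -> int:
        # rule 2: jump past the sender's timestamp to preserve causal order
        self.time = max(self.time, message_time) + 1
        return self.time

a, b = LamportClock(), LamportClock()
ts = a.send()          # A sends a message carrying timestamp 1
b.receive(ts)          # B's clock becomes 2, so the receive is ordered after the send
print(a.time, b.time)  # 1 2
```

Vector clocks extend this idea by keeping one counter per process, which is what lets them detect concurrent (causally unrelated) updates.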
While modern operating systems excel at managing concurrent processes on a single machine, their capabilities are inherently confined to that one system. This limitation can be overcome by harnessing the power of computer networks, enabling the distribution of processes across multiple machines, and unlocking new levels of parallelism and scalability.
Continue reading about operating systems with this fundamentals course: Operating Systems: Virtualization, Concurrency & Persistence.
When I started out, I always wondered how computers communicate so well over such vast distances. Then, I discovered that this was a whole field of study called computer networks. You might ask, what is that? A computer network is a collection of interconnected devices that can share information and resources. It is a way for two independent systems to communicate. They are like inter-process communication (IPC), but for separate machines. The machines that need to communicate are usually connected through a shared medium of cables, switches, and routers.
How are the machines in a network arranged? That is described by the network topology. The most common topologies are:
Bus topology: A single cable (the “bus”) connects all nodes linearly. It is simple and inexpensive but vulnerable to single points of failure.
Star topology: Each node connects to a central hub or switch. Offers better fault tolerance than bus topology but requires more cabling.
Ring topology: Nodes are connected in a circular chain, with data flowing in one direction. Offers good fault tolerance and predictable performance but can be complex to manage.
Mesh topology: Nodes are interconnected with multiple redundant links. Highly reliable and fault-tolerant but expensive to implement.
When sending messages through the network, we must be very specific in crafting those messages. Each message must pass through several hops (routers, switches, etc.), and a message meant for machine X must not end up at machine Y; you can imagine the confusion that would cause.
Computer networks form the backbone of the internet, the infrastructure for System Design. Knowing computer networking concepts put me above the average candidate in System Design interviews because I could go deeper into the implementation details, which, to an interviewer, signals real depth of understanding.
The OSI (Open Systems Interconnection) model divides network communication into seven layers (physical, data link, network, transport, session, presentation, and application), each with a well-defined responsibility.
At the transport layer, the two key protocols for network-based communication are TCP (Transmission Control Protocol) and UDP (User Datagram Protocol).
If computer networks are like the highways of the digital world, then network communication protocols are the traffic rules. They govern how data travels, ensuring it reaches its destination efficiently and accurately. Let’s delve into a couple of these protocols and explore how they shape our online experiences.
Pro tip: Consider your application’s specific requirements when choosing a network protocol. Do you need high-speed data transfer? Reliability? Real-time communication?
Think of TCP as the postal service—it ensures your data arrives reliably, in the right order, and with a confirmation. UDP is more like an announcer, quickly sharing information without checking if everyone heard it. TCP is used for things like email and web browsing, where accuracy is vital. UDP is used for live video streaming, where speed is more important than perfect delivery.
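Here is a minimal sketch of that difference using Python's standard socket module. The echo server, port 9999, and the message contents are illustrative assumptions; the point is that the TCP client must establish a connection and reliably gets its bytes back, while the UDP client simply fires a datagram with no handshake or delivery guarantee.

```python
import socket
import threading

HOST, PORT = "127.0.0.1", 9999   # illustrative local address and port
ready = threading.Event()

def tcp_echo_server() -> None:
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as srv:
        srv.bind((HOST, PORT))
        srv.listen(1)
        ready.set()                        # tell the client it is safe to connect
        conn, _ = srv.accept()             # TCP: a connection is established first
        with conn:
            conn.sendall(conn.recv(1024))  # echo the bytes back; delivery is acknowledged

threading.Thread(target=tcp_echo_server, daemon=True).start()
ready.wait()

# TCP client: handshake, then a reliable, ordered byte stream.
with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as tcp_client:
    tcp_client.connect((HOST, PORT))
    tcp_client.sendall(b"hello over TCP")
    print("TCP echo:", tcp_client.recv(1024))

# UDP client: no handshake; the datagram is fired off with no delivery guarantee.
with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as udp_client:
    udp_client.sendto(b"hello over UDP", (HOST, PORT))  # nobody needs to be listening
```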
Now, what if you want two devices on separate networks to communicate? Let’s look at application layer protocols.
HTTP (Hypertext Transfer Protocol) is your browser's language for talking to websites. It allows you to request web pages and send data back. In this scenario, your machine is the browser, requesting information from another machine elsewhere (sometimes across the globe). It's a bit like a conversation, with each request and response forming a step in the dialogue. Whenever you browse the web or call a web API, you can be fairly sure that HTTP is in play.
This protocol started the revolution for apps on the internet. Everything from social media and emails to online games uses HTTP.
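As a small illustration, the following Python snippet (standard library only, internet access assumed, example.com used as a placeholder URL) performs one step of that request/response dialogue:

```python
from urllib.request import urlopen

# One HTTP request/response exchange: ask for a page, inspect the reply.
with urlopen("https://example.com") as response:     # placeholder URL
    print(response.status)                            # e.g., 200 (OK)
    print(response.headers["Content-Type"])           # e.g., text/html; charset=UTF-8
    print(response.read(200).decode("utf-8", errors="replace"))  # first bytes of the page
```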
FTP (File Transfer Protocol) and SMTP (Simple Mail Transfer Protocol), two other fundamental protocols, also play crucial roles in shaping the internet as we know it today. FTP enables the transfer of files between servers and clients, facilitating the sharing and distribution of digital content. SMTP, on the other hand, is responsible for the delivery of emails, allowing users to communicate with each other across the globe.
When designing APIs in a System Design interview, one of the most asked questions is what version of HTTP you would use (HTTP/1.1 or HTTP/2.0) and why.
With the advent of the internet, communication between different machines became essential. Everyone started to find better and safer ways to communicate between networks. That’s how we got RPCs and APIs.
I had always heard of terms like REST and GraphQL as the “essentials for any web application.” This confused me, as REST is simply an approach for building APIs on top of HTTP. Let's clear up what each one actually is.
REST, or Representational State Transfer, is an architectural style for designing networked applications. It relies on a stateless, client-server communication model and leverages HTTP protocols, making it a highly scalable and flexible approach for building APIs. This simplicity and wide adoption have made REST a cornerstone of modern web development, facilitating seamless integration between diverse systems.
Having built several APIs, I have found that REST is particularly effective for public APIs due to its simplicity and statelessness, which enhances scalability and reliability.
GraphQL, on the other hand, is a query language for APIs and a runtime for executing those queries by using a type system you define for your data. It offers a more efficient, powerful, and flexible alternative to REST. With GraphQL, clients can request the data they need, reducing over-fetching and under-fetching of data.
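The following sketch contrasts the two. The user resource, its fields, and the query are made-up examples; they only illustrate how a REST endpoint returns a fixed response shape while a GraphQL client asks for exactly the fields it needs.

```python
# REST: the endpoint decides the response shape, so the client often receives
# the full resource even when it only needs one field (over-fetching).
rest_response = {
    "id": 42,
    "name": "Ada",
    "email": "ada@example.com",
    "created_at": "2024-01-01T00:00:00Z",
    "followers": 1200,
}
name = rest_response["name"]  # everything else was transferred but never used

# GraphQL: the client states exactly which fields it wants in the query,
# and the server returns only those fields.
graphql_query = """
query {
  user(id: 42) {
    name
  }
}
"""
graphql_response = {"data": {"user": {"name": "Ada"}}}
```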
These frameworks were so essential for web development and distributed systems that they caused a paradigm shift. We now have entire system architectures (System Designs) built to support REST or GraphQL.
When I mastered these concepts, I felt I had learned what I needed to know. Then I heard about RPCs and how they are superior in some applications. I scratched my head because I thought REST could do anything. Let me tell you why RPCs change the game.
Remote procedure calls (RPC) allow a program on one computer to execute a function or procedure on another computer as if it were running locally. The request message is sent to the remote computer, which executes the requested function and sends back a response message.
RPCs are used in various applications, from distributed file systems to web services. They simplify the development of distributed systems by abstracting away the details of network communication.
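To show the idea without extra dependencies, here is a minimal RPC round trip using Python's built-in xmlrpc modules (a lightweight stand-in for heavier RPC frameworks; the add function, port 8000, and localhost address are illustrative):

```python
import threading
from xmlrpc.server import SimpleXMLRPCServer
from xmlrpc.client import ServerProxy

# "Remote" procedure that will run in the server process.
def add(a: int, b: int) -> int:
    return a + b

server = SimpleXMLRPCServer(("127.0.0.1", 8000), logRequests=False)
server.register_function(add, "add")
threading.Thread(target=server.serve_forever, daemon=True).start()

# The client calls add() as if it were a local function; the RPC library
# handles serialization and network transport behind the scenes.
proxy = ServerProxy("http://127.0.0.1:8000")
print(proxy.add(2, 3))  # 5
server.shutdown()
```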
Google remote procedure calls (gRPC) is a high-performance, open-source RPC framework developed by Google. It leverages modern technologies like HTTP/2 for transport and Protocol Buffers for serializing structured data.
gRPC offers several advantages over traditional RPC:
Performance: gRPC is designed for high throughput and low latency, making it ideal for modern microservices architectures.
Language: gRPC supports various programming languages, allowing you to build distributed systems using your preferred tools.
Streaming: gRPC supports client-side and server-side streaming, enabling efficient communication for real-time applications.
Note: While gRPC offers numerous advantages, it’s important to consider its learning curve and additional tooling requirements before adopting it.
There are many different network communication protocols, each designed for specific purposes. The most popular ones are listed below.
| Protocol | Description | Use Cases | Strengths | Weaknesses |
| --- | --- | --- | --- | --- |
| TCP | Provides reliable, ordered delivery of data. | Web browsing, email, and file transfer | Reliable; guarantees data delivery. | Slower than UDP due to error checking and retransmission overhead. |
| UDP | Provides fast, connectionless delivery of data. | Video streaming, online gaming, and DNS lookups | Fast, with low latency. | Unreliable; may drop packets, making it unsuitable for applications requiring guaranteed delivery. |
| HTTP | Used for transferring hypertext (web pages). | Web browsing and APIs | Simple and widely supported. | Stateless (each request/response is independent), which can be inefficient for frequent communication. |
| REST | Allows stateless communication between devices. | Public APIs, microservices, and web applications | Simplicity, ease of use, wide adoption, and scalability. | Prone to over-fetching and under-fetching of data. |
| GraphQL | Query language for APIs. | Complex client-server applications (mobile and web) | Precise data fetching, flexibility in evolving APIs, and strong typing. | Steeper learning curve; caching is harder to implement. |
| RPC | Enables a program on one computer to execute a procedure on another computer. | Distributed systems, microservices, and remote file systems | Simplifies the development of distributed systems and abstracts network communication details. | Can be slower than local procedure calls and can introduce security vulnerabilities if not implemented properly. |
| gRPC | A high-performance, open-source RPC framework. | Microservices, cloud-native applications, and real-time communication | Efficient, language-neutral, and supports streaming. | Steeper learning curve than simpler protocols and requires additional tooling. |
| WebSocket | Enables full-duplex communication over a single TCP connection. | Real-time web applications, chat applications, and collaborative editing | Real-time communication and efficient use of network resources. | More complex to implement than HTTP and may not be supported by all clients/servers. |
You might’ve heard of client-server or peer-to-peer models before, but you might not know what they are used for. Trust me, I went through the same thing. They both look and sound similar, but let me explain the difference.
In the client-server model, communication is structured with distinct roles for clients and servers. Clients request services or resources, while servers provide these services or resources. This centralized approach is prevalent in many applications, such as web services, email, and databases. The client-server model simplifies management and scaling, but it can also create a single point of failure if the server goes down.
Conversely, the peer-to-peer (P2P) model distributes the roles more evenly among participating devices, known as peers. Each peer can act as a client and a server, sharing resources directly with others. This decentralization enhances redundancy and resilience, making P2P networks ideal for file sharing, blockchain, and collaborative applications. However, due to its distributed nature, P2P can be more complex to manage.
The client-server model laid the foundation for the internet and distributed systems, giving us architectural styles such as REST and microservices.
Continue learning with this excellent course on computer network fundamentals: Grokking Computer Networking for Software Engineers.
Combining parallel computing concepts with computer networks gives us the basic foundation for distributed systems.
In today’s interconnected world, the software we rely on is rarely confined to a single machine. Instead, it spans multiple computers, working together in a coordinated dance to deliver the expected services. These are distributed systems, and their unique characteristics shape how they are built, maintained, and used.
A distributed system is a collection of interconnected computers working together as a unified system. These systems can be found everywhere: from the cloud infrastructure powering your favorite social media platform to the network of sensors controlling your smart home.
Distributed systems have several key characteristics that distinguish them from traditional, single-machine systems. These characteristics present challenges and opportunities for developers and users alike.
This is the narrative of a System Design interview. We want to design systems that are available, scalable, consistent, performant, and more. Where most candidates fall short is that they claim their system has all these characteristics but fail to justify how.
One of the primary benefits of distributed systems is their ability to scale. Adding more computers to the system can increase its capacity to handle more users, data, or transactions. This scalability allows distributed systems to adapt to growing demands, ensuring a smooth user experience even as the workload increases.
Distributed systems are designed to be highly available, meaning they should be accessible and operational even if some individual computers fail. This is achieved through redundancy and fault tolerance, where data and services are replicated across multiple machines.
To achieve scalability and availability, distributed systems often use replication (copying data across multiple machines) and sharding (dividing data into smaller pieces and distributing them across machines). These techniques help to distribute the load and ensure data remains accessible even in the face of failures.
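As a rough sketch of these two ideas, the snippet below uses hash-based sharding and a simple "home shard plus the next N shards" replication rule. The shard count, replica count, and key format are illustrative assumptions; production systems typically use consistent hashing so that adding or removing shards does not reshuffle most keys.

```python
import hashlib

NUM_SHARDS = 4   # illustrative shard count
REPLICAS = 2     # each key is copied to this many additional shards

def shard_for(key: str) -> int:
    """Hash-based sharding: the same key always maps to the same shard."""
    digest = hashlib.md5(key.encode()).hexdigest()
    return int(digest, 16) % NUM_SHARDS

def replicas_for(key: str) -> list[int]:
    """Replication: store the key on its home shard plus the next REPLICAS shards."""
    home = shard_for(key)
    return [(home + i) % NUM_SHARDS for i in range(REPLICAS + 1)]

print(shard_for("user:42"))      # deterministic shard assignment for a key
print(replicas_for("user:42"))   # e.g., [3, 0, 1]: the shards holding copies of the key
```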
Distributed systems struggle to ensure data consistency across multiple machines. Strict consistency guarantees that every read sees the most recent write, even if it means sacrificing some availability. Eventual consistency prioritizes availability, allowing for temporary inconsistencies that are eventually resolved.
In a distributed system, communication between computers takes time, introducing latency. This can impact the system’s overall performance. Therefore, careful design and optimization are necessary to minimize latency and ensure a responsive user experience.
Distributed systems often involve multiple processes or threads running concurrently across different machines. This can lead to complex interactions and race conditions, where the outcome of an operation depends on the timing of events. Coordination mechanisms are necessary to ensure that the system behaves correctly in concurrency.
Race conditions are a common problem when creating parallel applications. Knowing their mitigation strategies, such as locks, mutexes, and semaphores, is imperative.
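Here is the classic example in Python: several threads incrementing a shared counter. The thread and iteration counts are arbitrary; the point is that the read-modify-write on the counter is not atomic, so without the mutex some increments can be lost, while with it the final total is always correct.

```python
import threading

counter = 0
counter_lock = threading.Lock()

def increment(n: int) -> None:
    global counter
    for _ in range(n):
        # "read, add, write" from two threads can interleave and lose updates;
        # holding the mutex makes the update effectively atomic.
        with counter_lock:
            counter += 1

threads = [threading.Thread(target=increment, args=(100_000,)) for _ in range(4)]
for t in threads:
    t.start()
for t in threads:
    t.join()
print(counter)  # always 400000 with the lock; may be less without it
```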
Due to their larger attack surface, distributed systems can be more vulnerable to security threats than single-machine systems. Implementing robust security measures, such as encryption of data in transit and at rest, authentication, and authorization, is essential.
Monitoring the health and performance of a distributed system is essential to identify and address issues before they impact users. Observability tools provide insights into the system’s behavior, helping operators understand what’s happening and why.
Distributed systems are complex, and failures are inevitable. A resilient system is designed to withstand failures and recover gracefully, minimizing downtime and data loss. Effective error-handling mechanisms are essential for ensuring the system’s reliability.
The 2010 Flash Crash, where the stock market plunged and rebounded rapidly due to algorithmic trading errors, highlights distributed systems’ potential risks and complexities.
Your one-stop shop to master Distributed Systems
This course is about establishing the basic principles of distributed systems. It explains the scope of their functionality by discussing what they can and cannot achieve. It also covers the basic algorithms and protocols of distributed systems through easy-to-follow examples and diagrams that illustrate the thinking behind some design decisions and expand on how they can be practiced. This course also discusses some of the issues that might arise when doing so, eliminates confusion around some terms (e.g., consistency), and fosters thinking about trade-offs when designing distributed systems. Moreover, it provides plenty of additional resources for those who want to invest more time in gaining a deeper understanding of the theoretical aspects of distributed systems.
Understanding distributed systems’ diverse challenges and opportunities sets the stage for the crucial next step: defining the system requirements. After all, a well-designed system begins with a clear vision of what it needs to achieve.
Just as an architect needs a detailed blueprint before constructing a building, software developers need a well-defined set of system requirements before building a software system. These requirements outline what the system should do (functional requirements) and how well it should do it (non-functional requirements).
This is the first design step in any real-world System Design interview. If you do this well, you will have a much higher chance of acing your interview. Learn more about the requirement elicitation process in this course.
Functional requirements: These describe the specific features and capabilities that the system must provide. They answer the question, “What should the system do?” For example, a functional requirement for a social media platform might be “Users should be able to post comments on other users’ posts.”
Non-functional requirements: Also called quality concerns, these describe the system’s quality attributes, such as performance, reliability, and security. They answer, “How well should the system do it?” For example, a non-functional requirement for the same social media platform might be, “The system should be able to handle 10,000 concurrent users with an average response time of less than 2 seconds.”
An easy way to remember the difference is that functional requirements are “what the system does,” while non-functional requirements are “how well the system does what it does.”
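The non-functional example above also lends itself to a quick back-of-envelope check, the kind interviewers like to see. Using Little's Law (throughput is roughly concurrency divided by latency) with the numbers from that requirement:

```python
# Back-of-envelope check of the example non-functional requirement:
# 10,000 concurrent users, each request served within about 2 seconds.
concurrent_users = 10_000
avg_response_time_s = 2

# Little's Law: throughput = concurrency / latency
required_throughput = concurrent_users / avg_response_time_s
print(f"Peak load to plan for: {required_throughput:.0f} requests/second")  # 5000
```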
Think of functional requirements as the user’s wish list. These are the features that directly address their needs and goals. For example, in a social media app, functional requirements might include:
The ability to create a profile.
The ability to post messages, photos, and videos.
The ability to follow other users.
The ability to like and comment on posts.
Identifying these requirements involves close collaboration with stakeholders, including users, product managers, and business analysts. User stories, use cases, and other techniques can help capture and prioritize functional requirements.
Pro tip: Always prioritize user needs and business goals when defining system requirements. A perfectly designed system that doesn’t solve the right problem is ultimately useless.
Non-functional requirements are often overlooked, but they are just as critical to the success of a system as functional requirements. These requirements define the qualities that make the system usable, reliable, and efficient. Examples of non-functional requirements include:
Performance: How fast the system should respond to requests.
Scalability: How well the system should handle increased workload.
Availability: How often the system should be accessible.
Security: How well the system should protect data and prevent unauthorized access.
Usability: How easy the system should be to use.
Determining non-functional requirements requires a deep understanding of the system’s context, its users, and the environment in which it will operate. It’s also important to be realistic and avoid setting unattainable goals.
Scoping is the process of defining the boundaries of your project. It involves deciding which requirements are in scope (i.e., will be addressed) and which are out of scope (i.e., will not be addressed). This is a crucial step, as it helps to manage expectations and prevent scope creep, where the project grows uncontrollably due to the addition of new requirements.
Note: This is essential for a successful System Design interview. If you can balance the scope of functional and non-functional requirements with the time constraint of the interview and the seniority of the position you’re applying for, you are bound for success.
It’s important to strike a balance when scoping functional and non-functional requirements. You don’t want to overload the system with too many features, but you don’t want to sacrifice essential qualities like performance or security.
In an ideal world, you could have a feature-rich, highly performant, scalable, and secure system. But in reality, tradeoffs are often necessary. For example, increasing the level of security might impact performance, or adding more features might make the system harder to use.
The key is prioritizing the most important requirements and making informed decisions about acceptable tradeoffs. This requires careful consideration of the system’s goals, users, and the available resources.
Note: This is another key aspect that will distinguish you in an interview. A seasoned engineer is expected to weigh the tradeoffs between different requirements, reason about them soundly, and implement the right ones in the design.
In the end, defining system requirements is a complex but essential task. It requires clear communication, careful analysis, and the ability to make tough decisions. But when done well, it sets the stage for a successful project that delivers real value to its users.
As we wrap up this System Design primer, remember:
System Design isn’t about code: It’s about understanding user needs, crafting a vision, and making informed decisions about bringing that vision to life.
The fundamentals are key: Mastering concepts like operating systems, networking, distributed systems, and requirement definition lays the groundwork for tackling more complex design challenges.
Tradeoffs are inevitable: There is no such thing as a perfect system. The art lies in making informed compromises that prioritize the most important aspects.
To continue your learning journey, here are some highly recommended resources:
In my upcoming blogs, I’ll dissect real-world architectures, exploring how they tackle the complexities of scalability, reliability, and performance. You’ll also gain practical strategies and insights to help you confidently approach System Design interviews. For now, check out the top-rated System Design preparation course, Grokking Modern System Design Interview. Get ready to level up your understanding and become a true System Design pro!
Ready to test your System Design skills? Try our AI Mock Interviewer and see how you can improve your preparation.