System Design has been gaining popularity ever since the term “distributed computing” was coined. It has become standard practice to have it as an interview phase for software engineering/development roles. So, what is System Design? Put simply, it is the process of understanding a system’s requirements and creating an infrastructure to satisfy them.
System Design interviews provide a comprehensive evaluation of candidates’ abilities. By tackling open-ended problems, candidates demonstrate their technical knowledge, creativity, and communication skills—a highly sought-after combination in 2024’s tech landscape.
Want to dive into the exciting world of System Design?
System Design interviews are now part of every Engineering and Product Management interview. Interviewers want candidates to exhibit their technical knowledge of core building blocks and the rationale behind their design approach. This course presents carefully selected system design problems with detailed solutions that will enable you to handle complex scalability scenarios during an interview or when designing new products. You will start by learning a bottom-up approach to designing scalable systems. First, you'll learn about the building blocks of modern systems, with each component being a completely scalable application in itself. You'll then explore the RESHADED framework for architecting web-scale applications by determining requirements, constraints, and assumptions before diving into a step-by-step design process. Finally, you'll design several popular services by using these modular building blocks in unique combinations, and learn how to evaluate your design.
In this primer, we’ll unravel the mysteries of System Design, explore its foundational principles and the interview process, and equip you with the knowledge and confidence to grok your next System Design interview.
Of course, there are hundreds of terms and concepts in System Design. However, based on my experience as both a candidate and an interviewer in several System Design interviews, the following topics are considered the highest priority.
Operating systems (OS) are the backbone of modern computing, but their role extends far beyond simply running applications. For System Designers, a deep understanding of OS internals is crucial. This includes knowledge of process management, memory allocation, concurrency models, and file systems. Familiarity with different OS architectures (e.g., monolithic, microkernel) and their performance implications can also be invaluable in making informed design decisions.
The ultimate guide to Operating Systems
When it comes to operating systems, there are three main concepts: virtualization, concurrency, and persistence. These concepts lay the foundation for understanding how an operating system works. In this extensive course, you'll cover each of them in its entirety. You'll start with the basics of CPU and memory virtualization, such as CPU scheduling, process virtualization, and API virtualization. You will then move on to concurrency concepts, where you'll focus heavily on locks, semaphores, and how to triage concurrency bugs like deadlocks. Towards the end, you'll get plenty of hands-on practice with persistence via I/O devices and file systems. By the time you're done, you'll have mastered everything there is to know about operating systems.
A deep understanding of operating system (OS) concepts is non-negotiable in System Design. From scheduling and resource allocation to process management and concurrency models, these foundational principles empower you to craft robust, efficient, and scalable systems. You’ll gain a competitive edge in System Design interviews by mastering how your design components interact within an OS environment and how user requests are handled. Let’s dive into the essential OS knowledge you need to elevate your System Design skills.
Concurrency, the simultaneous execution of multiple tasks, is a fundamental challenge and opportunity in System Design.
“Concurrency is a way to structure a software system by decomposing it into components that can be executed independently.” — Rob Pike, co-creator of the Go programming language.
Modern computing environments, especially distributed systems like web farms and cloud infrastructure, rely heavily on concurrency to achieve scalability and performance. However, this concurrency necessitates robust thread synchronization and coordination mechanisms to ensure data consistency, prevent race conditions, and avoid deadlocks. The most common synchronization primitives include the following:
Locks: These are the most fundamental synchronization primitives. Various types of locks exist, including mutexes (mutual exclusion locks), read-write locks, and spinlocks, each with different performance characteristics and use cases.
Semaphores: These are generalized locks that allow a limited number of threads to access a resource concurrently. Semaphores are often used to control access to a pool of resources or to implement producer-consumer scenarios.
Condition variables: These enable threads to wait for a specific condition to become true before proceeding. Condition variables are typically used with locks to create more complex synchronization patterns.
Barriers: These are synchronization points where threads wait until all threads in a group have reached the barrier before continuing. Barriers are useful for coordinating the execution of parallel tasks that must be completed in phases.
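To make these primitives concrete, here is a minimal Python sketch using the standard threading module (condition variables are omitted for brevity). It is illustrative only: the worker function, NUM_WORKERS, and the two-slot semaphore are assumptions chosen for the example, not part of any particular system.

```python
import threading

NUM_WORKERS = 4                       # illustrative number of worker threads
pool = threading.Semaphore(2)         # semaphore: at most 2 workers use the "resource" at once
results_lock = threading.Lock()       # mutex guarding the shared results list
phase_barrier = threading.Barrier(NUM_WORKERS)  # all workers sync here before phase 2
results = []

def worker(worker_id: int) -> None:
    with pool:                        # semaphore: limited concurrent access
        with results_lock:            # lock: mutually exclusive update of shared state
            results.append(f"worker {worker_id} finished phase 1")
    phase_barrier.wait()              # barrier: wait until every worker reaches this point
    # phase 2 work would go here

threads = [threading.Thread(target=worker, args=(i,)) for i in range(NUM_WORKERS)]
for t in threads:
    t.start()
for t in threads:
    t.join()
print(results)
```

Running it prints four phase-1 results, and the barrier guarantees that no thread starts phase 2 until all of them have completed phase 1.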
However, there is one common issue with concurrency: synchronization.
When two independent processes run in parallel, synchronization is not a problem (they do not share resources). However, when dealing with dependent processes, synchronizing them is the key to successful concurrency.
Synchronization is the cornerstone of concurrent programming. It’s the art and science of coordinating access to shared resources among multiple processes. Just as traffic lights manage the flow of vehicles, synchronization mechanisms like mutexes (mutual exclusion locks), semaphores, condition variables, and monitors, each with unique strengths and use cases, regulate access to shared data. Choosing the right mechanism ensures correctness and performance in concurrent systems.
After conducting numerous technical interviews, I observed that operating systems is an area where many candidates struggle.
Distributed systems introduce additional challenges due to the lack of shared memory and the potential for network delays and failures. Several algorithms have been developed to address these challenges:
Distributed locks: These implement mutual exclusion in a distributed environment. Common approaches include centralized coordinator-based locks, token-based locks, and quorum-based locks.
Lamport’s logical clocks: These provide a way to order events in a distributed system without relying on physical clocks. Logical clocks are used to ensure causal ordering and to detect inconsistencies.
Vector clocks: These are an extension of Lamport’s logical clocks that provide a more accurate representation of causality in distributed systems. Vector clocks are used in applications like conflict resolution in distributed databases and collaborative editing tools.
Paxos and Raft: These are consensus algorithms that enable a group of distributed processes to agree on a single value even in the presence of failures. Paxos and Raft are used in distributed databases, file systems, and other systems requiring strong consistency guarantees.
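To give a flavor of how logical time works, here is a minimal sketch of a Lamport clock in Python. The class name, the two-process example, and the printed values are illustrative; real systems attach these timestamps to every message.

```python
class LamportClock:
    """Minimal Lamport logical clock: orders events without physical time."""

    def __init__(self) -> None:
        self.time = 0

    def local_event(self) -> int:
        self.time += 1                # rule 1: tick on every local event
        return self.time

    def send(self) -> int:
        self.time += 1                # tick, then attach the timestamp to the message
        return self.time

    def receive(self, message_time: int) -> int:
        # rule 2: jump past the sender's timestamp to preserve causal order
        self.time = max(self.time, message_time) + 1
        return self.time

a, b = LamportClock(), LamportClock()
ts = a.send()          # A sends a message carrying timestamp 1
b.receive(ts)          # B's clock becomes 2, so the receive is ordered after the send
print(a.time, b.time)  # 1 2
```

Vector clocks extend this idea by keeping one counter per process, which is what lets them detect concurrent (causally unrelated) updates.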
While modern operating systems excel at managing concurrent processes on a single machine, their capabilities are inherently confined to that one system. This limitation can be overcome by harnessing the power of computer networks, enabling the distribution of processes across multiple machines, and unlocking new levels of parallelism and scalability.
Continue reading about operating systems with this fundamentals course: Operating Systems: Virtualization, Concurrency & Persistence.
When I started out, I always wondered how computers communicate so well over such vast distances. Then, I discovered that this was a whole field of study called computer networks. You might ask, what is that? A computer network is a collection of interconnected devices that can share information and resources. It is a way for two independent systems to communicate. They are like inter-process communication (IPC), but for separate machines. The machines that need to communicate are usually connected through a shared medium of cables, switches, and routers.
How are the machines in a network arranged? That is described by the network topology. The most common topologies are:
Bus topology: A single cable (the “bus”) connects all nodes linearly. It is simple and inexpensive but vulnerable to single points of failure.
Star topology: Each node connects to a central hub or switch. Offers better fault tolerance than bus topology but requires more cabling.
Ring topology: Nodes are connected in a circular chain, with data flowing in one direction. Offers good fault tolerance and predictable performance but can be complex to manage.
Mesh topology: Nodes are interconnected with multiple redundant links. Highly reliable and fault-tolerant but expensive to implement.
When sending messages through the network, we must be very specific in crafting those messages. Each message must pass through several hops (routers, switches, etc.), and a message meant for machine X must not end up at machine Y; you can imagine the confusion that would cause.
Computer networks form the backbone of the internet, the infrastructure for System Design. Knowing computer networking concepts put me above the average candidate in System Design interviews because I could go deeper into the implementation details, which, to an interviewer, signals real depth of understanding.
The OSI (Open Systems Interconnection) model divides network communication into seven layers (physical, data link, network, transport, session, presentation, and application), each with a well-defined responsibility.
At the transport layer, the two key protocols for network-based communication are TCP (Transmission Control Protocol) and UDP (User Datagram Protocol).
If computer networks are like the highways of the digital world, then network communication protocols are the traffic rules. They govern how data travels, ensuring it reaches its destination efficiently and accurately. Let’s delve into a couple of these protocols and explore how they shape our online experiences.
Pro tip: Consider your application’s specific requirements when choosing a network protocol. Do you need high-speed data transfer? Reliability? Real-time communication?
Think of TCP as the postal service—it ensures your data arrives reliably, in the right order, and with a confirmation. UDP is more like an announcer, quickly sharing information without checking if everyone heard it. TCP is used for things like email and web browsing, where accuracy is vital. UDP is used for live video streaming, where speed is more important than perfect delivery.
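Here is a minimal sketch of that difference using Python's standard socket module. The echo server, port 9999, and the message contents are illustrative assumptions; the point is that the TCP client must establish a connection and reliably gets its bytes back, while the UDP client simply fires a datagram with no handshake or delivery guarantee.

```python
import socket
import threading

HOST, PORT = "127.0.0.1", 9999   # illustrative local address and port
ready = threading.Event()

def tcp_echo_server() -> None:
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as srv:
        srv.bind((HOST, PORT))
        srv.listen(1)
        ready.set()                        # tell the client it is safe to connect
        conn, _ = srv.accept()             # TCP: a connection is established first
        with conn:
            conn.sendall(conn.recv(1024))  # echo the bytes back; delivery is acknowledged

threading.Thread(target=tcp_echo_server, daemon=True).start()
ready.wait()

# TCP client: handshake, then a reliable, ordered byte stream.
with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as tcp_client:
    tcp_client.connect((HOST, PORT))
    tcp_client.sendall(b"hello over TCP")
    print("TCP echo:", tcp_client.recv(1024))

# UDP client: no handshake; the datagram is fired off with no delivery guarantee.
with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as udp_client:
    udp_client.sendto(b"hello over UDP", (HOST, PORT))  # nobody needs to be listening
```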
Now, what if you want two devices on separate networks to communicate? Let’s look at application layer protocols.
HTTP (Hypertext Transfer Protocol) is your browser's language for talking to websites. It allows you to request web pages and send data back. In this scenario, your machine is the browser, requesting information from another machine elsewhere (sometimes across the globe). It's a bit like a conversation, with each request and response forming a step in the dialogue. Whenever you browse the web or call a web API, you can be fairly sure that HTTP is in play.
This protocol started the revolution for apps on the internet. Everything from social media and emails to online games uses HTTP.
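As a small illustration, the following Python snippet (standard library only, internet access assumed, example.com used as a placeholder URL) performs one step of that request/response dialogue:

```python
from urllib.request import urlopen

# One HTTP request/response exchange: ask for a page, inspect the reply.
with urlopen("https://example.com") as response:     # placeholder URL
    print(response.status)                            # e.g., 200 (OK)
    print(response.headers["Content-Type"])           # e.g., text/html; charset=UTF-8
    print(response.read(200).decode("utf-8", errors="replace"))  # first bytes of the page
```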
FTP (File Transfer Protocol) and SMTP (Simple Mail Transfer Protocol), two other fundamental protocols, also play crucial roles in shaping the internet as we know it today. FTP enables the transfer of files between servers and clients, facilitating the sharing and distribution of digital content. SMTP, on the other hand, is responsible for the delivery of emails, allowing users to communicate with each other across the globe.
When designing APIs in a System Design interview, one of the most asked questions is what version of HTTP you would use (HTTP/1.1 or HTTP/2.0) and why.
With the advent of the internet, communication between different machines became essential. Everyone started to find better and safer ways to communicate between networks. That’s how we got RPCs and APIs.
I had always heard of terms like REST and GraphQL as the “essentials for any web application.” This confused me, as REST is simply an approach for building APIs on top of HTTP. Let's clear up what each one actually is.
REST, or Representational State Transfer, is an architectural style for designing networked applications. It relies on a stateless, client-server communication model and leverages HTTP protocols, making it a highly scalable and flexible approach for building APIs. This simplicity and wide adoption have made REST a cornerstone of modern web development, facilitating seamless integration between diverse systems.
Having built several APIs, I have found that REST is particularly effective for public APIs due to its simplicity and statelessness, which enhances scalability and reliability.
GraphQL, on the other hand, is a query language for APIs and a runtime for executing those queries by using a type system you define for your data. It offers a more efficient, powerful, and flexible alternative to REST. With GraphQL, clients can request the data they need, reducing over-fetching and under-fetching of data.
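The following sketch contrasts the two. The user resource, its fields, and the query are made-up examples; they only illustrate how a REST endpoint returns a fixed response shape while a GraphQL client asks for exactly the fields it needs.

```python
# REST: the endpoint decides the response shape, so the client often receives
# the full resource even when it only needs one field (over-fetching).
rest_response = {
    "id": 42,
    "name": "Ada",
    "email": "ada@example.com",
    "created_at": "2024-01-01T00:00:00Z",
    "followers": 1200,
}
name = rest_response["name"]  # everything else was transferred but never used

# GraphQL: the client states exactly which fields it wants in the query,
# and the server returns only those fields.
graphql_query = """
query {
  user(id: 42) {
    name
  }
}
"""
graphql_response = {"data": {"user": {"name": "Ada"}}}
```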
These frameworks were so essential for web development and distributed systems that they caused a paradigm shift. We now have entire system architectures (System Designs) built to support REST or GraphQL.
When I mastered these concepts, I felt I had learned what I needed to know. Then I heard about RPCs and how they are superior in some applications. I scratched my head because I thought REST could do anything. Let me tell you why RPCs change the game.
Remote procedure calls (RPC) allow a program on one computer to execute a function or procedure on another computer as if it were running locally. The request message is sent to the remote computer, which executes the requested function and sends back a response message.
RPCs are used in various applications, from distributed file systems to web services. They simplify the development of distributed systems by abstracting away the details of network communication.
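To show the idea without extra dependencies, here is a minimal RPC round trip using Python's built-in xmlrpc modules (a lightweight stand-in for heavier RPC frameworks; the add function, port 8000, and localhost address are illustrative):

```python
import threading
from xmlrpc.server import SimpleXMLRPCServer
from xmlrpc.client import ServerProxy

# "Remote" procedure that will run in the server process.
def add(a: int, b: int) -> int:
    return a + b

server = SimpleXMLRPCServer(("127.0.0.1", 8000), logRequests=False)
server.register_function(add, "add")
threading.Thread(target=server.serve_forever, daemon=True).start()

# The client calls add() as if it were a local function; the RPC library
# handles serialization and network transport behind the scenes.
proxy = ServerProxy("http://127.0.0.1:8000")
print(proxy.add(2, 3))  # 5
server.shutdown()
```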
Google remote procedure calls (gRPC) is a high-performance, open-source RPC framework developed by Google. It leverages modern technologies like HTTP/2 for transport and Protocol Buffers for serializing structured data.
gRPC offers several advantages over traditional RPC:
Performance: gRPC is designed for high throughput and low latency, making it ideal for modern microservices architectures.
Language: gRPC supports various programming languages, allowing you to build distributed systems using your preferred tools.
Streaming: gRPC supports client-side and server-side streaming, enabling efficient communication for real-time applications.
Note: While gRPC offers numerous advantages, it’s important to consider its learning curve and additional tooling requirements before adopting it.
There are many different network communication protocols, each designed for specific purposes. The most popular ones are listed below.
| Protocol | Description | Use Cases | Strengths | Weaknesses |
| --- | --- | --- | --- | --- |
| TCP | Provides reliable, ordered delivery of data. | Web browsing, email, and file transfer | Reliable; guarantees data delivery. | Slower than UDP due to error checking and retransmission overhead. |
| UDP | Provides fast, connectionless delivery of data. | Video streaming, online gaming, and DNS lookups | Fast, with low latency. | Unreliable; may drop packets, making it unsuitable for applications requiring guaranteed delivery. |
| HTTP | Used for transferring hypertext (web pages). | Web browsing and APIs | Simple and widely supported. | Stateless (each request/response is independent), which can be inefficient for frequent communication. |
| REST | Allows stateless communication between devices. | Public APIs, microservices, and web applications | Simplicity, ease of use, wide adoption, and scalability. | Prone to over-fetching and under-fetching of data. |
| GraphQL | Query language for APIs. | Complex client-server applications (mobile and web) | Precise data fetching, flexibility in evolving APIs, and strong typing. | Steeper learning curve; caching is harder to implement. |
| RPC | Enables a program on one computer to execute a procedure on another computer. | Distributed systems, microservices, and remote file systems | Simplifies the development of distributed systems and abstracts network communication details. | Can be slower than local procedure calls and can introduce security vulnerabilities if not implemented properly. |
| gRPC | A high-performance, open-source RPC framework. | Microservices, cloud-native applications, and real-time communication | Efficient, language-neutral, and supports streaming. | Steeper learning curve than simpler protocols and requires additional tooling. |
| WebSocket | Enables full-duplex communication over a single TCP connection. | Real-time web applications, chat applications, and collaborative editing | Real-time communication and efficient use of network resources. | More complex to implement than HTTP and may not be supported by all clients/servers. |
You might’ve heard of client-server or peer-to-peer models before, but you might not know what they are used for. Trust me, I went through the same thing. They both look and sound similar, but let me explain the difference.
In the client-server model, communication is structured with distinct roles for clients and servers. Clients request services or resources, while servers provide these services or resources. This centralized approach is prevalent in many applications, such as web services, email, and databases. The client-server model simplifies management and scaling, but it can also create a single point of failure if the server goes down.
Conversely, the peer-to-peer (P2P) model distributes the roles more evenly among participating devices, known as peers. Each peer can act as a client and a server, sharing resources directly with others. This decentralization enhances redundancy and resilience, making P2P networks ideal for file sharing, blockchain, and collaborative applications. However, due to its distributed nature, P2P can be more complex to manage.
The client-server model laid the foundation for the internet and distributed systems, giving us architectural styles such as REST and microservices.
Continue learning with this excellent course on computer network fundamentals: Grokking Computer Networking for Software Engineers.
Combining parallel computing concepts with computer networks gives us the basic foundation for distributed systems.
In today’s interconnected world, the software we rely on is rarely confined to a single machine. Instead, it spans multiple computers, working together in a coordinated dance to deliver the expected services. These are distributed systems, and their unique characteristics shape how they are built, maintained, and used.
A distributed system is a collection of interconnected computers working together as a unified system. These systems can be found everywhere: from the cloud infrastructure powering your favorite social media platform to the network of sensors controlling your smart home.
Distributed systems have several key characteristics that distinguish them from traditional, single-machine systems. These characteristics present challenges and opportunities for developers and users alike.
This is the narrative of a System Design interview. We want to design systems that are available, scalable, consistent, performant, and more. Where most candidates fall short is that they claim their system has all these characteristics but fail to justify how.
One of the primary benefits of distributed systems is their ability to scale. Adding more computers to the system can increase its capacity to handle more users, data, or transactions. This scalability allows distributed systems to adapt to growing demands, ensuring a smooth user experience even as the workload increases.
Distributed systems are designed to be highly available, meaning they should be accessible and operational even if some individual computers fail. This is achieved through redundancy and fault tolerance, where data and services are replicated across multiple machines.
To achieve scalability and availability, distributed systems often use replication (copying data across multiple machines) and sharding (dividing data into smaller pieces and distributing them across machines). These techniques help to distribute the load and ensure data remains accessible even in the face of failures.
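As a rough sketch of these two ideas, the snippet below uses hash-based sharding and a simple "home shard plus the next N shards" replication rule. The shard count, replica count, and key format are illustrative assumptions; production systems typically use consistent hashing so that adding or removing shards does not reshuffle most keys.

```python
import hashlib

NUM_SHARDS = 4   # illustrative shard count
REPLICAS = 2     # each key is copied to this many additional shards

def shard_for(key: str) -> int:
    """Hash-based sharding: the same key always maps to the same shard."""
    digest = hashlib.md5(key.encode()).hexdigest()
    return int(digest, 16) % NUM_SHARDS

def replicas_for(key: str) -> list[int]:
    """Replication: store the key on its home shard plus the next REPLICAS shards."""
    home = shard_for(key)
    return [(home + i) % NUM_SHARDS for i in range(REPLICAS + 1)]

print(shard_for("user:42"))      # deterministic shard assignment for a key
print(replicas_for("user:42"))   # e.g., [3, 0, 1]: the shards holding copies of the key
```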
Distributed systems struggle to ensure data consistency across multiple machines. Strict consistency guarantees that every read sees the most recent write, even if it means sacrificing some availability. Eventual consistency prioritizes availability, allowing for temporary inconsistencies that are eventually resolved.
In a distributed system, communication between computers takes time, introducing latency. This can impact the system’s overall performance. Therefore, careful design and optimization are necessary to minimize latency and ensure a responsive user experience.
Distributed systems often involve multiple processes or threads running concurrently across different machines. This can lead to complex interactions and race conditions, where the outcome of an operation depends on the timing of events. Coordination mechanisms are necessary to ensure that the system behaves correctly in concurrency.
Race conditions are a common problem when creating parallel applications. Knowing their mitigation strategies, such as locks, mutexes, and semaphores, is imperative.
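Here is the classic example in Python: several threads incrementing a shared counter. The thread and iteration counts are arbitrary; the point is that the read-modify-write on the counter is not atomic, so without the mutex some increments can be lost, while with it the final total is always correct.

```python
import threading

counter = 0
counter_lock = threading.Lock()

def increment(n: int) -> None:
    global counter
    for _ in range(n):
        # "read, add, write" from two threads can interleave and lose updates;
        # holding the mutex makes the update effectively atomic.
        with counter_lock:
            counter += 1

threads = [threading.Thread(target=increment, args=(100_000,)) for _ in range(4)]
for t in threads:
    t.start()
for t in threads:
    t.join()
print(counter)  # always 400000 with the lock; may be less without it
```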
Due to their larger attack surface, distributed systems can be more vulnerable to security threats than single-machine systems. Implementing robust security measures, such as encryption of data in transit and at rest, authentication, and authorization, is essential.
Monitoring the health and performance of a distributed system is essential to identify and address issues before they impact users. Observability tools provide insights into the system’s behavior, helping operators understand what’s happening and why.
Distributed systems are complex, and failures are inevitable. A resilient system is designed to withstand failures and recover gracefully, minimizing downtime and data loss. Effective error-handling mechanisms are essential for ensuring the system’s reliability.
The 2010 Flash Crash, where the stock market plunged and rebounded rapidly due to algorithmic trading errors, highlights distributed systems’ potential risks and complexities.
Your one-stop shop to master Distributed Systems
This course is about establishing the basic principles of distributed systems. It explains the scope of their functionality by discussing what they can and cannot achieve. It also covers the basic algorithms and protocols of distributed systems through easy-to-follow examples and diagrams that illustrate the thinking behind some design decisions and expand on how they can be practiced. This course also discusses some of the issues that might arise when doing so, eliminates confusion around some terms (e.g., consistency), and fosters thinking about trade-offs when designing distributed systems. Moreover, it provides plenty of additional resources for those who want to invest more time in gaining a deeper understanding of the theoretical aspects of distributed systems.
Understanding distributed systems’ diverse challenges and opportunities sets the stage for the crucial next step: defining the system requirements. After all, a well-designed system begins with a clear vision of what it needs to achieve.
Just as an architect needs a detailed blueprint before constructing a building, software developers need a well-defined set of system requirements before building a software system. These requirements outline what the system should do (functional requirements) and how well it should do it (non-functional requirements).
This is the first design step in any real-world System Design interview. If you do this well, you will have a much higher chance of acing your interview. Learn more about the requirement elicitation process in this course.
Functional requirements: These describe the specific features and capabilities that the system must provide. They answer the question, “What should the system do?” For example, a functional requirement for a social media platform might be “Users should be able to post comments on other users’ posts.”
Non-functional requirements: Also called quality concerns, these describe the system’s quality attributes, such as performance, reliability, and security. They answer, “How well should the system do it?” For example, a non-functional requirement for the same social media platform might be, “The system should be able to handle 10,000 concurrent users with an average response time of less than 2 seconds.”
An easy way to remember the difference is that functional requirements are “what the system does,” while non-functional requirements are “how well the system does what it does.”
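The non-functional example above also lends itself to a quick back-of-envelope check, the kind interviewers like to see. Using Little's Law (throughput is roughly concurrency divided by latency) with the numbers from that requirement:

```python
# Back-of-envelope check of the example non-functional requirement:
# 10,000 concurrent users, each request served within about 2 seconds.
concurrent_users = 10_000
avg_response_time_s = 2

# Little's Law: throughput = concurrency / latency
required_throughput = concurrent_users / avg_response_time_s
print(f"Peak load to plan for: {required_throughput:.0f} requests/second")  # 5000
```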
Think of functional requirements as the user’s wish list. These are the features that directly address their needs and goals. For example, in a social media app, functional requirements might include:
The ability to create a profile.
The ability to post messages, photos, and videos.
The ability to follow other users.
The ability to like and comment on posts.
Identifying these requirements involves close collaboration with stakeholders, including users, product managers, and business analysts. User stories, use cases, and other techniques can help capture and prioritize functional requirements.
Pro tip: Always prioritize user needs and business goals when defining system requirements. A perfectly designed system that doesn’t solve the right problem is ultimately useless.
Non-functional requirements are often overlooked, but they are just as critical to the success of a system as functional requirements. These requirements define the qualities that make the system usable, reliable, and efficient. Examples of non-functional requirements include:
Performance: How fast the system should respond to requests.
Scalability: How well the system should handle increased workload.
Availability: How often the system should be accessible.
Security: How well the system should protect data and prevent unauthorized access.
Usability: How easy the system should be to use.
Determining non-functional requirements requires a deep understanding of the system’s context, its users, and the environment in which it will operate. It’s also important to be realistic and avoid setting unattainable goals.
Scoping is the process of defining the boundaries of your project. It involves deciding which requirements are in scope (i.e., will be addressed) and which are out of scope (i.e., will not be addressed). This is a crucial step, as it helps to manage expectations and prevent scope creep, where the project grows uncontrollably due to the addition of new requirements.
Note: This is essential for a successful System Design interview. If you can balance the scope of functional and non-functional requirements with the time constraint of the interview and the seniority of the position you’re applying for, you are bound for success.
It’s important to strike a balance when scoping functional and non-functional requirements. You don’t want to overload the system with too many features, but you don’t want to sacrifice essential qualities like performance or security.
In an ideal world, you could have a feature-rich, highly performant, scalable, and secure system. But in reality, tradeoffs are often necessary. For example, increasing the level of security might impact performance, or adding more features might make the system harder to use.
The key is prioritizing the most important requirements and making informed decisions about acceptable tradeoffs. This requires careful consideration of the system’s goals, users, and the available resources.
Note: This is another key aspect that will distinguish you in an interview. A seasoned engineer is expected to weigh the tradeoffs between different requirements, reason about them soundly, and implement the right ones in the design.
In the end, defining system requirements is a complex but essential task. It requires clear communication, careful analysis, and the ability to make tough decisions. But when done well, it sets the stage for a successful project that delivers real value to its users.
As we wrap up this System Design primer, remember:
System Design isn’t about code: It’s about understanding user needs, crafting a vision, and making informed decisions about bringing that vision to life.
The fundamentals are key: Mastering concepts like operating systems, networking, distributed systems, and requirement definition lays the groundwork for tackling more complex design challenges.
Tradeoffs are inevitable: There is no such thing as a perfect system. The art lies in making informed compromises that prioritize the most important aspects.
To continue your learning journey, here are some highly recommended resources:
In my upcoming blogs, I’ll dissect real-world architectures, exploring how they tackle the complexities of scalability, reliability, and performance. You’ll also gain practical strategies and insights to help you confidently approach System Design interviews. For now, check out the top-rated System Design preparation course, Grokking Modern System Design Interview. Get ready to level up your understanding and become a true System Design pro!
Ready to test your System Design skills? Try our AI Mock Interviewer and see how you can improve your preparation.