Design Considerations of a Blob Store

Learn more details about the different design aspects of the blob store.

We'll cover the following...

Introduction
Blob metadata
Partition data
Blob indexing
Pagination for listing
Replication
- Synchronous replication within a storage cluster
- Asynchronous replication across data centers and regions
Garbage collection while deleting a blob
Stream a file
Cache the blob store

Even though we discussed the design of the blob store system and its major components in detail in the previous lesson, a number of interesting questions still require answers. For example, how do we store large blobs? Do we store each blob on a single disk or a single machine, or do we divide it into chunks? How many replicas of a blob should we make to ensure reliability and availability? How do we search for and retrieve blobs quickly? These are just some of the questions that might come up.

This lesson addresses these important design concerns. The table below summarizes the goals of this lesson.

Summary of the Lesson

Section	Purpose
Blob metadata	This is the metadata that’s maintained to ensure efficient storage and retrieval of blobs.
Partitioning	This determines how blobs are partitioned among different data nodes.
Blob indexing	This shows us how to efficiently search for blobs.
Pagination	This teaches us how to retrieve a limited number of blobs at a time to improve readability and loading time.
Replication	This teaches us how to replicate blobs and tells us how many copies we should maintain to improve availability.
Garbage collection	This teaches us how to delete blobs without sacrificing performance.
Streaming	This teaches us how to stream large files chunk-by-chunk to facilitate interactivity for users.
Caching	This shows us how to improve response time and throughput.

Before we answer the questions listed above, let’s look at the layers of abstraction that hide the internal complexity of a blob store from the user. These abstraction layers also help us make design-related decisions.

There are three layers of abstraction:

User account: Users are uniquely identified at this layer by their account_ID. Blobs uploaded by users are maintained in their containers.
Container: Each user has a set of containers, each uniquely identified by its container_ID. These containers contain blobs.
Blob: Blobs are uniquely identified by their blob_ID. This layer maintains the blob metadata that’s vital for achieving the availability and reliability of the system.
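
The three layers above can be pictured as nested lookups. The following is a minimal sketch, not the blob store's actual data model: the class names, the `find_blob` helper, and the sample IDs are all hypothetical and chosen only for illustration.

```python
from dataclasses import dataclass, field


@dataclass
class Blob:
    blob_id: str
    # Placeholder for the metadata the blob layer maintains (assumed shape).
    chunk_ids: list[str] = field(default_factory=list)


@dataclass
class Container:
    container_id: str
    blobs: dict[str, Blob] = field(default_factory=dict)  # blob_ID -> Blob


@dataclass
class Account:
    account_id: str
    containers: dict[str, Container] = field(default_factory=dict)  # container_ID -> Container


def find_blob(account: Account, container_id: str, blob_id: str) -> Blob:
    """Walk account -> container -> blob; raises KeyError if a level is missing."""
    return account.containers[container_id].blobs[blob_id]


# Hypothetical sample data.
account = Account("acct-1")
account.containers["photos"] = Container("photos")
account.containers["photos"].blobs["blob-42"] = Blob("blob-42")
```

Calling `find_blob(account, "photos", "blob-42")` resolves one ID per layer, which is why each layer only needs IDs that are unique within its own scope.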

When a user uploads a blob, it’s split into small chunks. A chunk is the minimum unit of data for writing and reading. Splitting lets us store large files that can’t fit in one contiguous location, on one data node, or in one block of a disk associated with that data node. The chunks of a single blob are then stored on different data nodes that have enough free space to hold them. The store holds billions of blobs, and the manager node has to record every blob’s chunks and where they are stored so that it can retrieve the chunks on reads. The manager node assigns an ID to each chunk.
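
As a rough illustration of the upload path described above, the sketch below splits a blob into fixed-size chunks and lets a manager node assign chunk IDs and data nodes. The 64 MB default, the `ManagerNode` class, and the "most free space first" placement policy are assumptions made for this example, not details taken from the lesson.

```python
import itertools

CHUNK_SIZE = 64 * 1024 * 1024  # assumed default chunk size: 64 MB


def split_into_chunks(blob_size: int, chunk_size: int = CHUNK_SIZE) -> list[tuple[int, int]]:
    """Return (offset, length) for each chunk; the last chunk may be shorter."""
    return [
        (offset, min(chunk_size, blob_size - offset))
        for offset in range(0, blob_size, chunk_size)
    ]


class ManagerNode:
    """Hypothetical manager node that tracks where every chunk lives."""

    def __init__(self, free_space: dict[str, int]):
        self.free_space = dict(free_space)        # data node -> free bytes
        self.chunk_locations: dict[int, str] = {}  # chunk ID -> data node
        self._next_chunk_id = itertools.count(1)

    def place_blob(self, blob_size: int, chunk_size: int = CHUNK_SIZE) -> list[int]:
        """Assign an ID and a data node to each chunk of a new blob."""
        chunk_ids = []
        for _, length in split_into_chunks(blob_size, chunk_size):
            # Assumed policy: pick the data node with the most free space.
            node = max(self.free_space, key=self.free_space.get)
            if self.free_space[node] < length:
                raise RuntimeError("no data node has enough free space")
            self.free_space[node] -= length
            chunk_id = next(self._next_chunk_id)
            self.chunk_locations[chunk_id] = node
            chunk_ids.append(chunk_id)
        return chunk_ids


# Small numbers stand in for bytes to keep the example readable.
manager = ManagerNode({"dn1": 100, "dn2": 100})
chunk_ids = manager.place_blob(128, chunk_size=64)
```

With two equally empty data nodes, the two chunks end up on different nodes, which matches the idea that the chunks of one blob are spread across data nodes.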

We split each blob into equal-sized chunks. The information stored about a blob consists of its chunk IDs and the name of the data node assigned to each chunk. Chunks are replicated so that the blob survives data node failures, so we also store the replica IDs for each chunk. The manager node therefore has all of this information available for every blob.

Let’s say we have a 128 MB blob, and we split it into two chunks of 64 MB each. The metadata for this blob is shown in the following table:

Blob Metadata

Chunk	Data node ID	Replica 1 ID	Replica 2 ID	Replica 3 ID
1	d1b1	r1b1	r2b1	r3b1
2	d1b2	r1b2	r2b2	r3b2

The table below summarizes what is stored at each of the three abstraction layers and how each layer is sharded:

Layered Information

Level	Uniquely identified by	Information	Sharded by	Mapping
User’s blob store account	`account_ID`	List of `container_ID` values	`account_ID`	Account -> list of containers
Container	`container_ID`	List of `blob_ID` values	`container_ID`	Container -> list of blobs
Blob	`blob_ID`	{list of chunks, chunkInfo: data node IDs, ...}	`blob_ID`	Blob -> list of chunks
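
The blob-to-chunks mapping and the chunk table for the 128 MB example could be held in a structure like the one below. This is a sketch under assumptions: the `ChunkInfo` type, the `blob-1` ID, and the `read_locations` helper are hypothetical, and the replica IDs are treated as alternative locations to read from when the primary data node is unavailable.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ChunkInfo:
    chunk_number: int
    data_node_id: str
    replica_ids: tuple[str, ...]


# Blob -> list of chunks, using the IDs from the example table.
blob_metadata = {
    "blob-1": [  # hypothetical blob_ID for the 128 MB blob
        ChunkInfo(1, "d1b1", ("r1b1", "r2b1", "r3b1")),
        ChunkInfo(2, "d1b2", ("r1b2", "r2b2", "r3b2")),
    ]
}


def read_locations(blob_id: str, failed: frozenset[str] = frozenset()) -> list[str]:
    """For each chunk, in order, pick the first location that has not failed."""
    locations = []
    for chunk in sorted(blob_metadata[blob_id], key=lambda c: c.chunk_number):
        candidates = (chunk.data_node_id, *chunk.replica_ids)
        healthy = [node for node in candidates if node not in failed]
        if not healthy:
            raise LookupError(f"chunk {chunk.chunk_number} has no healthy copy")
        locations.append(healthy[0])
    return locations
```

For instance, `read_locations("blob-1", frozenset({"d1b1"}))` falls back to the first replica of chunk 1 while still reading chunk 2 from its primary data node, which is the reason the replica IDs are stored alongside each chunk.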
