Design Considerations in a Blob Store

Let’s look into the different design aspects of the blob store in more detail.

Introduction

Even though we discussed the design and its major components in detail in the previous lesson, a number of interesting questions still need answering. For example, how do we store large blobs: on a single disk, on a single machine, or divided into chunks? How many replicas of a blob should we make to ensure reliability and availability? How do we search for and retrieve blobs quickly?

This lesson answers such important design concerns. The table below summarizes the goals of this lesson.

Summary of the Lesson

| Section | Purpose |
|---|---|
| Blob metadata | What metadata is maintained to ensure efficient storage and retrieval of blobs |
| Partitioning | How blobs are partitioned among different data nodes |
| Blob indexing | How to efficiently search for blobs |
| Pagination | How to retrieve a limited number of blobs at a time to improve readability and loading time |
| Replication | How to replicate and how many copies to maintain to improve availability |
| Garbage collection | How to delete blobs without sacrificing performance |
| Streaming | How to stream large files chunk by chunk to facilitate interactivity for the user |
| Caching | How to improve response time and throughput |

Before we answer the questions above, let’s look at how we create layers of abstractions for the user to hide the internal complexity of the blob store. These abstraction layers help us with design decisions as well.

We have three layers:

  1. User account: Users are uniquely identified at this layer by their account_ID. Blobs uploaded by users are maintained in their containers.
  2. Container: Each user has a set of containers, each uniquely identified by a container_ID. These containers contain blobs.
  3. Blob: This layer contains information about blobs, which are uniquely identified by a blob_ID. It maintains the blob metadata that is vital for achieving the availability and reliability of the system.

We can make routing, storage, and sharding decisions on the basis of these layers. The table below summarizes the layers.

Layered Information

| Level | Uniquely identified by | Information | Sharded by | Mapping |
|---|---|---|---|---|
| User Blob Store Account | account_ID | list of container_IDs | account_ID | account -> list of containers |
| Container | container_ID | list of blob_IDs | container_ID | container -> list of blobs |
| Blob | blob_ID | {list of chunks, chunkInfo: data node IDs,.. } | blob_ID | blob -> list of chunks |

We generate unique IDs for user accounts, containers, and blobs using a uniqueID generator.
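The three layers and their mappings can be sketched as plain data structures. This is only an illustrative sketch: the class names, the `new_id` helper (a stand-in for the unique ID generator, here backed by `uuid4`), and the hash-based `shard_for` rule are assumptions made for the example, not details given in the lesson.

```python
import hashlib
import uuid
from dataclasses import dataclass, field


def new_id() -> str:
    # Stand-in for the unique ID generator (assumption: uuid4 is used
    # here only for brevity; the lesson does not specify the scheme).
    return uuid.uuid4().hex


@dataclass
class Blob:
    blob_id: str = field(default_factory=new_id)
    # blob -> list of chunks (each entry could also carry data node IDs)
    chunks: list = field(default_factory=list)


@dataclass
class Container:
    container_id: str = field(default_factory=new_id)
    # container -> list of blobs
    blob_ids: list = field(default_factory=list)


@dataclass
class Account:
    account_id: str = field(default_factory=new_id)
    # account -> list of containers
    container_ids: list = field(default_factory=list)


def shard_for(key: str, num_shards: int) -> int:
    # Hypothetical sharding rule: hash the layer's own ID onto a shard,
    # so each layer is sharded by the ID that identifies it.
    digest = hashlib.sha256(key.encode()).hexdigest()
    return int(digest, 16) % num_shards


# Wire one blob into one container owned by one account.
account = Account()
container = Container()
blob = Blob()
account.container_ids.append(container.container_id)
container.blob_ids.append(blob.blob_id)
```

Because each layer is keyed by its own ID, a lookup can be routed one level at a time: `shard_for(account.account_id, n)` finds the account's shard, and the same call on a `container_id` or `blob_id` finds the next one.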

Besides storing the actual blob data, we have to maintain some metadata for managing the blob storage. Let’s see what that metadata is.

Blob metadata

When a user uploads a blob, it is split into small chunks. A chunk is the minimum unit of data for writing and reading. Chunking lets us store large files that can't fit in one contiguous location, in one data node, or in one block of a disk associated with the data node. The chunks of a single blob are then stored on different data nodes that have enough free space to hold them. Billions of blobs are kept in storage. The master node has to ...
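The chunking step described above can be sketched as follows. This is a minimal illustration under stated assumptions: the tiny `CHUNK_SIZE`, the `DataNode` class, and the "pick the node with the most free space" policy in `place_chunks` are all hypothetical choices for the example, not the placement policy of any particular blob store.

```python
from dataclasses import dataclass

CHUNK_SIZE = 4  # bytes; deliberately tiny so the example is easy to follow


@dataclass
class DataNode:
    node_id: str
    free_bytes: int


def split_into_chunks(data: bytes, chunk_size: int = CHUNK_SIZE) -> list:
    # Cut the blob into fixed-size pieces; the last piece may be shorter.
    return [data[i:i + chunk_size] for i in range(0, len(data), chunk_size)]


def place_chunks(chunks: list, nodes: list) -> list:
    # Assumed policy: give each chunk to the node with the most free
    # space that can still hold it, and record (chunk index, node ID).
    placement = []
    for index, chunk in enumerate(chunks):
        candidates = [n for n in nodes if n.free_bytes >= len(chunk)]
        if not candidates:
            raise RuntimeError(f"no data node has room for chunk {index}")
        node = max(candidates, key=lambda n: n.free_bytes)
        node.free_bytes -= len(chunk)
        placement.append((index, node.node_id))
    return placement


nodes = [DataNode("node-a", 8), DataNode("node-b", 8), DataNode("node-c", 4)]
chunks = split_into_chunks(b"hello blob!")
placement = place_chunks(chunks, nodes)
```

The resulting `placement` list is the kind of chunk-to-data-node mapping that would be recorded as blob metadata, so the blob can later be reassembled by reading its chunks back in index order.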
