Case Study: S3

How do we design a modern service like S3?

What is a blob store?

A blob store is a binary object store that lets developers store unstructured data as key-value pairs in the cloud. Blobs are grouped into containers (buckets) that are tied to user accounts. Each bucket is like a new database, with keys being folder-like paths and values being the binary objects (files). The data can be accessed from anywhere in the world and can include audio, video, and text.
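A minimal in-memory sketch of this model (all names here are illustrative, not part of any real API): buckets are containers, keys look like folder paths, and values are raw bytes.

```python
# In-memory model of a blob store: bucket name -> {key -> binary object}.
store = {}

def create_bucket(bucket):
    # Each bucket is an independent namespace, like a new database.
    store.setdefault(bucket, {})

def put_object(bucket, key, data):
    # The key is a folder-like path; the value is the binary object (file).
    store[bucket][key] = data

def get_object(bucket, key):
    return store[bucket][key]

create_bucket("photos")
put_object("photos", "2024/vacation/beach.jpg", b"\x89PNG...")
print(get_object("photos", "2024/vacation/beach.jpg"))
```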

Example: Amazon S3

We will focus on the design of Amazon S3.

Amazon S3: Requirements & challenges

Functional requirements

  • Multitenancy
    • Multiple people can create multiple accounts and upload their files into the system
    • We don’t want our system to be separately deployed for every person or every account. The same system should be available for every customer.
    • The users should be able to view all their files on their respective consoles.
    • A single system should handle all the customers.
  • Virtual-hosted-style access to files or data (the bucket name appears in the URL's hostname)
  • Path-style access to files or data (the bucket name appears in the URL's path)
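These two access styles differ only in where the bucket name appears in the request URL. A small sketch (the helper names are illustrative; the URL formats follow Amazon's documented conventions):

```python
def virtual_hosted_url(bucket, region, key):
    # Virtual-hosted style: the bucket name is part of the hostname.
    return f"https://{bucket}.s3.{region}.amazonaws.com/{key}"

def path_style_url(bucket, region, key):
    # Path style: the bucket name is part of the URL path.
    return f"https://s3.{region}.amazonaws.com/{bucket}/{key}"

print(virtual_hosted_url("my-bucket", "us-east-1", "photos/cat.jpg"))
# https://my-bucket.s3.us-east-1.amazonaws.com/photos/cat.jpg
print(path_style_url("my-bucket", "us-east-1", "photos/cat.jpg"))
# https://s3.us-east-1.amazonaws.com/my-bucket/photos/cat.jpg
```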

Non-functional requirements

  • Durability (99.999%): the system should be highly durable and should not lose users’ files.
  • Availability (99.99%): the system should be highly available.
  • Scalability: The system should scale with the increasing amount of data uploads so that customers can keep on uploading the data.
  • Region-specific buckets
    • Allow the users to create a bucket in a specific region and upload files to that bucket.
    • Allow the users to access the bucket from the same region.
  • Security: provide a secure layer of access (SSL/TLS)

Distributed systems design principles used to meet S3 requirements

  • Decentralization

    • to remove scaling bottlenecks
    • to avoid single points of failure
  • Asynchrony

    • To let the system make progress under all circumstances
  • Autonomy

    • To give each system component the independence to make decisions based on its local information.
  • Local responsibility

    • Each component is responsible for achieving its consistency; this is never the burden of its peers.
  • Controlled concurrency

    • Operations are designed such that no or limited concurrency control is required.
  • Failure tolerance

    • The system considers the failure of components to be a normal mode of operation and continues operation with no or minimal interruption.
  • Controlled parallelism

    • Abstractions used in the system are of such granularity that parallelism can be used to improve the performance and robustness of recovery or the introduction of new nodes.
  • Decompose into small well-understood building blocks

    • Do not try to provide a single service that does everything for everyone, but instead build small components that can be used as building blocks for other services.
  • Symmetry

    • Nodes in the system are identical in terms of functionality and require no or minimal node-specific configuration to function.
  • Simplicity

    • The system should be made as simple as possible (but no simpler).

How does S3 store unstructured data?

To store your data in Amazon S3, you work with resources known as buckets and objects. A bucket is a container for objects. An object is a file and any metadata that describes that file.

To store an object in Amazon S3, you create a bucket and then upload the object to a bucket. When the object is in the bucket, you can open it, download it, and move it. When you no longer need an object or a bucket, you can clean up your resources.

The simplest blob store

A single server with a few terabytes of storage (say, 30 TB) and a set of APIs exposed to the client to create a bucket, upload files to a bucket, read files from a bucket, list the files in a bucket, and so on.

The client can create the bucket using the create bucket API, which makes a folder for that bucket on the hard disk.

After creating the bucket, the client can upload files to it using the upload files API. The files are written to the folder created on the hard disk.

Similarly, clients can access files using the read files API and list the files inside a bucket using the list files API.
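The single-server design above maps directly onto the filesystem: a bucket is a folder on the server's disk, and each object is a file inside it. A minimal sketch (function names are illustrative, not a real API):

```python
import os
import tempfile

ROOT = tempfile.mkdtemp()  # stands in for the server's hard disk

def create_bucket(bucket):
    # create bucket API: make a folder for the bucket on the hard disk
    os.makedirs(os.path.join(ROOT, bucket), exist_ok=True)

def upload_file(bucket, key, data):
    # upload files API: write the file into the bucket's folder
    with open(os.path.join(ROOT, bucket, key), "wb") as f:
        f.write(data)

def read_file(bucket, key):
    # read files API
    with open(os.path.join(ROOT, bucket, key), "rb") as f:
        return f.read()

def list_files(bucket):
    # list files API
    return sorted(os.listdir(os.path.join(ROOT, bucket)))

create_bucket("photos")
upload_file("photos", "cat.jpg", b"\xff\xd8...")
print(list_files("photos"))  # ['cat.jpg']
```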

Problem with the simplest blob store

  • The APIs can handle only a limited number of client requests because the server’s hard disk limits the I/O speed of data transfer between the disk and RAM.

  • A single server can handle only a limited number of API calls.

  • A single server can’t handle heavy write traffic because the hard disk becomes a bottleneck.

This simplest blob store system is not scalable.

Brute-force solution: horizontal scaling

One possible solution to scale the simplest blob store is to add one more server, which doubles the

  • traffic-handling capacity
  • storage

Now we have a total of 60 TB of storage.

Problem with the horizontally scaled blob store

Horizontal scaling of the simplest blob store is not a reliable solution because a file is not necessarily accessed through the same server it was uploaded to.

For example, a client’s create bucket request might be handled by server 1 while the upload files request is handled by server 2. Server 2 would not find the bucket in its own storage and could not upload the files. Similarly, if the files were successfully uploaded in case ...
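The mismatch can be sketched with two servers that keep independent local storage behind a round-robin load balancer (all names here are hypothetical):

```python
import itertools

servers = [{}, {}]  # each server has its own independent "disk"
rr = itertools.cycle(range(len(servers)))  # round-robin load balancer

def handle_create_bucket(bucket):
    i = next(rr)
    servers[i][bucket] = {}
    return i

def handle_upload(bucket, key, data):
    i = next(rr)
    if bucket not in servers[i]:
        # The bucket was created on a different server, so this upload fails.
        return i, "error: no such bucket"
    servers[i][bucket][key] = data
    return i, "ok"

print(handle_create_bucket("photos"))            # handled by server 0
print(handle_upload("photos", "a.jpg", b"..."))  # server 1 rejects: bucket only exists on server 0
```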
