Design of the Google File System (GFS)
Learn how Google designed its first file system, GFS.
The problem that GFS solved
Google needed a distributed file system that could horizontally scale in terms of storage and read/write
Google File System (GFS) is a distributed file system designed to store and process substantial amounts of data by utilizing a storage cluster of commodity servers. GFS aims to fulfill the following objectives:
In terms of scale, GFS should be able to store large files whose collective size can reach tens of petabytes. Hundreds of concurrent clients might be using the storage system at any given time.
GFS API
GFS supports two specialized operations, record append and snapshot, along with other basic file operations shown in the following illustration. The record append operation allows multiple clients to append small records to a file concurrently, while ensuring that records from different clients are not interleaved, maintaining the consistency of individual records. The snapshot operation allows the clients to create a copy of a file or a directory tree at a low cost.
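To make the record append guarantee concrete, here is a minimal single-process sketch in Python. It is not the GFS API: the class `AppendOnlyFile`, the method `record_append`, and the helper `run_clients` are hypothetical names, and a single in-memory lock stands in for whatever coordination a real distributed system would need. The sketch also assumes that the file, not the client, picks the offset of each record.

```python
import threading


class AppendOnlyFile:
    """Toy in-memory file illustrating record-append semantics (hypothetical, not GFS code)."""

    def __init__(self):
        self._data = bytearray()
        self._lock = threading.Lock()

    def record_append(self, record: bytes) -> int:
        """Append the whole record as one unit and return the offset chosen for it."""
        with self._lock:
            # Assumption: the file decides where the record lands, so two
            # clients can never write into the middle of each other's record.
            offset = len(self._data)
            self._data.extend(record)
            return offset

    def read(self, offset: int, length: int) -> bytes:
        with self._lock:
            return bytes(self._data[offset:offset + length])


def run_clients(num_clients: int = 4, records_per_client: int = 50):
    """Start several client threads that append concurrently to one file."""
    shared_file = AppendOnlyFile()
    placed = []  # (offset, record) pairs reported back to the clients
    placed_lock = threading.Lock()

    def client(client_id: int) -> None:
        for i in range(records_per_client):
            record = f"client{client_id}-rec{i};".encode()
            offset = shared_file.record_append(record)
            with placed_lock:
                placed.append((offset, record))

    threads = [threading.Thread(target=client, args=(c,)) for c in range(num_clients)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return shared_file, placed


if __name__ == "__main__":
    shared_file, placed = run_clients()
    # Every record can be read back intact from the offset it was given.
    intact = all(shared_file.read(off, len(rec)) == rec for off, rec in placed)
    print("all records intact:", intact)
```

The order of records in the file depends on thread scheduling, but each record occupies one contiguous range, which is the property the paragraph above describes.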
The GFS design is optimized for large, batch-oriented workloads. Most files are mutated by appending new data rather than overwriting existing data, so GFS focuses on providing an atomicity guarantee for append operations. The snapshot operation captures a consistent state of the file system, which helps with backup and recovery. The copy-on-write approach used in snapshots ensures that only changes from the original data need to be stored, optimizing space utilization.
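The copy-on-write idea behind snapshots can be sketched with reference counts. The following Python example is an illustration only, under the assumption that a snapshot shares existing chunks and a chunk is copied the first time it is modified after the snapshot. The names `ChunkStore` and `ToyFileSystem`, and the use of small integers as chunk handles, are hypothetical.

```python
class ChunkStore:
    """Toy chunk storage with reference counts (hypothetical, not GFS code)."""

    def __init__(self):
        self._chunks = {}  # handle -> bytes
        self._refs = {}    # handle -> number of files referencing the chunk
        self._next_handle = 0

    def create(self, data: bytes = b"") -> int:
        handle = self._next_handle
        self._next_handle += 1
        self._chunks[handle] = data
        self._refs[handle] = 1
        return handle

    def share(self, handle: int) -> None:
        """Record that one more file points at this chunk."""
        self._refs[handle] += 1

    def append(self, handle: int, data: bytes) -> int:
        """Append to a chunk, copying it first if it is shared.

        Returns the handle the caller should use from now on.
        """
        if self._refs[handle] > 1:
            # Copy-on-write: leave the shared chunk untouched for the snapshot.
            self._refs[handle] -= 1
            handle = self.create(self._chunks[handle])
        self._chunks[handle] += data
        return handle

    def read(self, handle: int) -> bytes:
        return self._chunks[handle]

    def chunk_count(self) -> int:
        return len(self._chunks)


class ToyFileSystem:
    def __init__(self):
        self.store = ChunkStore()
        self.files = {}  # file name -> list of chunk handles

    def write_new_file(self, name: str, chunks: list) -> None:
        self.files[name] = [self.store.create(c) for c in chunks]

    def snapshot(self, src: str, dst: str) -> None:
        """Create dst by sharing src's chunks; no chunk data is copied here."""
        handles = self.files[src]
        for handle in handles:
            self.store.share(handle)
        self.files[dst] = list(handles)

    def append_to_last_chunk(self, name: str, data: bytes) -> None:
        handles = self.files[name]
        handles[-1] = self.store.append(handles[-1], data)

    def read(self, name: str) -> bytes:
        return b"".join(self.store.read(h) for h in self.files[name])


if __name__ == "__main__":
    fs = ToyFileSystem()
    fs.write_new_file("log", [b"aaa", b"bbb"])
    fs.snapshot("log", "log.snap")
    fs.append_to_last_chunk("log", b"ccc")
    print(fs.read("log"), fs.read("log.snap"), fs.store.chunk_count())
```

In this sketch, taking the snapshot stores no new chunk data, and the later append copies only the one chunk it touches, so the snapshot keeps its original contents.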
Let’s delve into the design of GFS that empowers it to accomplish the objectives mentioned above.
Design of GFS
GFS is built on a cluster of commodity servers and basically consists of two programs: a manager program and a chunkserver program. The server that runs the manager program is called the GFS manager, while the server that runs the chunkserver program is called a chunkserver. In GFS, there is a single manager and a large number of chunkservers, as shown in the following illustration.
- The client is a GFS application programming interface (API) through which end users perform directory and file operations.
- A chunk is a data storage unit in GFS. Each file in GFS is split into fixed-size (64 MB) chunks. The manager assigns each chunk a 64-bit globally unique ID called the chunk handle. It also allocates three chunkservers to store three replicas of each chunk. The default replication factor is three, but it is configurable.
- The manager is like an administrator that manages the file system metadata, including namespaces, file-to-chunk mapping, and chunk location. The metadata is stored in the manager’s memory for good performance. For a persistent record of the metadata, the manager logs ...
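The components above can be sketched as a small in-memory model of the manager's metadata. This Python example is a rough illustration, not GFS code: the class `ToyManager` and its methods are hypothetical, handles come from a simple counter, and replicas are placed round-robin. Both of those are assumptions made for the sketch, since the text above does not say how handles or replica locations are chosen. Only the 64 MB chunk size, the 64-bit handle, and the default replication factor of three come from the description.

```python
CHUNK_SIZE = 64 * 1024 * 1024  # fixed chunk size: 64 MB
REPLICATION_FACTOR = 3         # default number of replicas per chunk


class ToyManager:
    """In-memory metadata: file-to-chunk mapping and chunk locations."""

    def __init__(self, chunkservers, replication_factor=REPLICATION_FACTOR):
        if len(chunkservers) < replication_factor:
            raise ValueError("need at least as many chunkservers as replicas")
        self.chunkservers = list(chunkservers)
        self.replication_factor = replication_factor
        self.file_to_chunks = {}   # file path -> ordered list of chunk handles
        self.chunk_locations = {}  # chunk handle -> list of chunkserver names
        self._next_handle = 0      # assumption: handles come from a counter
        self._next_server = 0      # assumption: round-robin replica placement

    def _new_handle(self) -> int:
        handle = self._next_handle % (2 ** 64)  # keep the ID within 64 bits
        self._next_handle += 1
        return handle

    def _pick_replicas(self) -> list:
        n = len(self.chunkservers)
        picked = [self.chunkservers[(self._next_server + i) % n]
                  for i in range(self.replication_factor)]
        self._next_server = (self._next_server + 1) % n
        return picked

    def create_file(self, path: str, size_bytes: int) -> None:
        """Split a file of the given size into chunks and place their replicas."""
        num_chunks = -(-size_bytes // CHUNK_SIZE)  # ceiling division
        handles = []
        for _ in range(num_chunks):
            handle = self._new_handle()
            self.chunk_locations[handle] = self._pick_replicas()
            handles.append(handle)
        self.file_to_chunks[path] = handles

    def locate(self, path: str, byte_offset: int):
        """Translate (file, byte offset) into chunk index, handle, replicas, offset in chunk."""
        handles = self.file_to_chunks[path]
        chunk_index = byte_offset // CHUNK_SIZE
        if byte_offset < 0 or chunk_index >= len(handles):
            raise ValueError("offset is outside the file")
        handle = handles[chunk_index]
        return chunk_index, handle, self.chunk_locations[handle], byte_offset % CHUNK_SIZE


if __name__ == "__main__":
    manager = ToyManager(["cs1", "cs2", "cs3", "cs4"])
    manager.create_file("/logs/a", 200 * 1024 * 1024)  # 200 MB spans 4 chunks
    print(manager.locate("/logs/a", 130 * 1024 * 1024))
```

A 200 MB file needs four 64 MB chunks, and byte offset 130 MB falls 2 MB into the third chunk (index 2), which is the kind of lookup a client would need before contacting a chunkserver.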