Detailed Design of GFS
Study the underlying assumptions of GFS, the design pattern, and the reasons behind different design decisions.
Assumptions
GFS was designed to match the needs of Google’s own applications. It is based on the following observations of, and assumptions about, those applications:
- The system is made up of low-cost commodity machines that fail frequently. The system must continuously monitor itself to detect, tolerate, and promptly recover from frequent component failures.
- Multi-GB files are the common case, so the system must manage huge files more efficiently than small ones. The system is expected to store a few million files, each typically 100 MB or larger.
- Large streaming reads and sequential writes are the common access patterns for the targeted applications; individual streaming operations typically read or write 1 MB or more. The system should handle these operations efficiently. Small random reads and writes (a few KB each) must also be supported, although the system doesn’t need to be optimized for them.
- Hundreds of clients can concurrently append data to the same file. The system should provide well-defined semantics for concurrent appends.
- Most target applications process bulk data at a high rate, which calls for sustained high throughput rather than low latency. A few applications, however, do require low latency for individual read and write operations.
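The "few million files, 100 MB or larger" assumption has a direct consequence for the design: the metadata describing all of those files can be small enough to keep in one machine's memory. The sketch below runs the back-of-envelope numbers, assuming (as the original GFS paper states, though not this lesson) a 64 MB chunk size and less than 64 bytes of master metadata per chunk; both constants are assumptions here.

```python
import math

# Assumed constants (from the GFS paper, not from this lesson's text):
CHUNK_SIZE_MB = 64            # each file is split into fixed-size chunks
METADATA_BYTES_PER_CHUNK = 64  # master keeps <64 bytes of metadata per chunk

def master_metadata_mb(num_files, avg_file_mb):
    """Estimate the master's in-memory metadata footprint in MB."""
    chunks_per_file = math.ceil(avg_file_mb / CHUNK_SIZE_MB)
    total_chunks = num_files * chunks_per_file
    return total_chunks * METADATA_BYTES_PER_CHUNK / (1024 * 1024)

# A few million files of ~100 MB each, per the assumptions above:
print(master_metadata_mb(3_000_000, 100))  # roughly 366 MB of metadata
```

Even for three million files, the metadata is only a few hundred megabytes, which is why a single master holding all metadata in RAM is a workable design under these assumptions.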
Design pattern
In the first lesson, we discussed the GFS architecture at a high level with the following illustration. Let's discuss it in detail, considering the assumptions above, and see why GFS was designed this way.
A single storage server, even with a large amount of storage space, can't cope with the following requirements:
- Storing millions of files read and written by hundreds of clients simultaneously
- High throughput for applications processing huge datasets
We have already seen the limitations of single-server systems in the first lesson. As per our assumptions, we have to build our system with commodity machines, so we can't set up storage area networks (SANs), which require specialized, very expensive hardware. In GFS, we use a large number of commodity machines to share the request and data load. Each machine runs a Linux file system to manage the storage attached to it.
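To make the load-sharing idea concrete, here is a minimal sketch of spreading one file's chunks across several machines so that no single server handles all of its reads. The placement function, server count, and file name are all hypothetical illustrations, not GFS's actual placement policy (which also considers replication, disk utilization, and rack placement).

```python
import hashlib

# Hypothetical cluster size for illustration only.
NUM_CHUNKSERVERS = 5

def placement(file_name, num_chunks):
    """Toy placement: assign each chunk of a file to a chunkserver.

    Deterministic hash-plus-offset assignment, used here only to show
    that consecutive chunks land on different machines.
    """
    start = int(hashlib.md5(file_name.encode()).hexdigest(), 16)
    return [(i, (start + i) % NUM_CHUNKSERVERS) for i in range(num_chunks)]

# An 8-chunk file ends up spread over the cluster, so clients reading
# it stream from several machines in parallel instead of one.
for chunk_index, server in placement("web-crawl-00.log", 8):
    print(f"chunk {chunk_index} -> chunkserver {server}")
```

Because consecutive chunks map to different chunkservers, a large streaming read of this file pulls data from many machines at once, which is how aggregate throughput scales with the cluster rather than being capped by one server's disk and network.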
To manage ...