Design of Google Colossus (GFS-II)
Learn how Google scaled its file system from petabytes to exabytes with Colossus, an extension of the Google File System (GFS).
New requirements
Colossus is a file system developed by Google intended to tackle the escalating data requirements that were beyond GFS’s capabilities.
Question: Which component of GFS do you think was the limiting factor that led Google to design Colossus?
Scalability to exabytes
GFS was built to store a few million files totaling hundreds of terabytes of data. At that scale, a single manager with some workload optimizations was sufficient. The single-manager design also let Google deliver a working file system to its users far sooner than a distributed-manager design would have allowed, because distributed designs are complex and take time to get right.
All of Google’s applications, including Google Cloud Storage services, use Google’s own file system. The growing number of Google applications and Cloud users has led to massive growth in data needs, requiring a file system that can scale to exabytes (thousands of petabytes). Metadata storage requirements grow with the volume of data, and GFS can’t scale to exabytes because all of its metadata is handled by a single manager.
As more and more applications generated data at an accelerating rate, GFS could no longer manage data at that scale as the underlying file system for all of them.
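To make the metadata argument concrete, here is a rough back-of-the-envelope sketch in Python. It assumes two figures taken from the published GFS design rather than from this lesson: 64 MB chunks and at most about 64 bytes of manager metadata per chunk. The function name `chunk_metadata_bytes` is hypothetical, and the estimate ignores per-file namespace metadata and replica bookkeeping, so treat the results as a lower bound rather than a measurement.

```python
# Rough estimate of the memory a single manager needs for chunk metadata.
# Assumptions (from the published GFS design, not from this lesson):
#   - each chunk is 64 MiB
#   - the manager keeps at most ~64 bytes of metadata per chunk
CHUNK_SIZE = 64 * 2**20        # 64 MiB in bytes
METADATA_PER_CHUNK = 64        # bytes, assumed upper bound

def chunk_metadata_bytes(total_data_bytes: int) -> int:
    """Return the chunk metadata (in bytes) needed to track the given data volume."""
    chunks = -(-total_data_bytes // CHUNK_SIZE)  # ceiling division
    return chunks * METADATA_PER_CHUNK

for label, size in [("300 TiB", 300 * 2**40), ("1 PiB", 2**50), ("1 EiB", 2**60)]:
    meta_mib = chunk_metadata_bytes(size) / 2**20
    print(f"{label}: {meta_mib:,.0f} MiB of chunk metadata")
# 300 TiB: 300 MiB of chunk metadata
# 1 PiB: 1,024 MiB of chunk metadata
# 1 EiB: 1,048,576 MiB of chunk metadata
```

Under these assumptions, hundreds of terabytes need only a few hundred megabytes of metadata, which fits comfortably in one machine’s memory. An exabyte needs about a terabyte, before counting file names and replica locations, which is far more than a single manager can reasonably hold and serve.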
Low latency
At the time GFS was being developed, most workloads needed high throughput rather than low latency. Google therefore focused on building a high-throughput file system, but that design is a poor fit for latency-sensitive applications such as video gaming, online meetings (including Google Meet), and video serving, which require responses in real time. Because the single manager is a single point of failure, the GFS metadata service can be unavailable for up to a minute. This downtime is not a significant problem for batch-oriented applications that require high throughput and can bear a latency of a ...