Data consistency

In a distributed file system, we keep multiple copies (replicas) of file data for availability. If such a system allows data mutations (a data mutation is a change in the data via random writes and appends), inconsistencies among replicas can appear for various reasons. For example, a node failure can cause a mutation to fail on one of the replicas: that replica will contain stale data while the others have been updated. Clients reading from different replicas will then get different data, and the consequences depend on the specific use case. Therefore, a file system that allows mutations should ensure data consistency.
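The failure scenario above can be illustrated with a small simulation. This is a minimal sketch, not GFS code: the `Replica` class, its `alive` flag, and the `apply_mutation` helper are hypothetical names introduced only for this example, and each replica is modeled as an in-memory byte buffer.

```python
from dataclasses import dataclass, field


@dataclass
class Replica:
    """Hypothetical in-memory replica holding one file region."""
    name: str
    data: bytearray = field(default_factory=lambda: bytearray(b"old-data"))
    alive: bool = True

    def write(self, offset: int, payload: bytes) -> bool:
        """Apply a write in place; return False if the node is down."""
        if not self.alive:
            return False
        self.data[offset:offset + len(payload)] = payload
        return True


def apply_mutation(replicas, offset, payload):
    """Send the same mutation to every replica and record which ones applied it."""
    return {r.name: r.write(offset, payload) for r in replicas}


replicas = [Replica("r1"), Replica("r2"), Replica("r3")]
replicas[2].alive = False  # simulate a node failure on r3

results = apply_mutation(replicas, 0, b"new")
reads = {r.name: bytes(r.data) for r in replicas}

print(results)  # r3 reports that the mutation was not applied
print(reads)    # r1 and r2 hold the new bytes, r3 still holds the old ones
```

A client that happens to read from `r3` gets the stale bytes, while a client reading from `r1` or `r2` gets the updated ones, which is exactly the divergence described above.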

Even if the data is consistent among all replicas, we can still face another problem due to concurrent writes to the same region. Since the file system allows multiple writers to write to the same region, the file region may end up containing mixed fragments from multiple writes. A file system should handle all these cases and provide its users with guarantees so that they don't see unexpected results.

We could implement a strong consistency model to keep the data consistent among all replicas. However, our system has to serve many requests at a time, and strong consistency might compromise the system's performance. We need to provide consistency together with good scalability and good performance.

GFS's data consistency model is one of the most involved parts of the system, which also makes it one of the harder parts to understand. In this lesson, we will see what data consistency guarantees GFS offers its users and how it meets their requirements. Before this, we need to know the possible states of a file region after a data mutation. Let's define these states first.
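As a brief aside before the definitions, the concurrent-write problem mentioned above can be sketched in a few lines. The example assumes that each writer's data reaches the region as several smaller pieces and that the pieces of two writers happen to interleave; the helper names `split` and `apply_in_order`, the piece size, and the chosen interleaving are all illustrative assumptions, not GFS behavior taken from the text.

```python
def split(payload: bytes, piece_size: int):
    """Break one write into (offset, piece) operations."""
    return [(i, payload[i:i + piece_size]) for i in range(0, len(payload), piece_size)]


def apply_in_order(region_size: int, ops):
    """Apply (offset, piece) operations to a fresh region in the given order."""
    region = bytearray(b"." * region_size)
    for offset, piece in ops:
        region[offset:offset + len(piece)] = piece
    return bytes(region)


writer_a = split(b"AAAAAAAA", 4)  # two pieces, at offsets 0 and 4
writer_b = split(b"BBBBBBBB", 4)  # two pieces, at the same offsets

# One possible interleaving of the two writers' pieces (an assumed order).
interleaved = [writer_a[0], writer_b[0], writer_b[1], writer_a[1]]

# Every replica applies the same order, so the replicas agree with each other.
replica_states = [apply_in_order(8, interleaved) for _ in range(3)]

print(replica_states[0])              # b'BBBBAAAA'
print(len(set(replica_states)) == 1)  # True: all replicas hold the same bytes
```

All three replicas end up identical, yet the region contains neither writer's data in full. Agreement among replicas alone therefore does not tell a reader whose write they are looking at.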

States of a file region after data mutation

The state of a file region can be consistent or inconsistent, and defined or undefined after a data mutation.

Consistent: A file region is consistent if all clients see the same data after a mutation, regardless of which replica they read from. In the illustration below, the left part shows that all the replicas have the same data, so clients reading from any replica will read the same data.
Inconsistent: A file region is inconsistent if clients see different data depending on which replica they read from. In the illustration below, the right part shows that one of the replicas has different data than the ...
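The consistent and inconsistent states can be expressed as a simple check over a byte range. This is a rough sketch under the assumption that each replica can be modeled as a `bytes` object; `read_region` and `is_consistent` are hypothetical helpers, not part of any GFS interface.

```python
def read_region(replica: bytes, offset: int, length: int) -> bytes:
    """Return the bytes a client would read from one replica for this region."""
    return replica[offset:offset + length]


def is_consistent(replicas, offset: int, length: int) -> bool:
    """A region is consistent when every replica returns the same bytes for it."""
    views = {read_region(r, offset, length) for r in replicas}
    return len(views) == 1


same = [b"abcdef", b"abcdef", b"abcdef"]
diverged = [b"abcdef", b"abcdef", b"abXdef"]  # third replica differs at offset 2

print(is_consistent(same, 0, 6))      # True
print(is_consistent(diverged, 0, 6))  # False
print(is_consistent(diverged, 3, 3))  # True: the region "def" matches everywhere
```

The last call shows that consistency is a property of a file region rather than of a whole file: the same set of replicas can be consistent over one byte range and inconsistent over another.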
