Database Operations in Spanner

Learn how read-write, read-only, and schema-change transactions work in detail.

We'll cover the following...

Read-write transactions
- Two-phase commit in Spanner
  - Non-coordinator role
  - Coordinator role
Read-only transactions
- Read within one Paxos group
- Read within multiple Paxos groups
Schema-change transactions
Quiz

In this lesson, we will learn how read-write, read-only, and schema-change transactions work on top of Spanner's timestamping mechanism.

Read-write transactions

A transaction's writes are buffered on the client until commit. As a result, reads inside a transaction do not see the effects of that transaction's own writes. This design works well in Spanner because a read returns the timestamps of the data it reads, and uncommitted writes have not been assigned timestamps yet.
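To make the buffering behavior concrete, here is a minimal sketch. The `BufferedTransaction` class, the in-memory `store` dictionary, and the integer timestamps are all illustrative assumptions, not Spanner's client API.

```python
class BufferedTransaction:
    """Toy read-write transaction that buffers its writes on the client."""

    def __init__(self, store):
        self.store = store        # committed data: key -> (value, commit_timestamp)
        self.write_buffer = {}    # uncommitted writes: key -> value (no timestamp yet)

    def read(self, key):
        # Reads see committed data only; the write buffer is never consulted.
        return self.store.get(key)

    def write(self, key, value):
        # Nothing leaves the client until commit.
        self.write_buffer[key] = value

    def commit(self, commit_timestamp):
        # All buffered writes become visible together under one timestamp.
        for key, value in self.write_buffer.items():
            self.store[key] = (value, commit_timestamp)
        self.write_buffer.clear()


store = {"x": ("old", 10)}
txn = BufferedTransaction(store)
txn.write("x", "new")
before_commit = txn.read("x")    # still the committed version
txn.commit(commit_timestamp=20)
after_commit = store["x"]
```

In this sketch, `before_commit` holds the old committed value together with its timestamp, while the buffered write only receives a timestamp at commit.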

The following slides explain the read and write transactions.

Reads within read-write transactions use the wound-wait scheme to avoid deadlocks. The client sends each read to the leader replica of the appropriate Paxos group, which acquires the necessary read locks and then reads the most recent data. While the transaction remains open, the client periodically sends keepalive messages so that the participant leaders do not time it out. Once the client has completed all reads and buffered all writes, it begins the two-phase commit.
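The wound-wait rule itself is small enough to sketch. The function below is a generic illustration of the scheme, assuming each transaction carries a priority timestamp where a smaller value means an older transaction; the name `resolve_conflict` and the string results are hypothetical.

```python
def resolve_conflict(requester_ts, holder_ts):
    """Wound-wait: a smaller timestamp means an older, higher-priority transaction."""
    if requester_ts < holder_ts:
        # The older requester "wounds" (aborts) the younger lock holder.
        return "wound"
    # A younger requester waits for the older lock holder to finish.
    return "wait"


older_requests = resolve_conflict(requester_ts=5, holder_ts=9)
younger_requests = resolve_conflict(requester_ts=9, holder_ts=5)
```

Because a transaction only ever waits for an older one, the waits-for relation cannot form a cycle, which is why the scheme avoids deadlocks.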

Two-phase commit in Spanner

Spanner uses two-phase commit (2PC) to commit a transaction atomically across all the Paxos groups it touches. The 2PC begins once the client has finished all its reads and buffered all its writes.

If the participants in a 2PC are physically close to one another, data propagation latency is lower. Spanner runs two-phase locking and 2PC on the Paxos leaders to keep transactions serializable. The client chooses one of the participating Paxos groups as the coordinator group, and that group's leader acts as the 2PC coordinator. The leaders of the other groups are non-coordinator participants. The client sends a commit message to each participant leader that carries the coordinator's identity and the buffered writes.
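As a rough sketch of the client's side of this step, the function below groups buffered writes by Paxos group and builds one commit message per group. The message layout, the `group_of` mapping, and the rule for picking the coordinator (smallest group ID) are assumptions made for illustration; the lesson does not say how the client chooses.

```python
def build_commit_messages(buffered_writes, group_of):
    """Build one commit message per Paxos group touched by the transaction."""
    writes_by_group = {}
    for key, value in buffered_writes.items():
        writes_by_group.setdefault(group_of(key), {})[key] = value

    # Assumption: pick the smallest group ID as the coordinator group.
    coordinator = min(writes_by_group)

    # Every participant leader learns who the coordinator is and gets its own writes.
    return {
        group: {"coordinator": coordinator, "writes": writes}
        for group, writes in writes_by_group.items()
    }


messages = build_commit_messages(
    {"a1": 1, "a2": 2, "b1": 3},
    group_of=lambda key: key[0],   # hypothetical mapping: first letter names the group
)
```

Here group `a` becomes the coordinator, and group `b` receives a message that names `a` and carries only the write to `b1`.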

In classic 2PC, a coordinator crash can leave the protocol stuck. To tolerate such failures, Spanner stores the 2PC state of both the coordinator and the participants in the Paxos state machine. If a leader goes down in the middle of a 2PC round, the new leader has all the information it needs to complete the commit.
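The recovery idea can be illustrated with a replayable log. This sketch assumes a plain Python list stands in for the Paxos-replicated log and that each record has hypothetical `txn` and `state` fields; it is not Spanner's actual log format.

```python
def recover_2pc_state(replicated_log):
    """Replay the log in order; the last record per transaction wins."""
    state = {}
    for record in replicated_log:
        state[record["txn"]] = record["state"]
    return state


# Records that the failed leader had already replicated before crashing.
log = [
    {"txn": "t1", "state": "prepared"},
    {"txn": "t2", "state": "prepared"},
    {"txn": "t1", "state": "committed"},
]

recovered = recover_2pc_state(log)
in_flight = [txn for txn, s in recovered.items() if s == "prepared"]
```

After replay, the new leader knows that `t1` is finished and that `t2` is still prepared, so it can carry that round through to completion.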

Non-coordinator role

A participant leader that isn't the coordinator first acquires write locks. It then chooses a prepare timestamp that is larger than any timestamp it has assigned to prior transactions, which preserves monotonicity, and logs a prepare record via Paxos. After that, each participant sends its prepare timestamp to the coordinator.
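A minimal sketch of the monotonic prepare-timestamp choice follows. The `ParticipantLeader` class, the integer timestamps, the `now` parameter, and the in-memory `paxos_log` list are all illustrative assumptions.

```python
class ParticipantLeader:
    """Toy non-coordinator leader that hands out monotonic prepare timestamps."""

    def __init__(self):
        self.last_assigned = 0   # largest timestamp this leader has assigned so far
        self.paxos_log = []      # stand-in for the Paxos-replicated log

    def prepare(self, txn_id, now):
        # Strictly larger than every earlier timestamp, even if `now` is behind.
        prepare_ts = max(now, self.last_assigned + 1)
        self.last_assigned = prepare_ts
        # The prepare record is logged before the timestamp is reported.
        self.paxos_log.append(("prepare", txn_id, prepare_ts))
        return prepare_ts        # this value is sent to the coordinator


leader = ParticipantLeader()
first = leader.prepare("t1", now=100)
second = leader.prepare("t2", now=90)   # clock reading is behind the last timestamp
```

Even though the second call sees a smaller clock reading, it still receives a timestamp above the first one.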

Coordinator role

The coordinator leader also acquires write locks first, but it skips the prepare step. After hearing from all the other participant leaders, it selects a single timestamp for the entire transaction. Let's denote the commit timestamp as $s$. It must be:

- Greater than or equal to all prepare timestamps, to satisfy the invariants of read-write transactions
- Greater than $TT.now().latest$ at the time the coordinator received the commit message from the client
- Greater than all the timestamps that the coordinator leader has assigned to previous transactions

All of the above help maintain invariants like monotonicity and constraints of read-write transactions.
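The three conditions above can be combined into a single `max`, as in the sketch below. It assumes integer timestamps, and the function name and parameters are hypothetical: `tt_now_latest` stands for the $TT.now().latest$ reading associated with the commit message, and `last_assigned` for the largest timestamp the coordinator leader has handed out so far.

```python
def choose_commit_timestamp(prepare_timestamps, tt_now_latest, last_assigned):
    """Pick the smallest integer timestamp s that meets all three conditions."""
    return max(
        max(prepare_timestamps, default=0),  # s >= every prepare timestamp
        tt_now_latest + 1,                   # s > TT.now().latest
        last_assigned + 1,                   # s > every earlier timestamp from this leader
    )


s_prepare_bound = choose_commit_timestamp([105, 110], tt_now_latest=108, last_assigned=100)
s_history_bound = choose_commit_timestamp([105], tt_now_latest=120, last_assigned=130)
```

In the first call, the largest prepare timestamp is the binding constraint; in the second, the leader's own history is.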

Another constraint is commit wait: the coordinator leader waits until $TT.after(s)$ ...
