System Design Deep Dive: Real-World Distributed Systems/

...

/

Concurrency Control in Spanner

In this lesson, we’ll learn how TrueTime guarantees correctness properties around concurrency control In information technology and computer science, especially in the fields of computer programming, operating systems, multiprocessors, and databases, the concurrency control ensures that correct results for concurrent operations are generated, while getting those results as quickly as possible. [source: Wikipedia]and how those properties ensure the following:

1. Externally consistentExternal consistency is a property of transaction-processing systems, where clients dynamically synthesize transactions that contain multiple read and write operations on arbitrary objects. [source: Google Cloud] transactions

2. Reads from the past that do not block any ongoing transaction

3. Read-only transactions that do not require any lock

These features make it possible to ensure that at any time, t, the full database audit read will see the exact results of all transactions that are committed as of t.

We will use Paxos write and Spanner clients write to differentiate between both writes.

Timestamp management

The implementation of Spanner supports reads from a snapshot and read-only and read-write transactions. The system implements the read-write transactions as standalone writes, whereas the read-only transactions are implemented as non-snapshot standalone reads. The following table shows the operation types Spanner supports and their respective concurrency control.

Read-only transactions

The advantages of snapshot isolation are utilized by the read-only transaction. A read-only transaction requires pre-declaration of being read-only and not having any writes, since it is not a read-write transaction without any writes. Spanner executes reads within a read-only transaction without locking at a timestamp $t$ which is determined by the system. Executing the read at $t$ avoids blocking the incoming writes. This is an example of any suitably up-to-date replica that may handle the reads in a read-only transaction.

Once a timestamp has been chosen, the commit is unavoidable for both snapshot reads and read-only transactions. The commit does not happen if the data has been garbage collected at that timestamp. Therefore, clients can escape a retry loop's buffering of results by receiving them immediately. Clients can continue their queries on other servers if a server goes down by sending the same timestamp and reading location.

Paxos leader leases

Timed leases are used in our Paxos implementation for sustainable leadership, that is, by default, 10 seconds. A node sends the request for timed lease votes to other participants to become a leader. When that node has a quorum of lease votes from other participants, it means it has the lease to become a leader.

For leader election, Paxos uses the Bully algorithm. If the proposer is not proposing to be a leader, any node can aim to be a leader. For a particular node to be accepted by the cluster, the node sends its server IDs to all peers within the cluster. The peers respond with their server ID. If all responses are smaller than the node's server ID, that node becomes the leader.

After a successful write, a replica's lease vote is automatically extended, and the leader will ask for a lease vote extension if it is about to expire. The lease interval of the leader starts when it receives the necessary number of votes to constitute a quorum, and the lease interval will end when the leader no longer has a quorum of votes. Within a Paxos group, a Paxos leader can change over time for different reasons. Spanner enforces a disjointness invariant, which implies that lease intervals of leaders are disjointed across all leaders of the shard over time.

Spanner implementation allows an abdication of a Paxos leader by freeing the lease votes of the participants. Spanner limits the conditions under which abdication is allowed to ensure that the disjointness invariant is maintained. The conditions are as follows:

Consider that the Paxos leader defines the maximum timestamp as $S_{max}$ .
Before abdication, the Paxos leader should wait for $TT.after(s_{max})$ .

The abdication of the Paxos leader is explained in the following slides:

To work, for each Paxos group, we require a disjointness invariant to be true, which means that the lease intervals of each Paxos group leader should be disjointed from other Paxos group leaders.

Read and write transactions

The read and write transactions use two-phase locking. The timestamp can be assigned to a transaction after acquiring all locks and before releasing them. Spanner assigns a timestamp for any given transaction. Spanner enforces the following invariants:

The disjointness invariant
The external consistency invariant

Spanner assigns timestamps to Paxos writes in increasing order in each Paxos group. The assignment is in monotonically ordered across leaders too. The assignment of timestamps in monotonically ascending order is a simple task for a single leader replica. Using the disjointness invariant, all leaders must comply with this rule: a leader can assign timestamps only during their leader lease. It means when a timestamp $s$ is assigned to Paxos leader, $s_{max}$ is incremented to maintain disjointness.

In the external consistency invariant, if the $T2$ transaction is started after $T1$ 's commit, we can infer from it that $T2$ 's commit timestamp should be greater than $T1$ 's commit timestamp. Consider the following for the transaction $T_{i}$ :

Start of $T_{i}$ : $e_{i}^{start}$
Commit of ...

Operation	Concurrency Control
Read-Write Transaction	Pessimistic
Read-Only Transaction	Lock-free
Snapshot Read, client-provided timestamp	Lock-free
Snapshot Read, client-provided bound	Lock-free

Prologue

File Systems

Google File System (GFS)

Google Colossus File System

Facebook's Tectonic File System

Databases

Google Bigtable

Google Megastore

Google Spanner

Key-value Stores

Many-core Key-value Store

Scaling Memcache

SILT

Amazon DynamoDB

Concurrency Management

Two-phase Locking (2PL)

Google Chubby Locking Service

ZooKeeper

Big Data Processing: Batch to Stream Processing

MapReduce

Spark

Kafka

Consensus

Understanding Consensus: Two Generals, FLP, & Byzantine Generals

Two-phase Commit

State Machine Replication

Paxos

Raft

Epilogue

Concurrency Control in Spanner

Timestamp management

Operation Types and Their Concurrency Control

Read-only transactions

Paxos leader leases

Read and write transactions