Buckets and placement

Spanner adds a layer of abstraction over the bag of key-value mappings: the directory, or bucket, which is a set of contiguous keys that share a common prefix. Supporting buckets lets applications control the locality of their data by choosing keys carefully.
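As a minimal sketch of how key choice determines bucket membership, the Python below groups keys by a shared prefix. The "/" separator, the helper names, and the sample keys are assumptions made for illustration; they are not part of Spanner's interface.

```python
from collections import defaultdict


def bucket_prefix(key: str, sep: str = "/") -> str:
    """Return the part of the key before the first separator (hypothetical convention)."""
    return key.split(sep, 1)[0]


def group_into_buckets(keys):
    """Group keys into buckets, one bucket per shared prefix."""
    buckets = defaultdict(list)
    # Sorting makes keys with the same prefix adjacent, as in an ordered key space.
    for key in sorted(keys):
        buckets[bucket_prefix(key)].append(key)
    return dict(buckets)


keys = ["user42/album1", "user7/album3", "user42/album2", "user7/profile"]
buckets = group_into_buckets(keys)
```

Because an application picks the prefix, it decides which rows end up in the same bucket and are therefore placed and moved together.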

A bucket is the unit of data placement. All data in a bucket shares the same replication settings, and data moves between Paxos groups bucket by bucket. Consider the illustration below. Spanner may move a bucket to shed load from a Paxos group, to place buckets that are frequently accessed together in the same Paxos group, or to put a bucket geographically closer to its accessors. Buckets can be moved while client operations are ongoing. A 50 MB bucket can be expected to move in a few seconds.

Because a Paxos group may hold several buckets, a Spanner tablet differs from a Bigtable tablet: it does not have to be a single lexicographically contiguous partition of the row space. Instead, a Spanner tablet is a container that may hold multiple partitions (row ranges) of the row space. This makes it possible to co-locate several buckets that are frequently accessed together.
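The difference can be sketched as a container of half-open key ranges. The Tablet class and the prefix_range helper below are hypothetical names used only for illustration, and the keys reuse an assumed "prefix/" convention.

```python
from dataclasses import dataclass, field


def prefix_range(prefix: str):
    """Return the half-open range [start, end) covering every key that begins with prefix."""
    # The end bound is the prefix with its last character incremented by one.
    end = prefix[:-1] + chr(ord(prefix[-1]) + 1)
    return prefix, end


@dataclass
class Tablet:
    # Each entry is (start_key inclusive, end_key exclusive); ranges need not be adjacent.
    ranges: list = field(default_factory=list)

    def add_bucket(self, prefix: str) -> None:
        self.ranges.append(prefix_range(prefix))

    def contains(self, key: str) -> bool:
        return any(start <= key < end for start, end in self.ranges)


tablet = Tablet()
tablet.add_bucket("user42/")
tablet.add_bucket("user7/")

in_first = tablet.contains("user42/album1")   # inside the first range
in_second = tablet.contains("user7/profile")  # inside the second range
in_gap = tablet.contains("user5/photo")       # sorts between the two ranges
```

The key "user5/photo" sorts between the two ranges yet is not part of the tablet, which is exactly what a single contiguous partition could not express.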

Relocating buckets

Spanner uses a background task called movedir to relocate buckets across Paxos groups. Movedir is also used to add replicas to, or remove them from, Paxos groups. Movedir is not implemented as a single transaction, so that a bulky data move does not block ongoing reads and writes. Instead, it registers that it is starting to move data and then moves the data in the background. Once all but a nominal amount of the data has been moved, it uses a transaction to atomically move that remaining amount and update the metadata for the two Paxos groups.
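The two phases can be sketched in Python as follows. This is a simplified illustration under stated assumptions, not Spanner's implementation: the PaxosGroup class, the NOMINAL threshold, the chunk size, and the use of a lock to stand in for the final transaction are all hypothetical, and the sketch ignores writes that arrive for rows already copied.

```python
import threading

NOMINAL = 2  # hypothetical: number of rows small enough to move atomically


class PaxosGroup:
    """Toy stand-in for a Paxos group: row data plus bucket metadata."""

    def __init__(self, name):
        self.name = name
        self.data = {}
        self.buckets = set()


def movedir(bucket, keys, src, dst, lock, chunk=2):
    """Move a bucket from src to dst: bulk copy first, then a short atomic step."""
    move_record = {"bucket": bucket, "state": "started"}  # register the move
    remaining = list(keys)

    # Background phase: copy in chunks without holding the lock,
    # so ongoing reads and writes on src are not blocked.
    while len(remaining) > NOMINAL:
        batch, remaining = remaining[:chunk], remaining[chunk:]
        for key in batch:
            dst.data[key] = src.data[key]

    # Final phase: move the nominal remainder and switch metadata atomically.
    with lock:
        for key in remaining:
            dst.data[key] = src.data[key]
        for key in keys:
            del src.data[key]
        src.buckets.discard(bucket)
        dst.buckets.add(bucket)
        move_record["state"] = "done"
    return move_record


src, dst = PaxosGroup("group-a"), PaxosGroup("group-b")
src.data = {f"user42/row{i}": i for i in range(5)}
src.buckets.add("user42")
original = dict(src.data)

result = movedir("user42", list(original), src, dst, threading.Lock())
```

Only the last step holds the lock, so the time during which other operations must wait depends on the nominal remainder rather than on the size of the bucket.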

A bucket is also the smallest unit whose geographic replication properties (placement, for short) an application can specify. Administrators control two dimensions:

The total number and type of replicas
The geographical placement of replicas
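One way to picture these two dimensions is a small per-bucket placement record. The Placement class, its field names, and the region names below are assumptions made for illustration and do not reflect Spanner's actual configuration format.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Placement:
    replica_counts: dict  # dimension 1: replica type -> number of replicas
    regions: tuple        # dimension 2: where those replicas are placed

    def total_replicas(self) -> int:
        return sum(self.replica_counts.values())


# Hypothetical placements, keyed by bucket prefix.
placements = {
    "user42": Placement({"read-write": 3}, ("us-east", "us-west", "europe")),
    "user7": Placement({"read-write": 3, "read-only": 2}, ("us-east", "asia")),
}

total_user7 = placements["user7"].total_replicas()
```

Keeping the record per bucket mirrors the point above: placement is specified at bucket granularity, not per row.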

Types of replicas

In a multi-region setup, replicas can be of different types (more on that later), whereas single-region instances use only read-write replicas. For a write transaction, Spanner requires a majority of voting replicas to ...
