Introduction to Bigtable

Learn about Bigtable and the motivation behind its creation.

From the dark ages to a renaissance for databases

With the advent of hyperscale services such as worldwide search, online shopping, and messaging, the deficiencies of traditional databases (based on the relational data model) became apparent. These deficiencies fall into two classes: scalability challenges and performance challenges.

Traditional databases are optimized for read-heavy workloads, where the data schema is known at write time and does not change frequently. Additionally, most implementations of relational DB engines run either on a single beefy server or on a group of servers located physically close together. Such a setup relies on vertical scaling (also known as "scaling up," which means adding capabilities such as extra CPUs or RAM to an existing machine) for improvements, and there are hard limits to such scaling. Application workloads were approaching those limits, both in raw data size and in the IOPS (input/output operations per second) that database systems could deliver with good throughput and latency.

These deficiencies pushed organizations into a multi-decade quest to research and develop custom database systems. The guiding insight was that many applications do not need the full feature set of the relational model, so inventing a new, simpler model could yield highly scalable and highly performant database systems. In this chapter, we will focus on one such system designed by Google, known as Bigtable.

The need for Bigtable

While traditional relational databases work well for many data problems, they fall short for important use cases that demand data-size scalability and high read/write performance. Some of those use cases are:

  • Fraud detection: It relies on detection rules and algorithms applied to transaction information, customer information, time of day, location, etc., all of which must be evaluated instantly and at a large scale. Typically, most of the data is read infrequently, but when it is needed, we might have to read most of it in near real time. Such workloads are not a good fit for traditional databases.

  • Time-series data: This concerns data such as cumulative CPU and memory usage across several thousand servers in a data center (see the row-key sketch after this list).

  • ...
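To make the time-series use case concrete, the sketch below shows one common row-key design for such data in a Bigtable-style store, where rows are kept sorted lexicographically by key. This is a minimal illustration in plain Python (no Bigtable client); the server_id#reversed_timestamp scheme, the make_row_key helper, and the sample values are illustrative assumptions, not APIs from this course.

```python
import time

def make_row_key(server_id: str, sample_ts: float) -> str:
    """Build a sortable row key of the form server_id#reversed_timestamp."""
    # Reversing the timestamp makes newer samples sort before older ones,
    # since Bigtable-style stores keep rows in lexicographic key order.
    reversed_ms = 2**63 - int(sample_ts * 1000)  # milliseconds
    return f"{server_id}#{reversed_ms:020d}"

# Hypothetical samples: CPU usage for one server, collected once a minute.
now = time.time()
keys = [make_row_key("server-0042", now - 60 * i) for i in range(3)]
for key in sorted(keys):
    print(key)  # the newest sample prints first
```

With this layout, "the latest N readings for a server" becomes a short sequential scan under that server's key prefix rather than a query over the whole table, which is exactly the access pattern that strains a traditional relational setup.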
