Efficiency of Kafka

Learn about the design decisions of Kafka that make it efficient.

We'll cover the following...

Simple storage
Efficient transfer
Stateless broker

Kafka has a few features that make it efficient, like simple storage, efficient transfer, and a stateless broker.

Simple storage

Kafka has a simple storage layout. Its key aspects are discussed below.

Implementation of a partition

A partition could be implemented as a single large file. However, Kafka does not keep data forever; it has to remove older data from the disk to make room for new data. If a partition were one large file, it would be hard to find and remove the data that is no longer needed. Therefore, a partition in a topic is implemented as a logical log that comprises a set of segment files of approximately the same size. This way, new messages are appended to the newest segment file, and old data is removed by deleting the oldest segment file as a whole, without having to find and delete part of a large file.
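The segment layout described above can be illustrated with a minimal in-memory sketch. This is a toy model, not Kafka's implementation: the class name `SegmentedLog` and the parameters `segment_bytes` and `max_segments` are hypothetical, and real segments are files on disk rather than byte arrays.

```python
from collections import deque


class SegmentedLog:
    """Toy in-memory model of a partition stored as a set of segments."""

    def __init__(self, segment_bytes, max_segments):
        self.segment_bytes = segment_bytes    # hypothetical roll threshold
        self.max_segments = max_segments      # hypothetical retention limit
        # Oldest segment on the left, active segment on the right.
        self.segments = deque([bytearray()])

    def append(self, message: bytes) -> None:
        active = self.segments[-1]
        # Roll to a new segment once the active one would overflow.
        if active and len(active) + len(message) > self.segment_bytes:
            active = bytearray()
            self.segments.append(active)
        active.extend(message)
        # Retention: drop whole segments from the old end, never part of one.
        while len(self.segments) > self.max_segments:
            self.segments.popleft()


log = SegmentedLog(segment_bytes=10, max_segments=2)
for msg in [b"aaaa", b"bbbb", b"cccc", b"dddd", b"eeee"]:
    log.append(msg)

print([bytes(s) for s in log.segments])
```

Cleaning up is a single `popleft()` on the oldest segment, which mirrors why deleting a whole segment file is cheaper than cutting a range out of one large file.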

Whenever a producer publishes a message to a partition, the broker appends it to the last segment file, also called the active segment. Segment files are flushed to the disk, but to achieve better performance, the system flushes only after a certain amount of data has accumulated or a certain amount of time has passed, whichever happens first. A segment file usually holds up to 1 GB or a week’s worth of data. Consumers can only read a message after it has been flushed to the disk.
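The flush policy above (flush on whichever threshold is hit first, and expose only flushed messages) might look like the following sketch. The names `FlushingSegment`, `flush_messages`, and `flush_seconds` are assumptions for illustration, a message count stands in for "amount of data", and a Python list stands in for the disk. The clock is injectable so the example runs deterministically.

```python
import time


class FlushingSegment:
    """Toy model: buffer appends, flush on a count or time threshold."""

    def __init__(self, flush_messages, flush_seconds, clock=time.monotonic):
        self.flush_messages = flush_messages  # hypothetical count threshold
        self.flush_seconds = flush_seconds    # hypothetical time threshold
        self.clock = clock
        self.flushed = []   # stands in for data on disk, visible to consumers
        self.pending = []   # appended but not yet flushed
        self.last_flush = clock()

    def append(self, message):
        self.pending.append(message)
        self.maybe_flush()

    def maybe_flush(self):
        too_many = len(self.pending) >= self.flush_messages
        too_old = self.clock() - self.last_flush >= self.flush_seconds
        # Flush when either threshold is reached, whichever comes first.
        if self.pending and (too_many or too_old):
            self.flushed.extend(self.pending)
            self.pending.clear()
            self.last_flush = self.clock()

    def readable(self):
        # Consumers only ever see messages that have been flushed.
        return list(self.flushed)


now = [0.0]  # fake clock so the example does not depend on wall time
seg = FlushingSegment(flush_messages=3, flush_seconds=5.0, clock=lambda: now[0])
seg.append("m1")
seg.append("m2")
before = seg.readable()       # nothing has been flushed yet
seg.append("m3")              # count threshold reached, triggers a flush
after_count = seg.readable()
seg.append("m4")
now[0] = 6.0
seg.maybe_flush()             # time threshold reached, triggers a flush
after_time = seg.readable()
```

Batching the flush trades a short visibility delay for fewer, larger disk writes.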

Message IDs or offsets

Kafka stores each segment in a single file that comprises messages and their offsets. In other messaging systems, every message has its own explicit ID. Kafka instead addresses each message by its logical offset in the log. This eliminates the overhead of maintaining random-access data structures that map an ID to the storage location of a message. Offsets within a partition are increasing but not consecutive, because messages can have different lengths. To compute the offset of the next message appended to a partition, we add the length of the last message to the offset of the last message.
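As a small worked example of this offset scheme, the sketch below assigns offsets to a sequence of messages, where each offset is the previous message's offset plus the previous message's length. The function name `assign_offsets` is hypothetical, and any per-message framing overhead is ignored.

```python
def assign_offsets(messages, start_offset=0):
    """Return the logical offset of each message in append order."""
    offsets = []
    next_offset = start_offset
    for message in messages:
        offsets.append(next_offset)
        # The next offset is this message's offset plus this message's length.
        next_offset += len(message)
    return offsets


offsets = assign_offsets([b"hello", b"hi", b"kafka!"])
print(offsets)  # [0, 5, 7]
```

Because an offset is just a position in the log, no separate index from message ID to storage location is needed to find a message.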

Consumption of messages

A consumer sends pull ...
