Efficiency of Kafka

Learn about the design decisions of Kafka that make it efficient.

Kafka has a few features that make it efficient, like simple storage, efficient transfer, and a stateless broker.

Simple storage

Kafka has a simple storage layout. Its key aspects are discussed below.

Implementation of a partition

A partition could be implemented as a single large file. However, Kafka does not keep data forever; it has to remove older data from the disk to make way for new data. If a partition were one large file, it would be hard to find and remove the data that is no longer needed. Therefore, a partition in a topic is implemented as a logical log that comprises a set of segment files of approximately the same size. This way, Kafka can append new messages to the newest segment file and reclaim space by deleting the oldest segment file as a whole, without having to find and delete part of a large file.
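The segment idea can be illustrated with a minimal in-memory sketch. It is not Kafka's implementation: the class name `SegmentedLog`, the parameters `segment_capacity` and `max_segments`, and the use of a message count (rather than bytes or age) as the size limit are all assumptions made for illustration.

```python
from collections import deque


class SegmentedLog:
    """Toy partition: a logical log made of fixed-capacity segments."""

    def __init__(self, segment_capacity, max_segments):
        self.segment_capacity = segment_capacity  # messages per segment (hypothetical unit)
        self.max_segments = max_segments          # retention limit (hypothetical)
        self.segments = deque([[]])               # oldest segment on the left

    def append(self, message):
        active = self.segments[-1]
        if len(active) >= self.segment_capacity:
            # The active segment is full: roll over to a fresh one.
            active = []
            self.segments.append(active)
            # Retention: drop whole segments from the old end.
            while len(self.segments) > self.max_segments:
                self.segments.popleft()
        active.append(message)


log = SegmentedLog(segment_capacity=2, max_segments=2)
for i in range(5):
    log.append(f"m{i}")
print(list(log.segments))  # [['m2', 'm3'], ['m4']]
```

Cleanup here is a single `popleft()` on the oldest segment; no scan through the remaining data is needed, which is the point of splitting the log into segments.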

Whenever a producer publishes a message to a partition, the broker appends it to the last segment file, called the active segment. Segment files are flushed to the disk. However, to achieve better performance, the broker does not flush after every message; it waits until a certain amount of data has accumulated or a certain amount of time has passed, whichever happens first. A segment file usually holds up to 1 GB or a week’s worth of data, whichever limit is reached first. Consumers can only read a message after it has been flushed to the disk.
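The "whichever happens first" flush rule can be sketched as a small policy object. This is a hypothetical illustration, not Kafka's code: the names `FlushPolicy`, `max_messages`, and `max_seconds` are invented here, and the data threshold is counted in messages for simplicity. The clock is injectable so the behavior can be checked without waiting.

```python
import time


class FlushPolicy:
    """Decide when buffered messages should be written to disk."""

    def __init__(self, max_messages, max_seconds, clock=time.monotonic):
        self.max_messages = max_messages  # data threshold (hypothetical)
        self.max_seconds = max_seconds    # time threshold (hypothetical)
        self.clock = clock
        self.pending = 0
        self.last_flush = clock()

    def record_append(self):
        """Count one appended message; return True if a flush is due."""
        self.pending += 1
        return self.should_flush()

    def should_flush(self):
        if self.pending == 0:
            return False  # nothing buffered, nothing to flush
        waited = self.clock() - self.last_flush
        return self.pending >= self.max_messages or waited >= self.max_seconds

    def mark_flushed(self):
        """Reset the counters after the caller has written to disk."""
        self.pending = 0
        self.last_flush = self.clock()


now = [0.0]  # fake clock so the example is deterministic
policy = FlushPolicy(max_messages=3, max_seconds=5.0, clock=lambda: now[0])
by_count = [policy.record_append() for _ in range(3)]
policy.mark_flushed()
policy.record_append()
now[0] = 6.0
by_time = policy.should_flush()
print(by_count, by_time)  # [False, False, True] True
```

Batching writes this way trades a short delay before messages become visible to consumers for fewer, larger disk writes.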

Question: What if a broker fails before flushing the segment files to the disk?

Message IDs or offsets

Kafka stores each segment in a single file that comprises messages and their offsets. In other messaging systems, every message has its own ID. Kafka, however, addresses each message by its logical offset in the log. This eliminates the overhead of maintaining random-access data structures that map an ID to the storage location of a message. Offsets within a partition are increasing but not consecutive, because messages can have different lengths. To compute the offset of the next message appended to a partition, we add the length of the last message to the offset of the last message. ...
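The offset arithmetic described above can be sketched as follows: each message's offset is the previous message's offset plus the previous message's length. The function name `assign_offsets` is hypothetical, and the sketch treats the length as the raw payload size, ignoring any per-message header a real on-disk format would add.

```python
def assign_offsets(messages, start_offset=0):
    """Return ([(offset, message), ...], next_offset) for byte-string messages."""
    entries = []
    offset = start_offset
    for message in messages:
        entries.append((offset, message))
        # Next offset = this message's offset + this message's length.
        offset += len(message)
    return entries, offset


entries, next_offset = assign_offsets([b"hello", b"hi", b"kafka!"])
print(entries)      # [(0, b'hello'), (5, b'hi'), (7, b'kafka!')]
print(next_offset)  # 13
```

The offsets 0, 5, and 7 increase but are not evenly spaced, and no separate ID-to-location index is required because the offset itself identifies where the message sits in the log.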
