Data Storage

This lesson explains how Kafka can be configured to determine when to delete data and the format in which data is stored on disk. Later, the lesson discusses compressing Kafka messages.

We'll cover the following...

Data retention
File format
Index
Compaction

Data retention

Kafka doesn’t hold data in perpetuity. The admin can configure Kafka to delete the messages for a topic in two ways:

Specify a retention time after which messages are deleted.
Specify a maximum data size; once it is exceeded, the oldest messages are deleted.
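
As a rough illustration of the two options, the sketch below builds topic-level overrides using Kafka's `retention.ms` and `retention.bytes` settings. The helper `retention_config` is a hypothetical name introduced only for this example; how the resulting dictionary is applied (admin client, CLI, or broker defaults) is left out on purpose.

```python
def retention_config(days=None, max_bytes=None):
    """Build topic-level retention overrides as string key/value pairs.

    Hypothetical helper for illustration; only the config key names
    ("retention.ms", "retention.bytes") come from Kafka itself.
    """
    config = {}
    if days is not None:
        # Time-based retention is expressed in milliseconds.
        config["retention.ms"] = str(int(days * 24 * 60 * 60 * 1000))
    if max_bytes is not None:
        # Size-based retention applies per partition, not per topic.
        config["retention.bytes"] = str(max_bytes)
    return config


time_based = retention_config(days=7)            # keep messages for a week
size_based = retention_config(max_bytes=1024 ** 3)  # keep about 1 GB per partition
print(time_based)
print(size_based)
```

Both settings can be set together, in which case data becomes eligible for deletion as soon as either limit is crossed.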

In either scenario, Kafka doesn’t wait for consumers to read messages; it deletes them as soon as the deletion criterion is met. Data for a partition isn’t stored as one contiguous file. Rather, the data is broken into chunks of files called segments. By default, each segment holds at most 1GB of data or a week’s worth of data, whichever limit is reached first. The segment currently being written to is known as the active segment. As soon as it reaches its limit, it is closed and a new file is opened for writing. Having segments makes deleting stale data much easier than attempting to delete messages from one long contiguous file.
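
The segment mechanics described above can be sketched with a toy in-memory model. This is not how Kafka implements its log; the class names `Segment` and `PartitionLog`, the method names, and the tiny limits used in the demo are all assumptions chosen to keep the example short. It only shows the idea: writes go to the last segment, a new segment is opened when a size or age limit is hit, and cleanup drops whole closed segments instead of individual messages.

```python
from dataclasses import dataclass, field


@dataclass
class Segment:
    created_ms: int
    size: int = 0
    last_write_ms: int = 0


@dataclass
class PartitionLog:
    """Toy model of one partition's log (hypothetical, not Kafka's code)."""
    segment_bytes: int   # size limit per segment
    segment_ms: int      # age limit per segment
    segments: list = field(default_factory=list)

    def append(self, nbytes, now_ms):
        active = self.segments[-1] if self.segments else None
        # Roll to a new segment if there is none yet, the write would
        # exceed the size limit, or the active segment is too old.
        if (active is None
                or active.size + nbytes > self.segment_bytes
                or now_ms - active.created_ms >= self.segment_ms):
            active = Segment(created_ms=now_ms)
            self.segments.append(active)
        active.size += nbytes
        active.last_write_ms = now_ms

    def delete_expired(self, now_ms, retention_ms):
        # Only closed segments are candidates; the last one is still active.
        closed, active = self.segments[:-1], self.segments[-1:]
        kept = [s for s in closed if now_ms - s.last_write_ms < retention_ms]
        deleted = len(closed) - len(kept)
        self.segments = kept + active
        return deleted


# Tiny limits so the behavior is visible: 100 bytes or 1000 ms per segment.
log = PartitionLog(segment_bytes=100, segment_ms=1000)
log.append(60, now_ms=0)      # opens the first segment
log.append(60, now_ms=10)     # would exceed 100 bytes, so a second segment opens
log.append(30, now_ms=20)     # fits in the second segment
log.append(10, now_ms=1500)   # second segment is too old, so a third opens
segments_before = len(log.segments)
deleted = log.delete_expired(now_ms=2000, retention_ms=1990)
print(segments_before, deleted, len(log.segments))
```

Because deletion works on whole segments, removing stale data in this model is a matter of dropping list entries, which mirrors why segment files are easier to clean up than one long file.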

Note that the active segment can never be deleted. This has consequences for the deletion policy. Say you configure Kafka to retain ...
