Data Storage

This lesson explains how Kafka decides when to delete data and the format in which data is stored on disk. It closes with a discussion of how Kafka messages are compressed.

Data retention

Kafka doesn’t hold data in perpetuity. The admin can configure Kafka to delete a topic’s messages in two ways (both are illustrated in the sketch after this list):

  • Specify a retention time after which messages are deleted.

  • Specify a maximum data size; once a partition grows beyond it, the oldest messages are deleted.
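
Both criteria map to per-topic configuration settings. The following is a minimal sketch, assuming a broker at localhost:9092 and a hypothetical topic named orders, that sets retention.ms (the time-based limit, here one day) and retention.bytes (the size-based limit, here 1GB) through Kafka’s Java Admin client:

    import java.util.Collection;
    import java.util.List;
    import java.util.Map;
    import java.util.Properties;

    import org.apache.kafka.clients.admin.Admin;
    import org.apache.kafka.clients.admin.AlterConfigOp;
    import org.apache.kafka.clients.admin.ConfigEntry;
    import org.apache.kafka.common.config.ConfigResource;

    public class RetentionConfigSketch {
        public static void main(String[] args) throws Exception {
            Properties props = new Properties();
            props.put("bootstrap.servers", "localhost:9092");
            try (Admin admin = Admin.create(props)) {
                ConfigResource topic =
                    new ConfigResource(ConfigResource.Type.TOPIC, "orders");
                Collection<AlterConfigOp> ops = List.of(
                    // Time-based: delete messages older than one day.
                    new AlterConfigOp(new ConfigEntry("retention.ms", "86400000"),
                                      AlterConfigOp.OpType.SET),
                    // Size-based: delete the oldest messages once the
                    // partition exceeds roughly 1GB.
                    new AlterConfigOp(new ConfigEntry("retention.bytes", "1073741824"),
                                      AlterConfigOp.OpType.SET));
                admin.incrementalAlterConfigs(Map.of(topic, ops)).all().get();
            }
        }
    }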

In either scenario, Kafka does not wait for consumers to read messages; it deletes them as soon as the retention criterion is met. Data for a partition isn’t one contiguous file. Rather, the data is broken into chunks of files called segments. By default, each segment can be at most 1GB in size or contain a week’s worth of data, whichever limit is reached first. The segment currently being written to is known as the active segment; it is closed as soon as it reaches 1GB, and a new file is opened for writing. Having segments makes deleting stale data much easier than attempting to delete messages from one long contiguous file.
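
The segment limits themselves are configurable per topic. As a minimal sketch (the topic name clicks, the partition count, and the replication factor are hypothetical), the defaults described above correspond to the segment.bytes and segment.ms settings, which can be supplied when the topic is created:

    import java.util.List;
    import java.util.Map;
    import java.util.Properties;

    import org.apache.kafka.clients.admin.Admin;
    import org.apache.kafka.clients.admin.NewTopic;

    public class SegmentConfigSketch {
        public static void main(String[] args) throws Exception {
            Properties props = new Properties();
            props.put("bootstrap.servers", "localhost:9092");
            try (Admin admin = Admin.create(props)) {
                // Roll a new segment at 1GB or after seven days,
                // whichever limit is reached first (the defaults above).
                NewTopic topic = new NewTopic("clicks", 3, (short) 1)
                    .configs(Map.of(
                        "segment.bytes", "1073741824",
                        "segment.ms", "604800000"));
                admin.createTopics(List.of(topic)).all().get();
            }
        }
    }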

Note that the active segment can never be deleted. This has consequences for the retention policy. Say we configure Kafka to retain data for only one day. If the topic doesn’t experience high traffic, the active segment may also hold data from previous days, and that data will not be deleted because it still sits in the active segment. On the other hand, if we configure Kafka to store a week’s worth of data and a new segment is rolled each day, we will see the partition contain seven segments most of the time.

Also, note that the broker keeps an open file handle to every segment in every partition, including the inactive ones, so the operating system must typically be tuned to allow a high open-file count.

File format

Each segment is stored in a single data file. The file consists of Kafka messages and their offsets. Interestingly, the message format on disk is the same as on the wire, i.e., messages received from producers are stored on disk without any alteration in format and are sent to consumers exactly as they were received. This allows Kafka to use the zero-copy optimization when sending messages to consumers, and it avoids decompressing messages when they are received and recompressing them when they are sent out.
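
The zero-copy path can be sketched in plain Java. This is only an illustration of the mechanism, not Kafka’s actual code; the segment path and consumer address below are hypothetical. Because the disk and wire formats match, the broker can hand segment bytes directly to a socket with FileChannel.transferTo, which maps to sendfile on Linux and avoids copying the data into user space:

    import java.io.IOException;
    import java.net.InetSocketAddress;
    import java.nio.channels.FileChannel;
    import java.nio.channels.SocketChannel;
    import java.nio.file.Path;
    import java.nio.file.StandardOpenOption;

    public class ZeroCopySendSketch {
        public static void main(String[] args) throws IOException {
            // Hypothetical on-disk segment file for partition 0 of "orders".
            Path segment = Path.of("/var/kafka-logs/orders-0/00000000000000000000.log");
            try (FileChannel file = FileChannel.open(segment, StandardOpenOption.READ);
                 SocketChannel socket =
                     SocketChannel.open(new InetSocketAddress("consumer-host", 9000))) {
                long position = 0;
                long size = file.size();
                // transferTo hands bytes from the page cache straight to
                // the socket; the data never enters user space.
                while (position < size) {
                    position += file.transferTo(position, size - position, socket);
                }
            }
        }
    }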

Each message contains information beyond the key, value, and offset (some of these fields surface in the consumer API, as sketched after this list). This includes:

  • checksum to detect corruption
  • timestamp, which can be set either to when the message was received by the broker or to when it was sent by the producer, depending upon configuration
  • compression type, such as Snappy, GZip, or LZ4
  • magic byte to detect the version of the message format
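
As a minimal sketch of how some of these fields surface to applications (the topic name orders and the group id are hypothetical), a Java consumer can read the stored offset, timestamp, and timestamp type from each record:

    import java.time.Duration;
    import java.util.List;
    import java.util.Properties;

    import org.apache.kafka.clients.consumer.ConsumerRecord;
    import org.apache.kafka.clients.consumer.ConsumerRecords;
    import org.apache.kafka.clients.consumer.KafkaConsumer;

    public class RecordMetadataSketch {
        public static void main(String[] args) {
            Properties props = new Properties();
            props.put("bootstrap.servers", "localhost:9092");
            props.put("group.id", "metadata-demo");
            props.put("key.deserializer",
                "org.apache.kafka.common.serialization.StringDeserializer");
            props.put("value.deserializer",
                "org.apache.kafka.common.serialization.StringDeserializer");
            try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props)) {
                consumer.subscribe(List.of("orders"));
                ConsumerRecords<String, String> records =
                    consumer.poll(Duration.ofSeconds(1));
                for (ConsumerRecord<String, String> record : records) {
                    // timestampType() reports whether the timestamp is
                    // CreateTime (set by the producer) or LogAppendTime
                    // (set by the broker).
                    System.out.printf("offset=%d timestamp=%d type=%s%n",
                        record.offset(), record.timestamp(), record.timestampType());
                }
            }
        }
    }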

Note that if the producer employs compression on its end, the message it sends to the broker is a single wrapper message whose value is all the messages in the batch compressed together. The broker stores this wrapper message as-is and sends it out to the consumer when requested. The consumer decompresses the wrapper message and can then see all the messages in the batch along with their offsets and timestamps.
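
A minimal sketch of enabling this producer-side batch compression, assuming a broker at localhost:9092 and a hypothetical topic orders; setting compression.type is all that is needed, and the consumer decompresses transparently:

    import java.util.Properties;

    import org.apache.kafka.clients.producer.KafkaProducer;
    import org.apache.kafka.clients.producer.ProducerRecord;

    public class CompressedProducerSketch {
        public static void main(String[] args) {
            Properties props = new Properties();
            props.put("bootstrap.servers", "localhost:9092");
            props.put("key.serializer",
                "org.apache.kafka.common.serialization.StringSerializer");
            props.put("value.serializer",
                "org.apache.kafka.common.serialization.StringSerializer");
            // Compress whole batches on the producer side; the broker
            // stores each compressed batch as-is.
            props.put("compression.type", "snappy");
            try (KafkaProducer<String, String> producer = new KafkaProducer<>(props)) {
                for (int i = 0; i < 100; i++) {
                    producer.send(new ProducerRecord<>("orders", "key-" + i, "value-" + i));
                }
            }
        }
    }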
