Kafka promises to collect data efficiently from multiple producers in parallel, retain it, and deliver it to multiple consumers simultaneously. Moreover, it promises to deliver large volumes of data in real time. Let's look at some evidence of how Kafka delivers on these promises by comparing its performance with Apache ActiveMQ (a popular open-source implementation of Java Message Service (JMS)) and RabbitMQ (a messaging system known for its performance).

Many of the performance numbers and graphs in this lesson are inspired by the paper: Kreps, Jay, Neha Narkhede, and Jun Rao. "Kafka: A distributed messaging system for log processing." In Proceedings of the NetDB, vol. 11, pp. 1-7. 2011.

All the experiments and timings stated in the text below were run on two Linux machines, each with eight 2 GHz cores, 16 GB of memory, and 6 disks with RAID 10. The two machines are connected through a 1 Gb network link. One is deployed as a broker, and the other acts as either the producer or the consumer, depending on the experiment. Although this experimental setup might seem small, Kafka extracts remarkably high throughput from it. Since Kafka is horizontally scalable, it's reasonable to extrapolate these numbers to a larger setup (for example, for back-of-the-envelope calculations).

Performance improvements

To evaluate Kafka's performance, we need to analyze the messages going from producers to brokers and from brokers to consumers.

Producer throughput

ActiveMQ and RabbitMQ don't have any simple way to send batched messages, so they effectively use a batch size of 1: only one message is sent to the broker at a time. In the experiment, a single producer on each system publishes 10 million messages of 200 bytes each. With batch sizes of 1 and 50, Kafka publishes about 50,000 and 400,000 messages per second, respectively. These rates are orders of magnitude higher than ActiveMQ's and at least twice as high as RabbitMQ's.
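To put these rates in perspective, here is a small back-of-the-envelope sketch in Python. It converts the quoted message rates into payload bandwidth and into the time needed to publish all 10 million messages. The function names are made up for this example, and the link capacity is an assumption (1 gigabit per second); protocol and framing overhead are ignored.

```python
# Back-of-the-envelope check of the producer figures quoted above.
MESSAGE_SIZE_BYTES = 200
TOTAL_MESSAGES = 10_000_000
LINK_CAPACITY_MBPS = 1_000  # assumption: a 1 gigabit-per-second link


def payload_bandwidth_mbps(messages_per_second, message_size_bytes=MESSAGE_SIZE_BYTES):
    """Payload bandwidth in megabits per second (headers and framing ignored)."""
    return messages_per_second * message_size_bytes * 8 / 1_000_000


def seconds_to_publish(messages_per_second, total_messages=TOTAL_MESSAGES):
    """Time to publish all messages at a steady rate."""
    return total_messages / messages_per_second


for batch_size, rate in [(1, 50_000), (50, 400_000)]:
    mbps = payload_bandwidth_mbps(rate)
    share = mbps / LINK_CAPACITY_MBPS
    print(
        f"batch size {batch_size:>2}: {mbps:.0f} Mb/s of payload "
        f"({share:.0%} of the assumed link), "
        f"{seconds_to_publish(rate):.0f} s for all messages"
    )
```

Under these assumptions, the larger batch size moves the producer from using a small fraction of the link to using most of it, which suggests that the network, rather than the producer, starts to become the limiting factor.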

The reasons why Kafka's producer shows this improved performance are listed as follows:

  • Kafka's producer sends as many messages to the broker as the broker can process without waiting for any kind of acknowledgment from it.

  • Kafka possesses a simple and efficient storage system. On average, Kafka only had an overhead of 9 bytes per message as opposed to ActiveMQ's 144 bytes. ActiveMQ's overhead comes from two sources:

    • A large message header that JMS requires

    • Maintenance of indexing structures
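The effect of the per-message overhead can be estimated with a short Python sketch. It combines the average overheads quoted above (9 and 144 bytes) with the 200-byte messages and the 10-million-message workload from the producer test. Combining these figures this way is an assumption made for illustration, and the function name is hypothetical.

```python
# Rough storage estimate: payload plus average per-message overhead.
PAYLOAD_BYTES = 200
TOTAL_MESSAGES = 10_000_000
OVERHEAD_BYTES = {"Kafka": 9, "ActiveMQ": 144}  # averages quoted in the text


def stored_bytes(overhead_bytes, payload_bytes=PAYLOAD_BYTES, total_messages=TOTAL_MESSAGES):
    """Total bytes stored if every message carries the average overhead."""
    return (payload_bytes + overhead_bytes) * total_messages


for system, overhead in OVERHEAD_BYTES.items():
    total = stored_bytes(overhead)
    fraction = overhead / (PAYLOAD_BYTES + overhead)
    print(f"{system}: {total / 1e9:.2f} GB stored, {fraction:.1%} of it overhead")

ratio = stored_bytes(OVERHEAD_BYTES["ActiveMQ"]) / stored_bytes(OVERHEAD_BYTES["Kafka"])
print(f"ActiveMQ stores about {ratio:.2f}x as many bytes as Kafka for the same payload")
```

For small messages like these, the overhead is a large share of what ActiveMQ has to write and transfer, so the broker does noticeably more work for the same payload.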

Batch processing

Batching is key to Kafka's performance gain because sending messages in batches amortizes the remote procedure call (RPC) overhead across many messages. Moreover, if the systems are far away from each other, batching makes better use of the round-trip time (RTT): each round trip carries a whole batch rather than a single message. The improved throughput of Kafka's producer compared to ActiveMQ and RabbitMQ, and the magnitude of the improvement from batching, can be seen in the following illustration.
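To see why amortizing a fixed per-request cost matters, consider a deliberately simple toy model in Python. It assumes that each request pays a fixed cost (RPC overhead plus network delay) and a small cost per message, and that requests are handled one at a time. The cost values are hypothetical and chosen only for illustration; this is not how Kafka's producer is implemented, and it doesn't attempt to reproduce the measured numbers.

```python
def batched_throughput(batch_size, per_request_cost_s, per_message_cost_s):
    """Messages per second when each request carries `batch_size` messages."""
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    # The fixed cost is paid once per request, the per-message cost once per message.
    time_per_request_s = per_request_cost_s + batch_size * per_message_cost_s
    return batch_size / time_per_request_s


# Hypothetical costs: 1 ms fixed cost per request, 10 microseconds per message.
PER_REQUEST_COST_S = 0.001
PER_MESSAGE_COST_S = 0.00001

for batch_size in (1, 10, 50, 200):
    rate = batched_throughput(batch_size, PER_REQUEST_COST_S, PER_MESSAGE_COST_S)
    print(f"batch={batch_size:>3}: {rate:>10,.0f} messages/s")
```

In this model, throughput grows quickly with the batch size at first and then levels off toward the limit set by the per-message cost, so most of the benefit comes from the first few dozen messages per batch.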
