Introduction to Kafka Streams

Kafka Streams is a lightweight Java library for building applications and microservices that process real-time streams of data. Because it leverages Kafka's partitioning and replication, applications built with Kafka Streams are highly scalable and fault-tolerant. Its use cases range from fraud detection and financial data processing to genetic data enrichment and monitoring.

An important aspect of Kafka Streams is that, apart from a Kafka cluster, it requires no additional infrastructure or clusters, unlike other processing frameworks such as Apache Spark or Apache Storm. It's just a dependency added to our application, simple as that. The library itself is an open-source project and a core part of the broader Kafka ecosystem, which also includes Kafka Core, Kafka Connect, Schema Registry, and the Kafka REST Proxy.
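
For example, adding the library with Maven looks like this; the version shown is an illustrative placeholder, so use the release that matches your setup:

<dependency>
    <groupId>org.apache.kafka</groupId>
    <artifactId>kafka-streams</artifactId>
    <version>3.7.0</version>
</dependency>
Adding the Kafka Streams dependency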

import org.apache.kafka.streams.KafkaStreams;
import org.apache.kafka.streams.StreamsBuilder;
import org.apache.kafka.streams.kstream.KStream;

var builder = new StreamsBuilder();
KStream<Void, String> greetingsStream = builder.stream(INPUT_TOPIC);
greetingsStream
    .peek((k, v) -> System.out.println("Received new message: " + v)) // log each incoming record
    .filterNot((k, v) -> v.isBlank())                                 // drop blank messages
    .mapValues(v -> "Hello, " + v)                                    // build the greeting
    .peek((k, v) -> System.out.println("Sending greeting: " + v))     // log each outgoing record
    .to(OUTPUT_TOPIC);
var streams = new KafkaStreams(builder.build(), buildConfiguration());
streams.start();
A simple Kafka Streams application
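
The snippet above calls a buildConfiguration() helper that isn't shown. Here is a minimal sketch of what it might return, where the application id, broker address, and serde choices are illustrative assumptions:

import java.util.Properties;
import org.apache.kafka.common.serialization.Serdes;
import org.apache.kafka.streams.StreamsConfig;

static Properties buildConfiguration() {
    Properties props = new Properties();
    // All instances started with the same application.id form one consumer group.
    props.put(StreamsConfig.APPLICATION_ID_CONFIG, "greetings-app");
    props.put(StreamsConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");
    // Match the KStream<Void, String> key and value types above.
    props.put(StreamsConfig.DEFAULT_KEY_SERDE_CLASS_CONFIG, Serdes.Void().getClass());
    props.put(StreamsConfig.DEFAULT_VALUE_SERDE_CLASS_CONFIG, Serdes.String().getClass());
    return props;
}
A possible buildConfiguration() implementation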

Kafka Streams applications

Before Kafka Streams, developers who wanted to build Kafka-based stream processing applications had to choose between two alternatives.

Kafka’s Consumer and Producer APIs

Using the Consumer and Producer APIs, we can read from and write to Kafka topics directly. Any processing logic on the data, however, has to be written from scratch. For simple use cases, this is not a big price to pay, but as the processing logic grows more complicated, we can end up maintaining a lot of intricate code that has little to do with our core business.
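
As a hedged sketch of what this looks like in practice, here is the same greeting logic written directly against the Consumer and Producer APIs; the topic names and configuration values are illustrative assumptions:

import java.time.Duration;
import java.util.List;
import java.util.Properties;
import org.apache.kafka.clients.consumer.KafkaConsumer;
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerRecord;

Properties consumerProps = new Properties();
consumerProps.put("bootstrap.servers", "localhost:9092");
consumerProps.put("group.id", "greetings-app");
consumerProps.put("key.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");
consumerProps.put("value.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");

Properties producerProps = new Properties();
producerProps.put("bootstrap.servers", "localhost:9092");
producerProps.put("key.serializer", "org.apache.kafka.common.serialization.StringSerializer");
producerProps.put("value.serializer", "org.apache.kafka.common.serialization.StringSerializer");

try (var consumer = new KafkaConsumer<String, String>(consumerProps);
     var producer = new KafkaProducer<String, String>(producerProps)) {
    consumer.subscribe(List.of("greetings-input"));
    while (true) {
        // Polling, transformation, error handling, and offset management
        // are all hand-written here.
        for (var record : consumer.poll(Duration.ofMillis(100))) {
            if (!record.value().isBlank()) {
                producer.send(new ProducerRecord<>("greetings-output", record.key(), "Hello, " + record.value()));
            }
        }
        consumer.commitSync();
    }
}
The same logic with the plain Consumer and Producer APIs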

Using a stream processing framework

The second option would be to use a stream processing framework, such as Apache Storm or Apache Spark. These frameworks give us the tools to perform more complicated operations, but they come at the cost of deployment complexity and the operational overhead of maintaining another cluster.

Kafka Streams to the rescue

While the Consumer and Producer APIs are dedicated to moving data to and from Kafka, the goal of the Kafka Streams API is to help us process real-time data streams. It allows us to easily consume data directly from Kafka topics and apply transformation logic to it using a high-level DSL, which includes a variety of operators, such as filtering, mapping, aggregation, and windowing.

This API does not require additional clusters to be set up and maintained, and applications written with it are easily scalable and fault-tolerant.
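
To give a taste of the richer operators, the hedged sketch below counts messages per key in one-minute tumbling windows; the topic name and serdes are illustrative assumptions:

import java.time.Duration;
import org.apache.kafka.common.serialization.Serdes;
import org.apache.kafka.streams.StreamsBuilder;
import org.apache.kafka.streams.kstream.Consumed;
import org.apache.kafka.streams.kstream.TimeWindows;

var builder = new StreamsBuilder();
builder.stream("page-views", Consumed.with(Serdes.String(), Serdes.String()))
    .groupByKey()                                                     // group records by key
    .windowedBy(TimeWindows.ofSizeWithNoGrace(Duration.ofMinutes(1))) // tumbling one-minute windows
    .count()                                                          // aggregate per key and window
    .toStream()
    .foreach((windowedKey, count) ->
        System.out.println(windowedKey.key() + " appeared " + count + " times in this window"));
Windowed aggregation with the Kafka Streams DSL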

Scalability

When we need to increase the throughput of our Kafka Streams application, all we need to do is start more instances of it. Kafka Streams is deployment-agnostic, so this can be done by simply starting more processes on the same machine or by using tools like Kubernetes. Because Kafka Streams relies on Kafka's consumer groups, the workload is automatically distributed among the instances. This concept is covered in depth in the next chapter, so if anything here is unclear, don't worry: we will get there!
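
Concretely, scaling happens on two axes, both driven by ordinary configuration; the settings below are real Streams configs, while the values are illustrative:

import java.util.Properties;
import org.apache.kafka.streams.StreamsConfig;

Properties props = new Properties();
// Scale up inside one process: several stream threads split the work.
props.put(StreamsConfig.NUM_STREAM_THREADS_CONFIG, 4);
// Scale out across processes: every instance started with the same
// application.id joins the same consumer group, and Kafka rebalances
// the input partitions among the instances automatically.
props.put(StreamsConfig.APPLICATION_ID_CONFIG, "greetings-app");
Two ways to scale a Kafka Streams application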

Multiple instances of an application using Kafka Streams

Fault tolerance

By relying on the same Kafka features, Kafka Streams applications are also fault-tolerant. If one of the application instances fails, the workload is again redistributed automatically. The record it was processing will not be lost; it will be processed by a different instance.
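
For stateful applications, Kafka Streams can additionally keep warm copies of local state on other instances so that failover is faster. Here is a hedged sketch using the real num.standby.replicas setting, with an illustrative value:

import java.util.Properties;
import org.apache.kafka.streams.StreamsConfig;

Properties props = new Properties();
// Keep one warm replica of each local state store on another instance,
// so the tasks of a failed instance can resume without first rebuilding
// their state from the changelog topics.
props.put(StreamsConfig.NUM_STANDBY_REPLICAS_CONFIG, 1);
Speeding up failover with standby replicas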

In the following lessons, we’ll learn about topologies and run a simple “Hello World” Kafka Streams application.