Kafka Streams DSL API

Kafka Streams provides a domain-specific language (DSL) API that’s designed to build complex stream processing applications in a simple and concise way. This is because the DSL API is based on functional programming principles, which makes it highly composable and modular. It supports a wide range of stream processing operations, including:

  • Stateless transformations (e.g., map and filter)

  • Stateful operations (e.g., aggregations)

  • Join and windowing operations

Let’s quickly review these operations.

Kafka Streams DSL API operations

Stateless transformations

Stateless operations in Kafka Streams are a set of operations that are designed to operate independently on each data record in a stream. These operations do not require any internal state to be maintained across data records. Therefore, they are ideal for simple transformations or filtering operations. Stateless operations in Kafka Streams can be used to perform a wide range of data transformations such as filtering, mapping, flattening, and more. In addition, stateless operations can be easily combined with other stateful operations in Kafka Streams to create complex processing pipelines that can handle a wide range of data processing tasks.

Stateful operations

Stateful operations in Kafka Streams are a set of operations that require the internal state to be maintained across multiple data records. These operations are typically used for more complex transformations or aggregations that require data to be processed over time, rather than on a record-by-record basis. Stateful operations are important because they allow developers to perform complex computations on data streams that require context and historical information. For example, an aggregation operation that computes the average temperature over the last hour requires historical information about the temperature readings from the previous hour.

Join and windowing operations

Join operations in Kafka Streams allow developers to combine data from multiple streams based on a common key. They are commonly used in stream processing applications to perform data enrichment, where data from multiple streams is combined to produce a more complete or useful dataset. A popular pattern is to make the information in the databases available in Kafka using the Kafka Connect API (covered later). Then, implement applications that leverage the Kafka Streams API to perform fast and efficient local joins of such tables and streams, rather than requiring the application to make a query to a remote database over the network for each record.

Windowing operations in Kafka Streams allow developers to group data records into time-based windows for aggregation and analysis. It lets us control how to group records that have the same key for stateful operations, such as aggregations or joins, into windows that are tracked per record key. Supported window types in Kafka Streams include hopping, time window, tumbling time window, sliding time window, and session time windows.

The DSL API’s key interfaces

These interfaces are part of the org.apache.kafka.streams.kstream package.

The KStream API

The KStream API represents a stream of records in a Kafka topic. It is a continuous, unbounded sequence of data that is processed in real time. The KStream API provides a functional programming interface for processing the data stream, allowing developers to apply various transformations and operations to the data.

The KTable API

The KTable API represents a changelog of updates to a stream of data. It is a distributed, fault-tolerant, and scalable representation of a table that is continuously updated in real time. The KTable API provides a relational database-like interface for querying and processing data, allowing developers to perform a wide range of join and aggregation operations on the stream. The KTable API is designed for stateful operations requiring internal state management, such as real-time database updates or event-driven microservices.

The GlobalKTable API

In contrast to the KTable API that is partitioned over all KafkaStreams instances, the GlobalKTable API is fully replicated per KafkaStreams instance. Every partition of the underlying topic is consumed by each GlobalKTable, such that the full set of data is available in every KafkaStreams instance. This allows performing joins with the KStream API without repartitioning the input stream.

Now that we have an overview of the DSL API in Kafka Streams, let’s cover stateless operations (supported by the KStream API) in detail.

The KStream stateless operations

Let’s explore some of the important stateless operations supported by Kafka Streams.

Common operations

  • filter: This operation allows us to apply a boolean condition to each record in a stream and only pass through the records that satisfy the condition. This operation does not modify the keys or values of the records, but simply removes records that do not meet the specified condition. The filter method takes a Predicate as its argument, representing the boolean condition to be applied to each record.

Get hands-on with 1200+ tech skills courses.