The DSL API’s Stateful Operations: Aggregation

Learn how Kafka Streams supports aggregation operations in the DSL API.

Kafka Streams DSL API—stateful operations

As discussed previously, stateful operations require internal state to be maintained across multiple data records. They are typically used for more complex transformations or aggregations that process data over time rather than on a record-by-record basis. Stateful operations are important because they allow developers to perform complex computations on data streams that require context and historical information.

Here are some of the key classes that support stateful operations in the Kafka Streams DSL API:

  • KGroupedStream: It represents a grouping of records in a Kafka topic stream. This class is created by calling the groupByKey or groupBy method on a KStream instance, as shown in the sketch after this list.

  • KGroupedTable: It is an abstraction of a re-grouped changelog stream from a primary-keyed table, usually on a different grouping key than the original primary key. This class is created by calling the groupBy method on a KTable instance.

  • Materialized: It provides a fluent builder API to configure the state store. The configuration options include the state store name, the key and value Serdes, the retention period, and so on.

  • State stores: They are used to persist the intermediate results of stateful operations that require access to previously processed records, such as aggregations and joins. These stores are implemented as key-value stores and can be either in-memory or disk-based. They are partitioned and distributed across the cluster for fault tolerance and scalability.

  • Window: Windowing lets us control how to group records that have the same key for stateful operations, such as aggregations or joins, into windows, which are tracked per record key.
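To make these classes concrete, here is a minimal sketch of how they fit together. The topic names ("user-clicks", "user-regions"), the store name ("click-counts-store"), and the record semantics are illustrative assumptions, not part of this lesson: groupByKey on a KStream yields a KGroupedStream, groupBy on a KTable yields a KGroupedTable, and Materialized configures the state store that backs the aggregation.

```java
import org.apache.kafka.common.serialization.Serdes;
import org.apache.kafka.common.utils.Bytes;
import org.apache.kafka.streams.KeyValue;
import org.apache.kafka.streams.StreamsBuilder;
import org.apache.kafka.streams.kstream.Consumed;
import org.apache.kafka.streams.kstream.Grouped;
import org.apache.kafka.streams.kstream.KGroupedStream;
import org.apache.kafka.streams.kstream.KGroupedTable;
import org.apache.kafka.streams.kstream.KStream;
import org.apache.kafka.streams.kstream.KTable;
import org.apache.kafka.streams.kstream.Materialized;
import org.apache.kafka.streams.state.KeyValueStore;

public class GroupingSketch {

    public static void main(String[] args) {
        StreamsBuilder builder = new StreamsBuilder();

        // A record stream keyed by user ID; values are page names (illustrative topic).
        KStream<String, String> clicks =
                builder.stream("user-clicks", Consumed.with(Serdes.String(), Serdes.String()));

        // groupByKey on a KStream returns a KGroupedStream; Grouped carries the Serdes
        // used if Kafka Streams needs to repartition the data.
        KGroupedStream<String, String> clicksByUser =
                clicks.groupByKey(Grouped.with(Serdes.String(), Serdes.String()));

        // Materialized names the state store that will hold the aggregation result
        // and sets its key and value Serdes.
        KTable<String, Long> clickCounts = clicksByUser.count(
                Materialized.<String, Long, KeyValueStore<Bytes, byte[]>>as("click-counts-store")
                        .withKeySerde(Serdes.String())
                        .withValueSerde(Serdes.Long()));

        // A changelog table keyed by user ID; values are region names (illustrative topic).
        KTable<String, String> userRegions =
                builder.table("user-regions", Consumed.with(Serdes.String(), Serdes.String()));

        // groupBy on a KTable re-groups the table on a new key (here, the region)
        // and returns a KGroupedTable.
        KGroupedTable<String, String> usersByRegion = userRegions.groupBy(
                (userId, region) -> KeyValue.pair(region, userId),
                Grouped.with(Serdes.String(), Serdes.String()));

        KTable<String, Long> usersPerRegion = usersByRegion.count();
    }
}
```

Note that grouping by itself does not create a state store; the store is only created when an aggregation such as count is applied to the grouped stream or table.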

Aggregation

Aggregation is a powerful technique for performing complex computations on streaming data. It is particularly useful for applications that require real-time analytics, such as fraud detection, user behavior analysis, and predictive maintenance. In the Kafka Streams DSL API, aggregation is a category of stateful operations that allows us to perform computations on a stream of data based on a key. Its purpose is to generate a summary of the data in the stream that can be used for further analysis or to produce output to another stream or data store.

After records are grouped by key via the groupByKey or groupBy methods (and represented as a KGroupedStream or a KGroupedTable), they can be aggregated using one of the following methods: aggregate, count, or reduce. Because aggregations are key-based operations, they always operate over values of the same key.
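As a rough sketch of count and reduce on a grouped stream (the "purchases" topic, its Double amounts keyed by customer ID, and the output topic name are assumptions made for illustration), both methods run independently per key:

```java
import org.apache.kafka.common.serialization.Serdes;
import org.apache.kafka.streams.StreamsBuilder;
import org.apache.kafka.streams.kstream.Consumed;
import org.apache.kafka.streams.kstream.Grouped;
import org.apache.kafka.streams.kstream.KStream;
import org.apache.kafka.streams.kstream.KTable;
import org.apache.kafka.streams.kstream.Produced;

public class ReduceSketch {

    public static void main(String[] args) {
        StreamsBuilder builder = new StreamsBuilder();

        // Purchases keyed by customer ID; the value is the purchase amount (illustrative topic).
        KStream<String, Double> purchases =
                builder.stream("purchases", Consumed.with(Serdes.String(), Serdes.Double()));

        // reduce folds each new value into the running result for its key, so every
        // customer's purchases are summed independently of every other customer's.
        KTable<String, Double> totalSpendPerCustomer = purchases
                .groupByKey(Grouped.with(Serdes.String(), Serdes.Double()))
                .reduce((runningTotal, newAmount) -> runningTotal + newAmount);

        // count answers "how many purchases per customer" over the same grouping.
        KTable<String, Long> purchasesPerCustomer = purchases
                .groupByKey(Grouped.with(Serdes.String(), Serdes.Double()))
                .count();

        totalSpendPerCustomer.toStream()
                .to("customer-spend-totals", Produced.with(Serdes.String(), Serdes.Double()));
    }
}
```

Because the grouping key is the customer ID, each result row is keyed by customer, and updates for different customers never interact.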

Let’s explore aggregation in detail by looking at each of these methods.

The aggregate function
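The aggregate method is the most general of the three: unlike reduce, it allows the result type to differ from the input value type, supplying an initializer for the starting value and an aggregator that folds each record into the running result. Here is a minimal sketch under assumed inputs (a hypothetical "page-views" topic keyed by page ID, with the time spent in milliseconds as an Integer value, and an illustrative store name):

```java
import org.apache.kafka.common.serialization.Serdes;
import org.apache.kafka.common.utils.Bytes;
import org.apache.kafka.streams.StreamsBuilder;
import org.apache.kafka.streams.kstream.Consumed;
import org.apache.kafka.streams.kstream.Grouped;
import org.apache.kafka.streams.kstream.KStream;
import org.apache.kafka.streams.kstream.KTable;
import org.apache.kafka.streams.kstream.Materialized;
import org.apache.kafka.streams.state.KeyValueStore;

public class AggregateSketch {

    public static void main(String[] args) {
        StreamsBuilder builder = new StreamsBuilder();

        // Page-view events keyed by page ID; the value is the time spent on the page
        // in milliseconds (illustrative topic and semantics).
        KStream<String, Integer> pageViews =
                builder.stream("page-views", Consumed.with(Serdes.String(), Serdes.Integer()));

        // The result type (Long) differs from the input value type (Integer): the
        // initializer creates the starting value for each key, and the aggregator
        // adds each new record to the running total for that key.
        KTable<String, Long> totalTimePerPage = pageViews
                .groupByKey(Grouped.with(Serdes.String(), Serdes.Integer()))
                .aggregate(
                        () -> 0L,                                             // initializer
                        (pageId, timeSpentMs, total) -> total + timeSpentMs,  // aggregator
                        Materialized.<String, Long, KeyValueStore<Bytes, byte[]>>as("page-time-store")
                                .withKeySerde(Serdes.String())
                                .withValueSerde(Serdes.Long()));
    }
}
```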
