Introduction to Kafka Streams

Learn about the core concepts of Kafka Streams and architectural considerations like scalability and fault tolerance.

What is Kafka Streams?

Kafka Streams is a Java library for building real-time, scalable, and fault-tolerant streaming applications that process data in motion. It allows developers to build complex stream processing applications using simple and concise Java code, leveraging the power of Kafka’s distributed architecture.

Some of its key benefits include the following:

  • It’s only a library: Kafka Streams is a Java library, not a platform. We can treat it like any other Java dependency and include it in new or existing applications. Because it is just a library, Kafka Streams applications are also easy to deploy and scale: we can keep our existing deployment models or choose from many options, including on-premises servers, the cloud, Docker containers, and Kubernetes.

  • Tight integration with Apache Kafka: Kafka Streams uses Apache Kafka as its underlying storage and messaging system, which means it inherits many of the benefits of Kafka, such as scalability, fault tolerance, and high availability. As such, Kafka is the only dependency for Kafka Streams applications, making it easy to deploy and scale applications in practice.

  • Stateless and stateful processing: While stateless operations are common, Kafka Streams also supports stateful computations on streaming data. This is made possible by a combination of state stores and Interactive Queries (more on these concepts later in this lesson); a brief sketch of a stateful count appears right after this list.
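
For illustration, here is a minimal sketch of a stateful count that is queried through Interactive Queries. It uses the DSL API, which is introduced in the next section; the topic name page-views, the store name view-counts, and the configuration values are assumptions made for this example, not something prescribed by Kafka Streams.

```java
import java.util.Properties;

import org.apache.kafka.common.serialization.Serdes;
import org.apache.kafka.streams.KafkaStreams;
import org.apache.kafka.streams.StoreQueryParameters;
import org.apache.kafka.streams.StreamsBuilder;
import org.apache.kafka.streams.StreamsConfig;
import org.apache.kafka.streams.kstream.Consumed;
import org.apache.kafka.streams.kstream.Materialized;
import org.apache.kafka.streams.state.QueryableStoreTypes;
import org.apache.kafka.streams.state.ReadOnlyKeyValueStore;

public class StatefulCountExample {
    public static void main(String[] args) throws InterruptedException {
        // Hypothetical configuration values for the sketch.
        Properties props = new Properties();
        props.put(StreamsConfig.APPLICATION_ID_CONFIG, "page-view-counter");
        props.put(StreamsConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");

        StreamsBuilder builder = new StreamsBuilder();

        // Stateful aggregation: counts are maintained in a local state store
        // named "view-counts", backed by a Kafka changelog topic for fault tolerance.
        builder.stream("page-views", Consumed.with(Serdes.String(), Serdes.String()))
               .groupByKey()
               .count(Materialized.as("view-counts"));

        KafkaStreams streams = new KafkaStreams(builder.build(), props);
        streams.start();

        // Interactive Queries: once the instance is RUNNING, the store can be queried
        // directly (a real application would wait for the RUNNING state instead of sleeping,
        // typically serving the query from a REST handler).
        Thread.sleep(5_000);
        ReadOnlyKeyValueStore<String, Long> counts = streams.store(
            StoreQueryParameters.fromNameAndType("view-counts", QueryableStoreTypes.keyValueStore()));
        System.out.println("views for /home: " + counts.get("/home"));

        Runtime.getRuntime().addShutdownHook(new Thread(streams::close));
    }
}
```

The counts live in a local state store backed by a Kafka changelog topic, which is what gives stateful processing its fault tolerance.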

The Kafka Streams APIs

Kafka Streams provides two types of APIs for building stream processing applications:

  • The DSL (domain-specific language) API

  • The Processor API

The DSL API is a high-level API that allows developers to build stream processing applications using a fluent, easy-to-use syntax. With the DSL API, developers can perform common stream processing operations like filtering, mapping, and aggregating, as well as more complex operations like joins and windowing, using simple and concise Java code.
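
To give a feel for the DSL API, here is a minimal sketch of a filter-and-map pipeline; the topic names, serdes, and configuration values are assumptions made for the example.

```java
import java.util.Properties;

import org.apache.kafka.common.serialization.Serdes;
import org.apache.kafka.streams.KafkaStreams;
import org.apache.kafka.streams.StreamsBuilder;
import org.apache.kafka.streams.StreamsConfig;
import org.apache.kafka.streams.kstream.Consumed;
import org.apache.kafka.streams.kstream.KStream;
import org.apache.kafka.streams.kstream.Produced;

public class DslExample {
    public static void main(String[] args) {
        // Hypothetical configuration; adjust to your environment.
        Properties props = new Properties();
        props.put(StreamsConfig.APPLICATION_ID_CONFIG, "dsl-example");
        props.put(StreamsConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");

        StreamsBuilder builder = new StreamsBuilder();

        // Read a stream of key-value pairs, keep non-empty values, normalize them,
        // and write the result to an output topic.
        KStream<String, String> source =
            builder.stream("input-topic", Consumed.with(Serdes.String(), Serdes.String()));

        source.filter((key, value) -> value != null && !value.isEmpty())
              .mapValues(value -> value.trim().toLowerCase())
              .to("output-topic", Produced.with(Serdes.String(), Serdes.String()));

        KafkaStreams streams = new KafkaStreams(builder.build(), props);
        streams.start();

        // Stop the application cleanly on shutdown.
        Runtime.getRuntime().addShutdownHook(new Thread(streams::close));
    }
}
```

Each chained call (filter, mapValues, to) becomes a node in the underlying stream processing topology.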

The Processor API, on the other hand, is a lower-level API that provides more fine-grained control over the stream processing pipeline. With the Processor API, developers can define custom stream processing operators and connect them in a directed acyclic graph (DAG) to create a complete stream processing topology. This gives developers more flexibility and control over the stream processing pipeline but also requires more low-level coding and understanding of Kafka Streams’ internals.
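
For comparison, the sketch below wires up a similar filtering step by hand using the Processor API (the typed org.apache.kafka.streams.processor.api variant available in recent Kafka versions); the node names, topic names, and configuration values are placeholders chosen for this example.

```java
import java.util.Properties;

import org.apache.kafka.common.serialization.Serdes;
import org.apache.kafka.streams.KafkaStreams;
import org.apache.kafka.streams.StreamsConfig;
import org.apache.kafka.streams.Topology;
import org.apache.kafka.streams.processor.api.Processor;
import org.apache.kafka.streams.processor.api.ProcessorContext;
import org.apache.kafka.streams.processor.api.ProcessorSupplier;
import org.apache.kafka.streams.processor.api.Record;

public class ProcessorApiExample {

    // A custom processor node that drops empty values and forwards the rest downstream.
    static class FilterEmptyProcessor implements Processor<String, String, String, String> {
        private ProcessorContext<String, String> context;

        @Override
        public void init(ProcessorContext<String, String> context) {
            this.context = context;
        }

        @Override
        public void process(Record<String, String> record) {
            if (record.value() != null && !record.value().isEmpty()) {
                context.forward(record);
            }
        }
    }

    public static void main(String[] args) {
        // Hypothetical configuration values for the sketch.
        Properties props = new Properties();
        props.put(StreamsConfig.APPLICATION_ID_CONFIG, "processor-api-example");
        props.put(StreamsConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");

        // Wire source -> processor -> sink by hand to form the topology DAG.
        ProcessorSupplier<String, String, String, String> filterSupplier = FilterEmptyProcessor::new;
        Topology topology = new Topology();
        topology.addSource("Source",
                Serdes.String().deserializer(), Serdes.String().deserializer(), "input-topic");
        topology.addProcessor("FilterEmpty", filterSupplier, "Source");
        topology.addSink("Sink", "output-topic",
                Serdes.String().serializer(), Serdes.String().serializer(), "FilterEmpty");

        KafkaStreams streams = new KafkaStreams(topology, props);
        streams.start();
        Runtime.getRuntime().addShutdownHook(new Thread(streams::close));
    }
}
```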

The DSL API is generally recommended for most stream processing use cases because it provides a simpler and more intuitive way to build stream processing applications. However, the Processor API may be more appropriate for certain use cases that require more complex or custom processing.

Kafka Streams: key concepts

Throughout this lesson, we will encounter several concepts and terms that are the building blocks of Kafka Streams applications, so now is a good time to get familiar with them.

  • Stream: an unbounded, continuously updating sequence of records produced and consumed in real time. It contains an ordered, replayable sequence of immutable data records, where a data record is defined as a key-value pair.

  • Topology: a directed acyclic graph (DAG) representing the stream processing pipeline in a Kafka Streams application. A topology consists of a set of stream processing nodes, each representing a specific operation or processing step in the pipeline. Topologies can be defined using either the DSL API or the Processor API; a short sketch after this list shows how a topology can be built and described.

  • Application: a standalone Java application that uses the Kafka Streams library to process data in real time. A Kafka Streams application consists of one or more stream processing topologies, which are executed by a set of Kafka Streams instances running in a distributed environment.

  • Stream partition: an ordered subset of the records in a stream. In Kafka Streams, stream partitions distribute the processing load across multiple application instances. There are close links between Kafka Streams and Apache Kafka in the context of parallelism: each stream partition maps to a Kafka topic partition, a data record in the stream partition maps to a Kafka message from that topic, and the keys of data records determine how data is routed to specific partitions within topics.

  • Task: a unit of work that is performed by a Kafka Streams instance. Each Kafka Streams instance is responsible for one or more tasks, where each task processes one or more stream partitions. Tasks are dynamically assigned to Kafka Streams instances by the Kafka Streams runtime based on the number of partitions and the processing load.
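
To make the topology and task concepts more concrete, the sketch below builds a small DSL topology and prints its description; the topic names are placeholders chosen for the example. The printed output groups the source, processor, and sink nodes into sub-topologies, and at runtime Kafka Streams creates tasks from the sub-topologies and the partitions of their input topics.

```java
import org.apache.kafka.common.serialization.Serdes;
import org.apache.kafka.streams.StreamsBuilder;
import org.apache.kafka.streams.Topology;
import org.apache.kafka.streams.kstream.Consumed;

public class DescribeTopologyExample {
    public static void main(String[] args) {
        StreamsBuilder builder = new StreamsBuilder();

        // A small pipeline: read records, keep long values, re-key by value length,
        // and write the result to another topic.
        builder.stream("orders", Consumed.with(Serdes.String(), Serdes.String()))
               .filter((key, value) -> value.length() > 10)
               .selectKey((key, value) -> String.valueOf(value.length()))
               .to("orders-by-length");

        // describe() prints the DAG of source, processor, and sink nodes,
        // grouped into the sub-topologies from which tasks are created.
        Topology topology = builder.build();
        System.out.println(topology.describe());
    }
}
```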

Scalability of Kafka Streams applications

Kafka Streams applications can scale horizontally by adding more instances of the application to the processing cluster. Kafka Streams provides a built-in partitioning mechanism that allows it to distribute the processing load across multiple instances of an application. Specifically, Kafka Streams uses the number of stream partitions as the basis for scaling. Each instance of an application is responsible for processing a subset of the total number of stream partitions, and the processing load is evenly distributed across the instances.

This scaling mechanism is made possible by Kafka’s underlying distributed architecture. Kafka’s partitioning scheme ensures that each partition is assigned to a specific instance of the application and that each instance only processes the partitions that it has been assigned. This enables Kafka Streams applications to scale horizontally by simply adding more instances of the application to the processing cluster, without requiring any changes to the application logic. In addition, Kafka Streams provides automatic rebalancing of stream partitions when new instances of the application are added or removed, ensuring that the processing load remains evenly distributed across the instances.
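
As a rough sketch of what this looks like in configuration, the snippet below uses hypothetical names and values: every instance started with the same application.id joins the same group and receives a share of the input partitions, and num.stream.threads controls how many tasks can run in parallel within a single instance.

```java
import java.util.Properties;

import org.apache.kafka.common.serialization.Serdes;
import org.apache.kafka.streams.KafkaStreams;
import org.apache.kafka.streams.StreamsBuilder;
import org.apache.kafka.streams.StreamsConfig;
import org.apache.kafka.streams.kstream.Consumed;
import org.apache.kafka.streams.kstream.Produced;

public class ScalingExample {
    public static void main(String[] args) {
        Properties props = new Properties();

        // All instances sharing this application.id form one group; Kafka Streams
        // splits the input partitions across them and rebalances automatically
        // as instances join or leave.
        props.put(StreamsConfig.APPLICATION_ID_CONFIG, "order-processing-app");
        props.put(StreamsConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");

        // Threads within a single instance are a second scaling knob:
        // each stream thread runs one or more tasks.
        props.put(StreamsConfig.NUM_STREAM_THREADS_CONFIG, 4);

        // Placeholder topology: pass records through unchanged.
        StreamsBuilder builder = new StreamsBuilder();
        builder.stream("orders", Consumed.with(Serdes.String(), Serdes.String()))
               .to("processed-orders", Produced.with(Serdes.String(), Serdes.String()));

        KafkaStreams streams = new KafkaStreams(builder.build(), props);
        streams.start();
        Runtime.getRuntime().addShutdownHook(new Thread(streams::close));
    }
}
```

Starting a second copy of this application, unchanged, on another machine is enough to trigger a rebalance that splits the input partitions between the two instances.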
