...

/

Failure Recovery in Flink

Failure Recovery in Flink

Let's explore an algorithm used by Flink to recover from failure andAS the guarantees provided by Flink.

As mentioned previously, stream processing applications in Flink are supposed to be long-lived. So there must be an efficient way to recover from failures without repeating a lot of work. For this purpose, Flink periodically checkpoints the operators’ state and the position of the consumed stream to generate this state. In case of a failure, an application can be restarted from the latest checkpoint and continue processing from there.

All this is achieved via an algorithm similar to the Chandy-Lamport algorithm for distributed snapshots, called Asynchronous Barrier Snapshotting (ABS).

Asynchronous Barrier Snapshotting (ABS)

The ABS algorithm operates slightly differently for acyclic and cyclic graphs, so we will examine the first case here, which is a bit simpler.

Working

The algorithm works in the following way:

  • The Job Manager periodically injects some control records in the stream, referred to as stage barriers. These records are supposed to divide the stream into stages. At the end of a stage, the set of operator states reflects the whole execution history up to the associated barrier. Thus it can be used for a snapshot.

  • When a source task receives a barrier, it takes a snapshot of its current state and then broadcasts the barrier to all its outputs.

  • When a non-source task receives a barrier from one of its inputs, it blocks that input until it has received ...