...

/

Distributed Snapshot Problem

Distributed Snapshot Problem

Let's look into the problem of capturing distributed snapshots.

There is another basic problem in distributed systems that is strongly related to the notion of time and order. It is stated below.

Problem

Here is the problem: how to record a snapshot of the state of a distributed system comprising of multiple nodes that perform a continuous computation?

There are many more problems in distributed systems that can be expressed in terms of the problem of detecting a global stateThis problem is also known as stable property detection and can have many different usages, such as detection of deadlocks or termination of a computation. and specific properties associated with it. We will only focus on the distributed snapshots problem in this lesson.

Distributed snapshots can be used as a recovery mechanism from a point in the past when failures happen.

Capturing distributed snapshots

A seminal algorithm used for capturing distributed snapshots is the Chandy-Lamport algorithmK. M. Chandy and L. Lamport, “Distributed Snapshots: Determining Global States of Distributed Systems,” ACM Transactions on Computer Systems (TOCS), Volume 3 Issue 1, Feb. 1985, 1985..

State of a distributed system

The state of a distributed system consists of the state of the various nodes and any messages that are in transit between the nodes.

Challenge in recording the state

The main challenge in recording the state of a distributed system is that the nodes that are part of the system do not have a common clock. So, they cannot record their local states at precisely the same instant.

As a result, the nodes have to coordinate with each other by exchanging messages so that each node records its state and the state of associated communication channels. Thus, the collective set of all node and channel states forms a global state.

Furthermore, any communication required by the snapshot protocol should not alter the underlying computation.

The paperK. M. Chandy and L. Lamport, “Distributed Snapshots: Determining Global States of Distributed Systems,” ACM Transactions on Computer Systems (TOCS), Volume 3 Issue 1, Feb. 1985, 1985. presents a very interesting and illuminating analogy for this problem.

Analogy

Imagine a group of photographers observing a panoramic, dynamic scene such as a sky filled with migrating birds. This scene is so big that a single photograph cannot capture it. As a result, the photographers must take several snapshots and piece them together to form a picture of the overall scene. The snapshots cannot be taken at the same time, and the photographers should not disturb the process that is being photographed, i.e., they cannot get all the birds to remain motionless while the photographs are taken.

However, the composite picture should be meaningful.

Relating the analogy to distributed systems

As discussed in the analogy, the composite picture should be meaningful. This need for a meaningful snapshot also exists when talking about distributed systems.

For example, there’s no point in recovering a snapshot if that snapshot can lead the system to an erroneous or corrupted state.

A meaningful snapshot is termed as a consistent snapshot in the paper, which presents a formal definition of what this is.

An alternative definition is that of a consistent cut by FriedemannM. Friedemann, “Virtual Time and Global States of Distributed Systems,” Parallel and Distributed Algorithms, 1988., which partitions the space-time diagram along the time axis in a way that respects causality, e.g., for each pair of events ee and ff, if ff is in the cut and efe\to f, then ee is also in the cut.

Note that the Chandy-Lamport algorithm produces snapshots that are also consistent cuts.

The formal definition of consistent snapshot will be presented here in a more simplified way for ease of understanding.

Consistent snapshot

Let’s assume a distributed system can be modeled as a directed graph, where vertices represent nodes of the system and edges represent communication channels.

An event ee ...