Distributed Snapshot Problem
Let's look into the problem of capturing distributed snapshots.
There is another basic problem in distributed systems that is strongly related to the notion of time and order. It is stated below.
Problem
Here is the problem: how to record a snapshot of the state of a distributed system comprising of multiple nodes that perform a continuous computation?
There are many more problems in distributed systems that can be expressed in terms of the problem of
and specific properties associated with it. We will only focus on the distributed snapshots problem in this lesson. detecting a global state This problem is also known as stable property detection and can have many different usages, such as detection of deadlocks or termination of a computation.
Distributed snapshots can be used as a recovery mechanism from a point in the past when failures happen.
Capturing distributed snapshots
A seminal algorithm used for capturing distributed snapshots is the
State of a distributed system
The state of a distributed system consists of the state of the various nodes and any messages that are in transit between the nodes.
Challenge in recording the state
The main challenge in recording the state of a distributed system is that the nodes that are part of the system do not have a common clock. So, they cannot record their local states at precisely the same instant.
As a result, the nodes have to coordinate with each other by exchanging messages so that each node records its state and the state of associated communication channels. Thus, the collective set of all node and channel states forms a global state.
Furthermore, any communication required by the snapshot protocol should not alter the underlying computation.
The
Analogy
Imagine a group of photographers observing a panoramic, dynamic scene such as a sky filled with migrating birds. This scene is so big that a single photograph cannot capture it. As a result, the photographers must take several snapshots and piece them together to form a picture of the overall scene. The snapshots cannot be taken at the same time, and the photographers should not disturb the process that is being photographed, i.e., they cannot get all the birds to remain motionless while the photographs are taken.
However, the composite picture should be meaningful.
Relating the analogy to distributed systems
As discussed in the analogy, the composite picture should be meaningful. This need for a meaningful snapshot also exists when talking about distributed systems.
For example, there’s no point in recovering a snapshot if that snapshot can lead the system to an erroneous or corrupted state.
A meaningful snapshot is termed as a consistent snapshot in the paper, which presents a formal definition of what this is.
An alternative definition is that of a consistent cut by
Note that the Chandy-Lamport algorithm produces snapshots that are also consistent cuts.
The formal definition of consistent snapshot will be presented here in a more simplified way for ease of understanding.
Consistent snapshot
Let’s assume a distributed system can be modeled as a directed graph, where vertices represent nodes of the system and edges represent communication channels.
An event ...