Detailed Design of Chubby: Part IV
Learn about fault-tolerance, scalability, and availability of Chubby's design.
Failovers
Nodes can fail, so we need a solid strategy to minimize downtime. One way to reduce downtime is fast failover. Let’s discuss the failover scenario and how our system handles such cases.
When a primary replica fails or loses leadership, it discards its in-memory state about sessions, handles, and locks. A new primary replica must then be elected, and there are two possible outcomes:
- If a new primary replica is elected rapidly, clients can connect to it before their own approximation of the primary replica’s lease (the local lease) runs out.
- If the election takes longer, clients empty their caches and wait out the grace period while trying to discover the new primary replica (see the sketch after this list). Thanks to the grace period, our system can keep sessions running through failovers that exceed the normal lease timeout.
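To make the second case concrete, here is a minimal sketch of how a client library might locate the new primary replica during the grace period. It assumes the client can poll the cell’s replicas (for example, those advertised in DNS) and that any replica can report the current primary once the election finishes; `askReplica` and `findNewPrimary` are hypothetical stand-ins, not Chubby’s actual client API.

```go
package chubbysketch

import (
	"errors"
	"time"
)

// findNewPrimary polls each replica in the cell, asking which node is the
// current primary, and retries until one answers or the deadline (the end of
// the client's grace period) passes.
func findNewPrimary(replicas []string,
	askReplica func(addr string) (primary string, err error),
	deadline time.Time) (string, error) {

	for time.Now().Before(deadline) {
		for _, addr := range replicas {
			// Once the election has finished, any replica can report the
			// identity of the new primary.
			if primary, err := askReplica(addr); err == nil && primary != "" {
				return primary, nil
			}
		}
		time.Sleep(time.Second) // back off before polling the cell again
	}
	return "", errors.New("grace period elapsed without locating a primary")
}
```

If the deadline passes without locating a primary, the client library reports the session as expired to the application.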
Once a client has made contact with the new primary replica, the client library and the primary replica work together to give the application the impression that nothing went wrong. To do so, the new primary replica must recreate the in-memory state of the primary replica it replaced. It accomplishes this in part by:
- Reading data that has been replicated via the normal database replication protocol and stored durably on disk
- Acquiring state from clients
- Making cautious assumptions
Note: All sessions, held locks, and ephemeral files are recorded in the database.
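As a rough illustration of the first and third points above, the sketch below shows how a newly elected primary might reload durable records and make cautious assumptions about them. The record types, field names, and `rebuildState` function are all hypothetical; the real recovery path also incorporates handle state presented later by clients.

```go
package chubbysketch

import "time"

// Hypothetical durable records. Per the note above, sessions, held locks, and
// ephemeral files are all recorded in the replicated database, so the new
// primary can reload them from disk after a failover.
type SessionRecord struct {
	ID         uint64
	ClientAddr string
}

type LockRecord struct {
	Path      string
	SessionID uint64
}

// rebuildState reconstructs the in-memory tables: reload sessions and locks
// from the database, and make the cautious assumption that every recovered
// session is still live, granting it a fresh lease rather than guessing what
// the old primary had promised.
func rebuildState(sessions []SessionRecord, locks []LockRecord,
	newLease time.Duration) (leases map[uint64]time.Time, owners map[string]uint64) {

	leases = make(map[uint64]time.Time)
	for _, s := range sessions {
		leases[s.ID] = time.Now().Add(newLease) // conservative fresh lease
	}

	owners = make(map[string]uint64)
	for _, l := range locks {
		owners[l.Path] = l.SessionID // lock ownership survives the failover
	}
	return leases, owners
}
```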
Newly elected primary replica proceedings
The proceedings of the newly elected primary replica are shown in the following slides.
Quiz
After analyzing the proceedings of the newly elected primary replica shown in the slides above, answer the following questions:
In slide 8 of the slide deck above, how does the new primary replica prevent the recreation of an already closed handle?
Why does the client library go to such lengths to keep the old session going? Why not simply create a new session once it can contact the new primary replica?
Example scenario
The following illustration depicts the progression of an extended primary replica failover event in which the client must use its grace period to keep its session alive. Time increases from top to bottom, but the axis is not drawn to scale.
Note: Arrows from left to right show KeepAlive requests, and the ones from right to left show their replies.
Note: During the grace period, the client cannot tell whether its lease at the primary replica has already expired. It does not tear down its session, but it blocks all API calls from applications so that they do not observe inconsistent data.
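The sketch below mirrors this timeline from the client library’s point of view: it keeps one KeepAlive outstanding, enters jeopardy and blocks application calls when the local lease expires without a reply, and gives up only after the grace period. `sendKeepAlive`, `blockAPIs`, and `unblockAPIs` are hypothetical stand-ins, and the 12-second initial lease is only an illustrative value.

```go
package chubbysketch

import (
	"fmt"
	"time"
)

// keepAliveLoop models the client side of the KeepAlive exchange shown in the
// illustration. sendKeepAlive stands in for the real RPC: the primary holds
// the request until the lease is close to expiring, then replies with a new
// lease duration.
func keepAliveLoop(sendKeepAlive func() (newLease time.Duration, err error),
	gracePeriod time.Duration, blockAPIs, unblockAPIs func()) {

	localLease := time.Now().Add(12 * time.Second) // illustrative initial local lease

	for {
		replyCh := make(chan time.Duration, 1)
		go func() {
			if lease, err := sendKeepAlive(); err == nil {
				replyCh <- lease
			}
		}()

		select {
		case lease := <-replyCh:
			// Reply arrived in time: extend the local lease and immediately
			// send the next KeepAlive.
			localLease = time.Now().Add(lease)

		case <-time.After(time.Until(localLease)):
			// Local lease expired with no reply: enter jeopardy. Block
			// application calls and wait out the grace period for a reply
			// from the (possibly new) primary.
			blockAPIs()
			select {
			case lease := <-replyCh:
				localLease = time.Now().Add(lease)
				unblockAPIs() // session saved; applications resume
			case <-time.After(gracePeriod):
				fmt.Println("session expired; report the failure to the application")
				return
			}
		}
	}
}
```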
Point to ponder
How did Google implement Chubby’s database?
Backup
Each Chubby cell’s primary replica writes a snapshot of its database every few hours to a GFS file server in a separate building. Using a different site ensures that the backup in GFS survives damage to any single site and that the backups do not introduce cyclic dependencies into the system; a GFS cell deployed in the same building might otherwise depend on the very Chubby cell it stores backups for to elect its primary replica.
Backups offer both disaster recovery and a way to set up the database of a replacement replica without putting additional stress on active replica servers.
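A minimal sketch of such a backup job appears below. `snapshotDB`, `writeToGFS`, and the destination path are hypothetical placeholders; the actual snapshot format and GFS interaction are internal to Chubby.

```go
package chubbysketch

import (
	"fmt"
	"time"
)

// backupLoop periodically writes a database snapshot to a GFS file server in
// a different building, as described above.
func backupLoop(snapshotDB func() ([]byte, error),
	writeToGFS func(path string, data []byte) error,
	interval time.Duration) {

	for range time.Tick(interval) { // e.g., every few hours
		snap, err := snapshotDB()
		if err != nil {
			continue // try again at the next interval
		}
		// The destination cell lives in another building so that a single-site
		// failure cannot take out both the cell and its backups.
		path := fmt.Sprintf("/gfs/other-building-cell/chubby-backup-%d", time.Now().Unix())
		if err := writeToGFS(path, snap); err != nil {
			fmt.Println("backup failed, will retry at the next interval:", err)
		}
	}
}
```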
Mirroring
Chubby enables