Clocks
Learn about the importance of clocks in a distributed system.
Why is time necessary?
Before we go deeper into the phenomenon of clock skew, let’s look at why the notion of time is essential in software applications:
If we are watching a video online and the frames arrive out of order, so that later scenes appear before earlier ones, the video becomes hard to follow and the user experience suffers.
The notion of time becomes very important in the field of observability to calculate essential metrics. For example:
What is the average latency of a particular endpoint?
What is the throughput of a specific service over a sliding window of five minutes?
What is the peak time during which most of the traffic on the website occurs?
In database replication, time synchronization is fundamental to replicating and sequencing data. If transactions arrive out of order, so that newer transactions execute before older ones, the results can be catastrophic and lead to data loss.
A consistent notion of time is very helpful when tracing a distributed request, both to measure the time taken by each component and to establish the order of execution.
Thus the notion of time and order of events based on time plays a vital role in the success of a distributed system.
Computer clocks
Every computer ships with a hardware clock made of a quartz crystal oscillator. Quartz crystals oscillate with a specific frequency when a particular voltage is applied, and the clock counts these oscillations. A specified number of oscillations is called a tick. Every tick represents a unit of time. The clock internally manages a counter and increments it to mark a tick. The operating system contains a software clock that uses the hardware clock.
These hardware devices are not perfectly accurate, so each computer can have its own notion of time, which can be ahead of or behind the global notion of time. Moreover, the oscillation frequency can be affected by the computer's physical environment, particularly temperature variations. So, in a distributed setup, we cannot assume that two clocks running on two different hosts report the same time.
For example, in a leaderless database like Cassandra, which uses last write wins as a conflict resolution strategy, writes can be silently lost if two hosts have different notions of time. There are two different terms for time differences:
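As a rough illustration of how quickly a small frequency error adds up, the sketch below converts a drift rate into accumulated error. The 50 parts-per-million figure is an assumed, illustrative value rather than a measurement of any particular clock, and `accumulated_drift_seconds` is a hypothetical helper name.

```python
SECONDS_PER_DAY = 24 * 60 * 60


def accumulated_drift_seconds(elapsed_seconds: float, drift_ppm: float) -> float:
    """Return the clock error built up over elapsed_seconds with no resynchronization."""
    return elapsed_seconds * drift_ppm / 1_000_000


# Assumed example: an oscillator that is off by 50 ppm, left alone for one day.
error = accumulated_drift_seconds(SECONDS_PER_DAY, drift_ppm=50)
print(f"Error after one day: {error:.2f} s")  # prints "Error after one day: 4.32 s"
```

Under that assumption, two hosts drifting in opposite directions could disagree by several seconds within a day if nothing corrected them.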
Clock skew: The difference between the time on two clocks running on two host machines.
Clock drift: The gradual deviation of a particular clock from a reference standard, such as an atomic or GPS clock, caused by its oscillator running slightly fast or slow.
Computer clocks are kept in sync using the Network Time Protocol (NTP), which adjusts a local clock to the time reported by a group of servers. These servers, in turn, get their time from a more accurate source, like a GPS, atomic, or radio clock.
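To make the last-write-wins risk described above concrete, here is a minimal sketch of how clock skew can cause a newer write to be discarded. It is a toy model, not Cassandra's implementation: the `Write` class, the `last_write_wins` function, and the 5-second skew are all assumptions chosen for illustration.

```python
from dataclasses import dataclass


@dataclass
class Write:
    value: str
    timestamp: float  # wall-clock time reported by the node that accepted the write


def last_write_wins(a: Write, b: Write) -> Write:
    """Keep whichever write carries the larger timestamp."""
    return a if a.timestamp > b.timestamp else b


# Assumed scenario: node A's clock runs 5 seconds ahead of node B's.
SKEW_SECONDS = 5.0
true_time_first = 100.0   # a client writes "v1" through node A
true_time_second = 102.0  # two seconds later, a client writes "v2" through node B

first = Write("v1", true_time_first + SKEW_SECONDS)  # stamped 105.0 by the fast clock
second = Write("v2", true_time_second)               # stamped 102.0

winner = last_write_wins(first, second)
print(winner.value)  # prints "v1": the older write wins and "v2" is discarded
```

The write that really happened last loses because its timestamp came from the slower clock, and nothing in the comparison can detect that.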
There are two types of clocks:
Time-of-day clock
Monotonic clock
Time-of-day clock
The time-of-day clock returns the current date and time according to a calendar reference. It is also called wall-clock time. These clocks periodically synchronize with NTP and, as a result, can jump either forward or backward. Linux's CLOCK_REALTIME is an implementation of the time-of-day clock.
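As a small sketch of reading the time-of-day clock, the Python snippet below uses the standard library's `time.time()` and, where the platform exposes it, `time.clock_gettime(time.CLOCK_REALTIME)`. The variable names are illustrative only.

```python
import time
from datetime import datetime, timezone

# Seconds since the Unix epoch (1970-01-01 UTC), read from the wall clock.
epoch_seconds = time.time()
print(datetime.fromtimestamp(epoch_seconds, tz=timezone.utc).isoformat())

# On Linux and other Unix-like systems, the same clock can be read by name.
if hasattr(time, "CLOCK_REALTIME"):
    realtime_seconds = time.clock_gettime(time.CLOCK_REALTIME)
    print(realtime_seconds)
```

Because this clock can be stepped forward or backward, the difference between two of its readings is not a reliable measure of elapsed time.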
Monotonic clock
A monotonic clock measures a duration or elapsed interval from an arbitrary fixed point in the past, so its absolute value is meaningless and only the difference between two readings is useful. It always moves forward: instead of stepping the time while synchronizing, NTP adjusts the rate at which the clock advances. Linux's CLOCK_MONOTONIC is an implementation of the monotonic clock.
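A typical use of a monotonic clock is timing a piece of work. The sketch below uses Python's `time.monotonic()`; the `timed` helper is a hypothetical name introduced for this example.

```python
import time


def timed(func, *args, **kwargs):
    """Run func and return (result, elapsed_seconds) measured on the monotonic clock."""
    start = time.monotonic()
    result = func(*args, **kwargs)
    elapsed = time.monotonic() - start
    return result, elapsed


result, elapsed = timed(sum, range(1_000_000))
print(f"sum={result}, took {elapsed:.6f} s")
```

Since the monotonic clock never goes backward, `elapsed` cannot be negative, even if the wall clock is adjusted while `func` is running.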
There are other variations of clocks, like atomic clocks, which give precise time without a lot of drift. They work by monitoring the resonant frequency of atoms. Atomic clocks are too expensive to be installed on commodity hardware, however, and are primarily used in space exploration, GPS, and military use cases.
NTP
NTP is a UDP-based protocol that synchronizes local clocks over variable-latency, packet-switched networks by choosing suitable time servers from a hierarchy. Under good network conditions, NTP can typically keep clocks within a few milliseconds of UTC.
The levels of this hierarchy of time servers are called strata (singular: stratum). There are 16 strata, numbered 0 to 15.
Stratum 0 represents the most accurate clocks like GPS, atomic, and radio clocks. They are called reference clocks and they don’t synchronize among themselves.
Stratum 1 devices maintain a direct connection to stratum 0 and might be a few microseconds behind stratum 0 clocks. They also internally synchronize among themselves in case stratum 0 is unavailable. They are called primary clocks.
Stratum 2 devices are directly connected to stratum 1 and might be a few milliseconds behind stratum 1 clocks. They also internally synchronize among themselves for better accuracy.
This hierarchy can continue down to stratum 15. The hierarchy exists for scalability reasons, since not every host machine can connect directly to stratum 0 clocks.
Note: Many large technology companies maintain their own in-house NTP servers, for example at the stratum 3 layer, synchronized to stratum 2 servers. Their host machines, in turn, synchronize with those in-house NTP servers.
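The lesson does not spell out how a client estimates its error from a server's reply, so the following is a sketch of the standard four-timestamp calculation that NTP uses, not a full NTP client. The function name `ntp_offset_and_delay` and the sample timestamps are assumptions made for illustration.

```python
def ntp_offset_and_delay(t0: float, t1: float, t2: float, t3: float):
    """Estimate clock offset and round-trip delay from one request/response exchange.

    t0: client clock when the request is sent
    t1: server clock when the request is received
    t2: server clock when the response is sent
    t3: client clock when the response is received
    """
    offset = ((t1 - t0) + (t2 - t3)) / 2  # how far the client is behind the server
    delay = (t3 - t0) - (t2 - t1)         # time on the network, excluding server processing
    return offset, delay


# Assumed scenario: the client is 0.5 s behind the server, each network hop
# takes 20 ms, and the server spends 1 ms processing the request.
offset, delay = ntp_offset_and_delay(t0=10.000, t1=10.520, t2=10.521, t3=10.041)
print(f"offset = {offset:.3f} s, round-trip delay = {delay:.3f} s")
```

The offset estimate is exact only when the outbound and return network paths take the same time. Asymmetric paths introduce an error of up to half the round-trip delay, which is one reason accuracy degrades at higher-numbered strata.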