ZooKeeper

ZooKeeper is a crucial piece of many enterprise-scale Big Data deployments. It is a centralized service for maintaining configuration information, naming, providing distributed synchronization, and providing group services, all of which distributed applications rely on. According to the official website, ZooKeeper gets its name because coordinating distributed systems is a zoo.

At its core, ZooKeeper is simple to understand. Think of it as a hierarchical filesystem or a tree. The basic building block of ZooKeeper is a znode. A znode can store data (like a file), have child znodes (like a directory), or both. The overall design of ZooKeeper provides for a highly available system consisting of znodes that make up a hierarchical namespace. The following is a representation of znodes:
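As a rough illustration of the hierarchical namespace, the sketch below models a znode tree in memory. The `ZNode` and `ZNodeTree` classes and the `/app/workers` paths are hypothetical names chosen for this example; this is not the ZooKeeper client API, only a toy model of the data structure.

```python
class ZNode:
    """Hypothetical in-memory stand-in for a znode (not the ZooKeeper API)."""

    def __init__(self, data=b""):
        self.data = data      # payload, like the contents of a file
        self.children = {}    # child name -> ZNode, like directory entries


class ZNodeTree:
    """Toy hierarchical namespace addressed by slash-separated paths."""

    def __init__(self):
        self.root = ZNode()

    def _walk(self, path):
        # Follow each path component down from the root.
        node = self.root
        for part in [p for p in path.split("/") if p]:
            node = node.children[part]  # raises KeyError if the path is missing
        return node

    def create(self, path, data=b""):
        # The parent must already exist, as in a filesystem.
        parent_path, _, name = path.rpartition("/")
        parent = self._walk(parent_path)
        if name in parent.children:
            raise ValueError(f"{path} already exists")
        parent.children[name] = ZNode(data)

    def get(self, path):
        return self._walk(path).data

    def get_children(self, path):
        return sorted(self._walk(path).children)


tree = ZNodeTree()
tree.create("/app", b"config-root")       # holds data and has children
tree.create("/app/workers")
tree.create("/app/workers/w1", b"host-a")
tree.create("/app/workers/w2", b"host-b")

print(tree.get("/app"))
print(tree.get_children("/app/workers"))
```

Here `/app` both stores data and has a child, which is the one place where the filesystem analogy bends: a znode can act as a file and a directory at the same time.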

ZooKeeper can run as a single server in standalone mode or, in replicated mode, on a cluster of machines called an ensemble. High availability in replicated mode is achieved by ensuring that modifications to the znode tree are replicated to a majority of the ensemble. If a minority of machines in the ensemble fail, at least one live machine in the ensemble will have the latest state. Let’s consider an example. Suppose we have five machines (A, B, C, D, and E) running a ZooKeeper ensemble. An update must reach a majority of the machines, called a quorum, which is three machines out of five. Say machines A, C, and E get the update. Now, if a minority of the machines fail, at most two in this case, the service should continue to function correctly. Let’s say machines A and E fail. Then there’s at least one machine, C, which has the latest ...
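The majority argument in this example can be checked with a few lines of arithmetic. The sketch below is only a model of the counting, under the assumption that an update is held by exactly the quorum A, C, and E; the variable and function names are made up for illustration and have nothing to do with ZooKeeper's actual replication protocol.

```python
from itertools import combinations

ensemble = ["A", "B", "C", "D", "E"]
quorum_size = len(ensemble) // 2 + 1         # majority of 5 machines is 3
max_failures = len(ensemble) - quorum_size   # a minority: at most 2 machines

updated = {"A", "C", "E"}                    # machines that received the update


def survivors_with_update(failed):
    """Machines that are still alive and hold the latest update."""
    return sorted(updated - set(failed))


# The case from the text: A and E fail, C still has the update.
print(survivors_with_update(["A", "E"]))

# Every possible pair of failures leaves at least one updated machine alive.
always_covered = all(
    survivors_with_update(failed)
    for failed in combinations(ensemble, max_failures)
)
print(always_covered)
```

The check works because a quorum of three and a failed set of two cannot cover each other: at most two of the three updated machines can be among the failures.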
