Writing and Reading

This lesson explains the write and read pipelines in HDFS.

We’ll now study the interactions between a client application and HDFS when reading or writing files.

Write path

A client initiates the write process. The client could be an application using the Java API or a person working with the hdfs command-line utility. The interaction between the client and HDFS proceeds as follows:

  • The client initially buffers data on its local disk. It waits until one HDFS block's worth of data has accumulated before contacting the Namenode.

  • The Namenode, once contacted by the client, verifies that the file does not already exist and that the client has the permissions required to create it. If these checks pass, the Namenode records the new file in its namespace. It then returns to the client a list of DataNodes to write to. These DataNodes host the blocks (and their replicas) that make up the file.

  • Upon receiving the list from the Namenode, the client starts writing to the first DataNode.

  • That first DataNode receives data from the client in portions. It receives the first portion, writes it to its local repository, and then starts transferring that portion to the second DataNode in the list.

  • The second DataNode receives each portion from the first, writes it to its local repository, and starts transferring it to the third DataNode in the list.

  • A pipeline of data transfer is formed from the client to all the involved DataNodes. A DataNode can, at the same time, receive and transfer data.
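
The forwarding steps above can be sketched as a small in-memory simulation. This is only an illustration under stated assumptions: WritePipelineSketch, SimulatedDataNode, buildPipeline, and write are hypothetical names and not part of the Hadoop API. The forwarding here is also sequential, whereas a real DataNode receives and transfers data at the same time.

```java
import java.io.ByteArrayOutputStream;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Illustrative in-memory model only; none of these names come from Hadoop.
class WritePipelineSketch {

    /** Hypothetical stand-in for a DataNode: stores a portion, then forwards it. */
    static class SimulatedDataNode {
        final String name;
        final List<byte[]> localRepository = new ArrayList<>();
        SimulatedDataNode next; // null for the last DataNode in the list

        SimulatedDataNode(String name) {
            this.name = name;
        }

        void receive(byte[] portion) {
            // Write the portion to the local repository first...
            localRepository.add(portion.clone());
            // ...then pass the same portion to the next DataNode, if any.
            if (next != null) {
                next.receive(portion);
            }
        }

        /** Concatenates all stored portions, in arrival order. */
        byte[] storedBytes() {
            ByteArrayOutputStream out = new ByteArrayOutputStream();
            for (byte[] portion : localRepository) {
                out.write(portion, 0, portion.length);
            }
            return out.toByteArray();
        }
    }

    /** Links the DataNodes in list order, as in the list returned by the Namenode. */
    static List<SimulatedDataNode> buildPipeline(List<String> names) {
        List<SimulatedDataNode> nodes = new ArrayList<>();
        for (String name : names) {
            SimulatedDataNode node = new SimulatedDataNode(name);
            if (!nodes.isEmpty()) {
                nodes.get(nodes.size() - 1).next = node;
            }
            nodes.add(node);
        }
        return nodes;
    }

    /** The client splits its data into portions and sends each to the first DataNode only. */
    static void write(List<SimulatedDataNode> pipeline, byte[] data, int portionSize) {
        if (portionSize <= 0) {
            throw new IllegalArgumentException("portionSize must be positive");
        }
        SimulatedDataNode first = pipeline.get(0);
        for (int offset = 0; offset < data.length; offset += portionSize) {
            int end = Math.min(offset + portionSize, data.length);
            first.receive(Arrays.copyOfRange(data, offset, end));
        }
    }

    public static void main(String[] args) {
        List<SimulatedDataNode> pipeline = buildPipeline(Arrays.asList("dn1", "dn2", "dn3"));
        byte[] data = "hello hdfs pipeline".getBytes(StandardCharsets.UTF_8);
        write(pipeline, data, 4);
        for (SimulatedDataNode node : pipeline) {
            System.out.println(node.name + " stored " + node.localRepository.size() + " portions");
        }
    }
}
```

Note that the client only ever talks to the first DataNode in this sketch; every other copy is produced by a DataNode forwarding what it has just stored, which is the property the pipeline is designed around.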

Below is a pictorial representation of how data is written in HDFS.
