Disk Blocks & HDFS Blocks

This lesson covers disk blocks, filesystem blocks, and HDFS blocks.

We discussed the disk block at the start of the chapter. It is the smallest unit of data that can be read from or written to a disk. Disk blocks are traditionally 512 bytes in size. The filesystem sitting on top of the physical disk does not work with disk blocks directly. Instead, it works with an abstraction called the filesystem block, whose size is an integral multiple of the disk block size, usually a few kilobytes. This complexity is hidden from the end users of the filesystem.
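
As a small illustration of the "integral multiple" relationship, the sketch below computes how many disk blocks make up one filesystem block. The function name is hypothetical, and the 4096-byte filesystem block is an assumed example value, not something fixed by this lesson.

```python
def disk_blocks_per_fs_block(fs_block_size: int, disk_block_size: int = 512) -> int:
    """Return how many disk blocks make up one filesystem block."""
    # A filesystem block must be a whole number of disk blocks.
    if fs_block_size % disk_block_size != 0:
        raise ValueError(
            "filesystem block size must be an integral multiple of the disk block size"
        )
    return fs_block_size // disk_block_size


# Assumed example: a 4096-byte filesystem block on a disk with 512-byte blocks.
print(disk_blocks_per_fs_block(4096))  # prints 8
```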

HDFS is not a physical filesystem, but rather a virtual abstraction over distributed disk-based filesystems. HDFS can’t be browsed like the local filesystem; you need the HDFS shell, the HDFS web UI, or programmatic APIs to do that. The words block and block size have a different meaning in the HDFS context. Let’s explore them next.
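
To make the "HDFS shell" option concrete, here is a minimal sketch that wraps the `hdfs dfs -ls` shell command from Python. The helper names and the path `/user/demo` are hypothetical, and the sketch assumes a Hadoop client is installed with the `hdfs` command on the PATH.

```python
import shutil
import subprocess
from typing import List


def hdfs_ls_command(path: str) -> List[str]:
    """Build the HDFS shell command that lists a directory."""
    return ["hdfs", "dfs", "-ls", path]


def list_hdfs_dir(path: str) -> str:
    """Run the listing and return its text output (requires a Hadoop client)."""
    if shutil.which("hdfs") is None:
        raise RuntimeError("the 'hdfs' command was not found on PATH")
    result = subprocess.run(
        hdfs_ls_command(path), capture_output=True, text=True, check=True
    )
    return result.stdout


# Only show the command here; running it needs access to a cluster.
print(" ".join(hdfs_ls_command("/user/demo")))  # prints: hdfs dfs -ls /user/demo
```

On a machine with cluster access, `list_hdfs_dir("/user/demo")` would return the same listing the shell command prints.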

HDFS block

A file in HDFS is logically divided into HDFS blocks. Each HDFS block is physically made up of filesystem blocks of the underlying filesystem, and the size of each filesystem block is in turn an integral multiple of the disk block size.
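
The logical division can be sketched as simple arithmetic over byte offsets. This is an illustrative model, not HDFS code: the function name is hypothetical, and the 128MB block size and 300MB file size are assumed example values.

```python
from typing import List, Tuple

MB = 1024 * 1024


def split_into_blocks(file_size: int, block_size: int = 128 * MB) -> List[Tuple[int, int]]:
    """Return (offset, length) in bytes for each logical block of a file."""
    blocks = []
    offset = 0
    while offset < file_size:
        # Every block is full-sized except possibly the last one.
        length = min(block_size, file_size - offset)
        blocks.append((offset, length))
        offset += length
    return blocks


# Assumed example: a 300MB file with a 128MB block size.
for offset, length in split_into_blocks(300 * MB):
    print(f"offset={offset // MB}MB length={length // MB}MB")
# offset=0MB length=128MB
# offset=128MB length=128MB
# offset=256MB length=44MB
```

Note that the last block is only as long as the remaining data.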

The benefit of the block abstraction for a distributed filesystem like HDFS is that a file can be larger than any single disk in the cluster, because its blocks can be spread across many disks. In Hadoop 2.x and later, HDFS has a default block size of 128MB. A file larger than that is split into multiple HDFS blocks, but a smaller file does not occupy a full block: a 10MB file takes up only 10MB of disk space, not 128MB. Likewise, storing a 1KB file in HDFS doesn’t mean that a 128MB block is written to disk. Underneath, HDFS relies on the much smaller blocks of the underlying filesystem, so a 1KB file uses roughly 1KB of disk space.

The 128MB HDFS block therefore isn’t a unit of physical storage allocation. Instead, it is an abstraction for the metadata kept in the Namenode: it is the smallest unit the Namenode can reference in its memory. The underlying physical filesystem isn’t divided into HDFS block-sized chunks.

Let’s explore another example. If we have three files of 10KB each, the space consumed on disk would be 3 x 10KB = 30KB, whereas the Namenode would hold three HDFS blocks in memory, one per file.
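
The arithmetic in the three-file example can be checked with a short sketch. It is a simplified model under stated assumptions: the function name is hypothetical, the block size is the 128MB default, and replication and per-block bookkeeping files are ignored.

```python
from typing import Iterable, Tuple

KB = 1024
MB = 1024 * KB


def storage_summary(file_sizes: Iterable[int], block_size: int = 128 * MB) -> Tuple[int, int]:
    """Return (bytes consumed on disk, HDFS blocks tracked by the Namenode)."""
    sizes = list(file_sizes)
    # Disk usage follows the actual file sizes, not the HDFS block size.
    disk_bytes = sum(sizes)
    # Each file needs ceil(size / block_size) blocks; integer ceiling division.
    block_count = sum((size + block_size - 1) // block_size for size in sizes)
    return disk_bytes, block_count


disk_bytes, block_count = storage_summary([10 * KB, 10 * KB, 10 * KB])
print(f"{disk_bytes // KB}KB on disk, {block_count} blocks in Namenode memory")
# prints: 30KB on disk, 3 blocks in Namenode memory
```

The same function shows why many small files are costly for the Namenode: the block count grows with the number of files even when the total data size stays small.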
