Disk Blocks & HDFS Blocks

We discussed the disk block at the start of the chapter. It is the smallest unit of data that can be read from or written to a disk, and is typically 512 bytes in size. The filesystem sitting on top of the physical disk does not work with disk blocks directly; it works with an abstraction called the filesystem block. A filesystem block is an integral multiple of the disk block size, usually a few kilobytes. This complexity is hidden from the end users of the filesystem.
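The relationship between the two block sizes can be sketched with a little arithmetic. The snippet below is only an illustration: the 4 KB filesystem block size is an assumed value (the text says only "a few kilobytes"), and the function names are hypothetical helpers, not part of any library.

```python
DISK_BLOCK_SIZE = 512   # bytes; the typical disk block size mentioned above
FS_BLOCK_SIZE = 4096    # bytes; ASSUMED filesystem block size for illustration


def disk_blocks_per_fs_block(fs_block_size=FS_BLOCK_SIZE,
                             disk_block_size=DISK_BLOCK_SIZE):
    """Return how many disk blocks make up one filesystem block."""
    if fs_block_size % disk_block_size != 0:
        raise ValueError("filesystem block must be a multiple of the disk block")
    return fs_block_size // disk_block_size


def fs_blocks_for_file(file_size, fs_block_size=FS_BLOCK_SIZE):
    """Return the number of filesystem blocks needed for file_size bytes.

    A local filesystem allocates whole blocks, so the count is rounded up.
    """
    return -(-file_size // fs_block_size)  # ceiling division on integers


print(disk_blocks_per_fs_block())   # 8 disk blocks per filesystem block
print(fs_blocks_for_file(10_000))   # 3 filesystem blocks for a 10,000-byte file
```

With these assumed sizes, even a 1-byte file occupies one whole filesystem block, which is the detail the filesystem hides from its users.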

HDFS is not a physical filesystem, but rather a virtual abstraction over distributed disk-based filesystems. HDFS can’t be browsed like the local filesystem; to browse it you need the HDFS shell, the HDFS web UI, or the programmatic APIs. The words block and block size have a different meaning in the HDFS context. Let’s explore them next.

HDFS block

A file in HDFS is logically divided into HDFS blocks. Each HDFS block is physically made up of filesystem blocks of the underlying filesystem, and each filesystem block is in turn an integral multiple of the disk block size.

One benefit of the block abstraction for a distributed filesystem like HDFS is that a file can be larger than any single disk in the cluster, because its blocks can be spread across many disks. In recent versions of Hadoop (2.x and later), HDFS has a default block size of 128 MB. However, if we store a 10 MB file, it’ll take up only 10 MB of disk space, not 128 MB. Storing a 1KB ...
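A minimal sketch of how a file's size maps onto HDFS blocks is shown below. It is plain arithmetic, not the HDFS API: `split_into_hdfs_blocks` is a hypothetical helper, the 128 MB default comes from the text above, and replication is ignored.

```python
MB = 1024 * 1024
HDFS_BLOCK_SIZE = 128 * MB  # default HDFS block size quoted in the text


def split_into_hdfs_blocks(file_size, block_size=HDFS_BLOCK_SIZE):
    """Return the size in bytes of each HDFS block a file would occupy.

    Every block is full except possibly the last one, which holds only
    the leftover bytes and is not padded out to block_size.
    """
    full_blocks, remainder = divmod(file_size, block_size)
    sizes = [block_size] * full_blocks
    if remainder:
        sizes.append(remainder)
    return sizes


# A 10 MB file fits in a single block that is only 10 MB long.
small = split_into_hdfs_blocks(10 * MB)
print(len(small), sum(small) // MB)      # 1 block, 10 MB in total

# A 300 MB file needs three blocks: 128 MB, 128 MB and 44 MB.
large = split_into_hdfs_blocks(300 * MB)
print([size // MB for size in large])    # [128, 128, 44]
```

The total space used always equals the file size, which is why a small file does not consume a full 128 MB on disk.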
