Delve into Big Data essentials, explore data types, and gain insights into Hadoop components like YARN, MapReduce, HDFS, and Spark. Discover foundations to excel in the growing Big Data field.

This course offers a rich, interactive way to learn the fundamentals of Big Data. Throughout the course, you will have plenty of opportunities to get hands-on experience with functioning Hadoop clusters.

You will start by learning about the rise of Big Data and the different types of data: structured, unstructured, and semi-structured. You will then dive into the core Big Data technologies: YARN (Yet Another Resource Negotiator), MapReduce, HDFS (Hadoop Distributed File System), and Spark.

By the end of this course, you will have the foundations in place to start working with Big Data, a rapidly growing field.

Introduction to Big Data and Hadoop

## The Big Picture

In this lesson, we'll discuss the architecture of HDFS, its goals, and its limitations. The Hadoop Distributed File System (HDFS) was designed with the following goals in mind:

+ __Large files:__ The system should store large files ranging from several hundred gigabytes to petabytes in size.

+ __Streaming data access:__ HDFS is optimized for a ___write-once, read-many-times___ pattern. The time to read the entire dataset matters more than the latency of reading the first record. HDFS doesn't support multiple concurrent writers to the same file. An existing file can only be appended to at its end; modifying it at an arbitrary offset is not possible.

+ __Commodity hardware:__ Hadoop is designed to run on clusters of inexpensive commodity hardware and does not require specialized hardware. Hardware failures are likely in such clusters, but the system is expected to keep working correctly, so HDFS is built to be highly fault-tolerant.
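
To make the streaming-access rules concrete, here is a minimal sketch of an append-only file with a single writer. It is a toy in-memory model, not HDFS code: the class `AppendOnlyFile`, its method names, and the client names are all hypothetical and chosen only for illustration.

```python
import io


class AppendOnlyFile:
    """Toy in-memory model (hypothetical, not an HDFS API) of a file that
    can only grow at its end and admits one writer at a time."""

    def __init__(self):
        self._data = bytearray()
        self._writer = None  # client currently allowed to write, if any

    def open_for_append(self, client):
        # Only one client may hold the file open for writing.
        if self._writer is not None and self._writer != client:
            raise PermissionError(f"{self._writer} is already writing this file")
        self._writer = client

    def close(self, client):
        # Release the file so another client can become the writer.
        if self._writer == client:
            self._writer = None

    def append(self, client, payload):
        # New bytes always land at the current end of the file.
        if self._writer != client:
            raise PermissionError(f"{client} has not opened the file for append")
        self._data.extend(payload)
        return len(self._data)

    def write_at(self, client, offset, payload):
        # Writing at an arbitrary offset is deliberately unsupported.
        raise io.UnsupportedOperation("random writes are not supported")

    def read(self, offset=0, length=None):
        # Any number of readers can read; reads never change the data.
        end = None if length is None else offset + length
        return bytes(self._data[offset:end])
```

A client that calls `open_for_append` and then `append` twice sees its bytes in order when reading back, while a second client calling `open_for_append` before the first calls `close` gets a `PermissionError`.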

## Working of HDFS

A filesystem, distributed or local, must know the location of the disk blocks making up a file. Only then can it retrieve blocks for a client. Additionally, the filesystem should return any metadata related to the file to the client. These requirements motivate the two software daemons that make up HDFS:

+ Namenode (NN)
+ Datanode (DN)
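
The division of labor between the two daemons can be sketched with a toy model: one object tracks which blocks make up a file and where they live, while the others hold the block bytes. Everything here is an assumption made for illustration: the class and function names, the round-robin placement, the single copy per block, and the tiny 4-byte block size are not taken from HDFS itself.

```python
class DataNode:
    """Toy stand-in for a Datanode: stores block contents by block id."""

    def __init__(self, name):
        self.name = name
        self.blocks = {}  # block id -> bytes

    def store(self, block_id, data):
        self.blocks[block_id] = data

    def fetch(self, block_id):
        return self.blocks[block_id]


class NameNode:
    """Toy stand-in for a Namenode: holds metadata only, never file data."""

    def __init__(self, datanodes):
        self.datanodes = list(datanodes)
        self.file_blocks = {}     # path -> ordered list of block ids
        self.block_location = {}  # block id -> DataNode holding the block
        self._next_id = 0

    def add_block(self, path):
        # Assumed placement policy: round-robin over the datanodes.
        block_id = self._next_id
        self._next_id += 1
        node = self.datanodes[block_id % len(self.datanodes)]
        self.file_blocks.setdefault(path, []).append(block_id)
        self.block_location[block_id] = node
        return block_id, node

    def get_block_locations(self, path):
        # Answers the question "which blocks make up this file, and where?"
        return [(b, self.block_location[b]) for b in self.file_blocks[path]]

    def stat(self, path):
        # File metadata returned to the client.
        return {"path": path, "blocks": len(self.file_blocks[path])}


def write_file(namenode, path, data, block_size=4):
    # Client side: split the data into blocks and place each one.
    for start in range(0, len(data), block_size):
        block_id, node = namenode.add_block(path)
        node.store(block_id, data[start:start + block_size])


def read_file(namenode, path):
    # Client side: ask for block locations, then fetch blocks in order.
    locations = namenode.get_block_locations(path)
    return b"".join(node.fetch(block_id) for block_id, node in locations)
```

Note that the file's bytes never pass through `NameNode` in this sketch: the client asks it only for locations and metadata, then exchanges block data with the `DataNode` objects directly.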

This lesson gives the reader a new perspective on HDFS.

The Big Picture

Hadoop

YARN

MapReduce

HDFS

Spark

Input & Output Formats

Misc

Quiz

Reference: Replication

Reference: Partitioning

Reference: Transactions

Reference: Issues in Distributed Systems
