Mapper Input

This lesson explains concepts relating to input data for map tasks.


Input splits

Our example demonstrates a simplistic scenario where the input for the MR job is entirely contained in a single file. In reality, the input to an MR job usually consists of several gigabytes of data, which is split among multiple map tasks.

Each map task works on a unit of data called the input split.

Hadoop divides the MR job's input into equal-sized chunks, and each map task works on one such chunk: the input split. A user can tweak the size of the input split. As a corollary, the number of map tasks spawned for an MR job equals the number of input splits.
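As a rough sketch of how this tweaking is done (the job name, input path, and sizes below are illustrative placeholders, not values from this lesson), a driver that uses the standard FileInputFormat can hint at the desired split size like this:

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.mapreduce.Job;
    import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;

    public class SplitSizeDriver {
        public static void main(String[] args) throws Exception {
            Configuration conf = new Configuration();
            Job job = Job.getInstance(conf, "split-size-demo"); // hypothetical job name

            // Placeholder input path; point this at the job's real input directory.
            FileInputFormat.addInputPath(job, new Path("/data/input"));

            // Ask for splits between 64 MB and 128 MB. FileInputFormat also
            // factors in the HDFS block size, so these act as bounds rather
            // than exact values.
            FileInputFormat.setMinInputSplitSize(job, 64L * 1024 * 1024);
            FileInputFormat.setMaxInputSplitSize(job, 128L * 1024 * 1024);

            // ... set the mapper, reducer, and output settings as usual, then submit ...
        }
    }

With the default FileInputFormat, the effective split size works out to roughly max(minSize, min(maxSize, blockSize)), so unless these bounds are changed, each split typically corresponds to one HDFS block.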

However, we must strike a fine balance when working with input splits. A greater number of input splits means more map tasks for an MR job, and the job finishes faster because the map tasks work in parallel on individual splits. But too many input splits come with a corresponding increase in the overhead for ...