Basics
This lesson introduces the MapReduce paradigm to the reader.
Map and Reduce
MapReduce is a concatenation of “map” and “reduce,” which aptly describes the two phases it comprises. MapReduce is an implementation of the computing model introduced by Google, in which data-parallel computations are executed on clusters of unreliable machines by systems that automatically provide locality-aware scheduling, fault tolerance, and load balancing. In simpler terms, think of MapReduce as a divide-and-conquer strategy: a huge data set is divided among worker machines, and once processing is complete, the results from each machine are aggregated to produce the final answer. The data flow in the various phases of a MapReduce job is shown below.
MapReduce is a programming model used to process large data sets on a cluster of commodity machines by using a distributed algorithm.
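To make the two phases concrete, here is a minimal, single-machine sketch of a word-count job in plain Python. The names map_phase, shuffle, and reduce_phase are illustrative only and are not part of any framework's API; a real system such as Hadoop runs many map and reduce tasks in parallel on different machines and performs the shuffle over the network.

```python
from collections import defaultdict

def map_phase(document):
    """Map: emit an intermediate (key, value) pair for every word."""
    for word in document.split():
        yield (word.lower(), 1)

def shuffle(pairs):
    """Shuffle: group all intermediate values by key."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped

def reduce_phase(key, values):
    """Reduce: aggregate the values for one key into a final result."""
    return key, sum(values)

if __name__ == "__main__":
    documents = ["the quick brown fox", "the lazy dog", "the quick dog"]

    # Map each input split, then merge all intermediate pairs.
    intermediate = [pair for doc in documents for pair in map_phase(doc)]

    # Group by key and reduce each group independently.
    counts = dict(reduce_phase(k, v) for k, v in shuffle(intermediate).items())
    print(counts)  # {'the': 3, 'quick': 2, 'brown': 1, 'fox': 1, 'lazy': 1, 'dog': 2}
```

Because each reduce call depends only on the values for its own key, the reduce work can be spread across machines just as easily as the map work.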
For all its strengths, MapReduce is fundamentally a batch processing system and is not suitable for interactive analysis. You can’t run a query and get results back quickly. Queries typically take minutes or more, so it’s best suited to offline use, where no one is waiting on the results in real time.