DataFrames

This lesson discusses Spark DataFrames.

DataFrames

A DataFrame is the most common Structured API. It represents a table of data with rows and columns, where each column has a defined type maintained in a schema. You can think of a DataFrame as a spreadsheet that is too big to fit on a single machine, so its parts are spread across a cluster of machines. Even if the data could fit on a single machine, the desired computations might take too long, so the data is chunked and processed on multiple machines in parallel.

Another way to describe DataFrames is as distributed, table-like collections with well-defined rows and columns. Each column holds the same type of data across all rows, although a missing value can be indicated by a null. In a sense, DataFrames (and Datasets too) are lazily evaluated plans for performing operations on data distributed across the machines of a cluster.
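The following spark-shell snippet is a minimal sketch of these ideas; the column names and rows are illustrative, not taken from the lesson's data. Transformations only build up a plan, and an action triggers the actual execution.

// Assumes a spark-shell session, where `spark` and its implicits are available
import spark.implicits._

// A small, in-memory DataFrame with two typed columns
val people = Seq(("Alice", 34), ("Bob", 41)).toDF("name", "age")

// A transformation such as filter only builds up a plan; nothing executes yet
val adults = people.filter($"age" > 35)

// An action such as show triggers the actual execution of the plan
adults.show()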

A DataFrame is broken into smaller parts called partitions. A partition is a collection of rows from the parent DataFrame that resides on a particular physical machine in the cluster. A DataFrame’s partitions represent the data’s physical distribution across the cluster of machines. The number of partitions also dictates the parallelism achievable in a Spark job. With a single partition, only a single executor can process the data, even if several hundred executors are available. Similarly, with many partitions but only a single executor available, there is still no parallelism.

When working with DataFrames, partitions are never manually or individually manipulated. Instead, the user specifies higher-level data transformations that the Spark framework then applies to all partitions across the cluster.
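As a rough sketch, continuing with the hypothetical people DataFrame from above, we can inspect how many partitions back a DataFrame and redistribute the rows through a high-level transformation such as repartition:

// The underlying RDD exposes the current partition count
people.rdd.getNumPartitions

// repartition is a high-level transformation: Spark redistributes the rows
// across 8 partitions for us; we never touch individual partitions directly
val redistributed = people.repartition(8)
redistributed.rdd.getNumPartitions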

Schema

A schema defines the column names and types of a DataFrame. A schema can be defined manually or read in from the source. Spark also supports schema inference: it reads in a few rows and parses the types in those rows to map them to Spark types. We can examine the schema of a DataFrame object using the schema method.
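As a sketch of a manually defined schema (the column names and types below are illustrative, not the actual columns of cars.data):

import org.apache.spark.sql.types.{StructType, StructField, StringType, IntegerType}

// Column names, their Spark types, and whether nulls are allowed
val carSchema = StructType(Seq(
  StructField("make", StringType, nullable = true),
  StructField("horsepower", IntegerType, nullable = true)
))

// Supplying the schema up front skips the extra pass that inference requires
val cars = spark.read.schema(carSchema).csv("/DataJek/cars.data")
cars.schema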

Spark types

Spark uses an engine called Catalyst that maintains its own type information. The Spark types map to corresponding types in the supported languages (Java, Python, Scala, etc.), and Spark converts an expression written in one of these languages into an equivalent Catalyst representation of the same type. The Catalyst engine applies several optimizations and is continually improved to make execution faster.
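A small sketch of this mapping, again using the hypothetical people DataFrame from earlier: Scala literals passed to lit are converted to the corresponding Spark types.

import org.apache.spark.sql.functions.lit

// Int, String, and Double literals map to IntegerType, StringType, and DoubleType
val literals = people.select(lit(5), lit("five"), lit(5.0))
literals.schema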

Working with DataFrames

Let’s see an example that creates a DataFrame from a data file and lets Spark infer the schema. The code below can be executed in the terminal.

// Create a DataFrame from a data file and let Spark infer the schema
// The file is read as comma-separated values and doesn't contain a header row

val df = spark.read.option("inferSchema", true).option("header", false).csv("/DataJek/cars.data")

The Spark-inferred schema can be examined as follows:

df.schema
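
For a more readable, tree-formatted view of the same information, printSchema can be used:

df.printSchema()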

Schema inference is one example of the powerful, high-level APIs that Spark offers compared to other execution engines.

The exercise below demonstrates working with DataFrames.
