Apache Pig is a tool that reduces the complexity of writing a MapReduce program. It is used to analyze large data sets and represent them as data flows. Pig consists of a high-level language for expressing data analysis programs, and all data manipulation operations are carried out on Hadoop.
Pig Latin is the high-level language that Apache Pig provides for writing data analysis programs. It also provides operators for reading, writing, and processing data.
Pig Latin scripts are converted into Map and Reduce tasks with the aid of a component in Pig called the Pig Engine.
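To make the idea of Map and Reduce tasks concrete, the following is a minimal plain-Python sketch of the two phases for a word count. It is not Pig or Hadoop code: the function names (map_task, shuffle, reduce_task, run_job) are hypothetical, and everything runs in memory on a single machine.

```python
from collections import defaultdict


def map_task(line):
    """Map phase: emit a (word, 1) pair for every word in one input line."""
    return [(word.lower(), 1) for word in line.split()]


def shuffle(pairs):
    """Group intermediate values by key, standing in for the framework's shuffle."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_task(key, values):
    """Reduce phase: combine all values for one key into a single result."""
    return key, sum(values)


def run_job(lines):
    """Run map, shuffle, and reduce in sequence on an in-memory list of lines."""
    intermediate = [pair for line in lines for pair in map_task(line)]
    grouped = shuffle(intermediate)
    return dict(reduce_task(key, values) for key, values in grouped.items())


counts = run_job(["pig latin", "pig engine"])
print(counts)
```

Even this small job needs three hand-written functions and explicit wiring between them, which is the kind of effort a higher-level language is meant to remove.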
The components of Apache Pig that process the Pig Latin language through multiple layers are:
Parser: The parser accepts a program submitted by the user and performs a syntax check and type check. The output of this operation is a DAG (directed acyclic graph) that contains Pig Latin statements and logical operators.
Optimizer: This step passes the DAG to a logical optimizer for logical optimization.
Compiler: This is the compilation step where the optimized logical plan is compiled into MapReduce jobs.
Execution Engine: In this final step, the MapReduce jobs are submitted to Hadoop for execution. The desired data is sent to the user on completion.
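The DAG produced by the parser can be pictured as a set of operators, each depending on the operators that feed it. The sketch below is only an illustration of that idea in plain Python, using the standard library's graphlib module (Python 3.9 or later). The operator names and the dictionary layout are assumptions made for this example and do not reflect Pig's internal plan representation.

```python
from graphlib import TopologicalSorter

# Hypothetical logical plan: each operator maps to the operators it depends on.
logical_plan = {
    "load_users": set(),
    "load_orders": set(),
    "filter_orders": {"load_orders"},
    "join": {"load_users", "filter_orders"},
    "order_by": {"join"},
    "store": {"order_by"},
}


def execution_order(plan):
    """Return one valid order in which the operators could be evaluated."""
    return list(TopologicalSorter(plan).static_order())


order = execution_order(logical_plan)
print(order)
```

Because the graph has no cycles, every operator can be scheduled after its inputs, which is what allows a plan like this to be optimized and then compiled into a sequence of jobs.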
Operations such as joins, filtering, and ordering can be carried out easily, without writing MapReduce programs by hand.
Apache Pig has the following features:
It is extensible. Users can create their own functions for special-purpose processing like reading and writing data.
It supports a large range of data types and analyzes all kinds of data, both structured and unstructured.
It provides support for user-defined functions, which users can write in other programming languages such as Java.
It supports automatic optimization, so users need to focus only on the semantics of the language.
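The filter, join, and ordering operations and the user-defined functions described in this section can be sketched as a small data flow. The example below is plain Python rather than Pig Latin, so it shows only the shape of such a flow. The relations users and orders, their fields, and the shout function are hypothetical stand-ins invented for this illustration.

```python
# Hypothetical in-memory relations standing in for data loaded from storage.
users = [
    {"id": 1, "name": "ana"},
    {"id": 2, "name": "bo"},
    {"id": 3, "name": "cy"},
]
orders = [
    {"user_id": 1, "amount": 40},
    {"user_id": 2, "amount": 5},
    {"user_id": 3, "amount": 25},
]


def shout(name):
    """Stand-in for a user-defined function applied to one field."""
    return name.upper()


# Filter: keep only orders with an amount of at least 10.
large_orders = [o for o in orders if o["amount"] >= 10]

# Join: match each remaining order to its user by id.
users_by_id = {u["id"]: u for u in users}
joined = [
    {"name": users_by_id[o["user_id"]]["name"], "amount": o["amount"]}
    for o in large_orders
    if o["user_id"] in users_by_id
]

# Order: sort by amount, largest first, then apply the custom function.
result = [
    (shout(row["name"]), row["amount"])
    for row in sorted(joined, key=lambda row: row["amount"], reverse=True)
]
print(result)
```

Each step consumes the output of the previous one, which is the data-flow style that the article attributes to Pig programs.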