Pig: Overview

Explore Pig as a high-level language designed for parallel data processing on Hadoop. Understand how Pig Latin abstracts MapReduce complexities, execution modes including local, MapReduce, Spark, and Tez, and how to write and run scripts. Gain insights into Pig's role, its comparison with SQL, user-defined functions, and the use of debugging commands to optimize data workflows.

We'll cover the following...

Introduction
Execution modes
Example

Pig Latin
User defined functions
SQL vs Pig
Future of Pig

Introduction

Pig is a language for parallel data processing. A useful analogy is the relationship between higher-level programming languages and the assembly language they ultimately sit on top of. A program written in assembly can be just as good as one written in Java or C++, but it takes far more effort to write. Pig serves the same purpose as a higher-level language: it provides an abstraction over MapReduce and other frameworks so that data analysis jobs are easy to express. The MapReduce paradigm involves writing a map function followed by a reduce function, which can be challenging for a programmer to implement for complex workflows such as joins. Pig makes a join easy to express by hiding the underlying MapReduce complexity from the user.
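To make the join example concrete, here is a minimal sketch of a reduce-side inner join written by hand in the map-then-reduce style. It is plain Python that simulates the shuffle step in memory, not Hadoop code, and the datasets, field names, and function names (`map_users`, `map_orders`, `shuffle`, `reduce_join`, `run_join`) are all hypothetical and chosen only for illustration.

```python
from collections import defaultdict

# Hypothetical sample inputs; names and values are illustrative only.
users = [(1, "alice"), (2, "bob")]                         # (user_id, name)
orders = [(100, 1, 25.0), (101, 1, 10.0), (102, 2, 7.5)]   # (order_id, user_id, amount)


def map_users(record):
    user_id, name = record
    # Tag each value with its source so the reducer can tell the inputs apart.
    yield user_id, ("user", name)


def map_orders(record):
    order_id, user_id, amount = record
    yield user_id, ("order", (order_id, amount))


def shuffle(pairs):
    # Stand-in for the framework's shuffle phase: group all values by key.
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_join(user_id, values):
    names = [v for tag, v in values if tag == "user"]
    order_rows = [v for tag, v in values if tag == "order"]
    # Inner join: emit one row per (user, order) combination for this key.
    for name in names:
        for order_id, amount in order_rows:
            yield (user_id, name, order_id, amount)


def run_join(users, orders):
    pairs = [kv for record in users for kv in map_users(record)]
    pairs += [kv for record in orders for kv in map_orders(record)]
    grouped = shuffle(pairs)
    result = []
    for key in sorted(grouped):
        result.extend(reduce_join(key, grouped[key]))
    return result


joined = run_join(users, orders)
for row in joined:
    print(row)
```

Even in this toy form, the programmer has to tag records by source, group them by key, and pair them up in the reducer. In Pig Latin, assuming two relations named `users` and `orders` have already been loaded, the same intent fits in one statement along the lines of `joined = JOIN users BY user_id, orders BY user_id;`.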

Pig’s language layer consists of a textual language called Pig Latin, which is used to express dataflows. Pig’s infrastructure layer is the environment in which Pig Latin programs are executed. It consists of a compiler that translates Pig Latin into a sequence of jobs for an execution engine such as MapReduce, Spark, or Tez. Pig is not tied to a particular parallel framework, but it was first implemented on Hadoop. Originally developed at Yahoo ...
