How to Ingest Files: Part I
Learn how to ingest information from files in several formats using Spark.
Ingesting data from files in Spark
Ingesting data is the first stage of a Big Data pipeline and, in many cases, the initial step of a typical batch process. Spark offers developers a wide range of options for ingesting files in different formats.
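As a first taste, here is a minimal sketch of how the dedicated reader methods look for three common formats. The file names (data/books.csv, data/authors.json, data/sales.parquet) and the class name are placeholders for illustration, not files from this lesson's project.

```java
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SparkSession;

public class FormatOverviewApp {
  public static void main(String[] args) {
    SparkSession spark = SparkSession.builder()
        .appName("Format overview")
        .master("local[*]")
        .getOrCreate();

    // Each built-in format has its own reader method on spark.read().
    // The paths below are illustrative placeholders.
    Dataset<Row> csvDf = spark.read()
        .option("header", "true")          // first line holds column names
        .csv("data/books.csv");
    Dataset<Row> jsonDf = spark.read().json("data/authors.json");
    Dataset<Row> parquetDf = spark.read().parquet("data/sales.parquet");

    csvDf.printSchema();
    jsonDf.printSchema();
    parquetDf.printSchema();

    spark.stop();
  }
}
```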
The formats studied in this lesson are illustrated with a single project. Having a separate project for each format would be overkill, since the operations are fairly concise.
Spark handles loading internally through parsers, so let’s highlight the essential points of using a parser:
- The input of a parser is the path of a file. The path can also contain wildcards (glob patterns), which lets the developer load multiple files at once; see the sketch after this list.
- Parsers take options as extra arguments, which we showcase in our examples. The values of these options are case sensitive, so “myPath” and “mypath” are not considered the same.
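The sketch below puts both points together: a CSV reader that uses a wildcard path to pick up several files at once and passes a few options. The path, the option values, and the class name are illustrative assumptions, not part of the lesson's project.

```java
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SparkSession;

public class MultiFileCsvApp {
  public static void main(String[] args) {
    SparkSession spark = SparkSession.builder()
        .appName("CSV ingestion with a wildcard path")
        .master("local[*]")
        .getOrCreate();

    // The wildcard "data/orders_*.csv" matches every file whose name
    // starts with "orders_", so several files land in one dataframe.
    // Path and option values are illustrative only.
    Dataset<Row> df = spark.read()
        .option("header", "true")           // first line holds column names
        .option("inferSchema", "true")      // let Spark guess column types
        .option("dateFormat", "MM/dd/yyyy") // option value is case sensitive
        .csv("data/orders_*.csv");

    df.show(5);
    spark.stop();
  }
}
```

The dateFormat value illustrates why case matters: written as MM/dd/yyyy it denotes months, while mm/dd/yyyy would denote minutes, producing very different results.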