PySpark is the Python API for Apache Spark. It lets you write Spark applications in Python and provides the PySpark shell for interactively analyzing your data in a distributed environment.
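As a quick, hedged illustration (assuming a local installation where the pyspark launcher is on the PATH), the shell pre-creates a SparkContext named sc, so you can experiment with distributed collections immediately:

# Launched from a terminal with: pyspark
# Inside the shell, sc (SparkContext) is already defined
rdd = sc.parallelize(range(1, 101))              # distribute a small collection
print(rdd.filter(lambda x: x % 2 == 0).count())  # count the even numbers -> 50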
Spark SQL is the Spark module for structured data processing. It provides a programming abstraction called DataFrame and acts as a distributed SQL query engine.
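For example, here is a minimal sketch (using made-up rows, not the songs dataset used later in this article) of building a DataFrame and querying it through sqlContext:

# Build a small DataFrame from local, illustrative rows
demo = sqlContext.createDataFrame(
    [("Song A", 210), ("Song B", 185), ("Song C", 240)],
    ["title", "duration"])
demo.registerTempTable("demo_songs")                           # expose it to SQL
sqlContext.sql("select avg(duration) from demo_songs").show()  # run a distributed query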
Running on top of Spark, Spark Streaming enables powerful interactive and analytical applications over both streaming and historical data, while inheriting Spark's ease of use and fault-tolerance characteristics.
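As a rough sketch (the socket source on localhost:9999 is purely hypothetical), a streaming word count with the DStream API looks like this:

from pyspark.streaming import StreamingContext

ssc = StreamingContext(sc, batchDuration=10)        # 10-second micro-batches
lines = ssc.socketTextStream("localhost", 9999)     # hypothetical text source
counts = (lines.flatMap(lambda line: line.split(" "))
               .map(lambda word: (word, 1))
               .reduceByKey(lambda a, b: a + b))
counts.pprint()                                     # print each batch's word counts
ssc.start()
ssc.awaitTermination()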
Built on top of Spark, MLlib is a scalable machine learning library that provides a uniform set of high-level APIs to help users create and tune practical machine learning pipelines.
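A minimal pipeline sketch (with toy training rows, not part of the songs dataset) might tokenize text, hash it into features, and fit a logistic regression:

from pyspark.ml import Pipeline
from pyspark.ml.feature import Tokenizer, HashingTF
from pyspark.ml.classification import LogisticRegression

# Toy training data: (id, text, label) -- illustrative only
training = sqlContext.createDataFrame(
    [(0, "spark is fast", 1.0), (1, "slow batch job", 0.0)],
    ["id", "text", "label"])

tokenizer = Tokenizer(inputCol="text", outputCol="words")
hashingTF = HashingTF(inputCol="words", outputCol="features")
lr = LogisticRegression(maxIter=10)

# Chain the stages into one pipeline and fit it as a single unit
pipeline = Pipeline(stages=[tokenizer, hashingTF, lr])
model = pipeline.fit(training)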
Spark Core is the general execution engine on which all other functionality is built. It provides the Resilient Distributed Dataset (RDD) abstraction and in-memory computing capabilities.
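For instance, a minimal RDD sketch that distributes a collection, caches it in memory, and reduces it back to the driver:

nums = sc.parallelize(range(1, 1001))      # distribute a local collection
squares = nums.map(lambda x: x * x)        # lazily defined transformation
squares.cache()                            # keep the results in memory for reuse
print(squares.reduce(lambda a, b: a + b))  # action: sum of the squares
print(squares.count())                     # second action reuses the cached data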
Databricks is a company and big data processing platform founded by the creators of Apache Spark. It was created to help data scientists, engineers, and analysts collaborate across data science, engineering, and the business behind them throughout the machine learning lifecycle.
In the first stage, we load some distributed data and read it in as an RDD.
Files can be listed on a distributed file system (DBFS, S3, or HDFS) using %fs
commands.
We are using data files stored in dbfs:/databricks-datasets/songs_pk/data_001
for this example.
DBFS is the Databricks File System, a layer that builds on AWS S3 and the SSD drives attached to the Spark clusters hosted in AWS.
When accessing a file, DBFS first checks whether the file is cached on the SSD drives before reading it from S3.

%fs ls /databricks-datasets/songs_pk/data_001/
We have several data files and one header file in the list. Let's use the PySpark textFile command to read the header file, then use collect to display its contents. After running this, you will see that each header line consists of a field name and a type, separated by a colon:
sc.textFile("databricks-datasets/songs_pk/data_001/header.txt").collect()
The textFile command is used to load the data files, and take is used to display the first three lines of the data.
After running this, you will see that each line consists of several fields separated by a tab (\t).
dataRDD = sc.textFile("/databricks-datasets/songs_pk/data_001/part-000*")
dataRDD.take(3)
Next, we parse the data using what we learned from the header. To do this, we write a function that takes a line of text and returns an array of parsed fields:
If the header says the type is int, we cast the token to an integer.
If the header says the type is double, we cast the token to a float.
Otherwise, we return the string as-is.
# Divide the header by its separator
header = sc.textFile("/databricks-datasets/songs_pk/data_001/header.txt").map(lambda line: line.split(":")).collect()

# Create the Python function that parses one tab-separated line
def parse_line(line):
    tokens = zip(line.split("\t"), header)
    parsed_tokens = []
    for token in tokens:
        token_type = token[1][1]
        if token_type == 'double':
            parsed_tokens.append(float(token[0]))
        elif token_type == 'int':
            parsed_tokens.append(-1 if '-' in token[0] else int(token[0]))
        else:
            parsed_tokens.append(token[0])
    return parsed_tokens
Before we can use the parsed header, we have to convert it to the types that Spark SQL expects. That means using the SQL types:
IntegerType
DoubleType
StringType
StructType (instead of a normal Python list)

from pyspark.sql.types import *

def str_to_type(type_str):
    if type_str == 'int':
        return IntegerType()
    elif type_str == 'double':
        return DoubleType()
    else:
        return StringType()

schema = StructType([StructField(t[0], str_to_type(t[1]), True) for t in header])
Spark's createDataFrame() method is used to combine the schema and the parsed data to construct a DataFrame. DataFrames are preferred because they are easier to manipulate and because Spark knows the types of the data and can process them more efficiently.
df = sqlContext.createDataFrame(dataRDD.map(parse_line), schema)
Now that we have a DataFrame, we can register it as a temporary table, which allows us to refer to it by name in SQL queries.
df.registerTempTable("songs_pkTable")
Since we will be accessing this data multiple times, let's cache it in memory for faster subsequent access.
%sql cache table songs_pkTable
We can now query our data using the temporary table we created and cached in memory. Because it is registered as a table, we can use SQL as well as the Spark API to access it (a Spark API equivalent is sketched after the SQL query below).
%sql select * from songs_pkTable limit 15
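The same result can be fetched through the DataFrame API instead of SQL (a small sketch, assuming the table registered above):

# DataFrame API equivalent of the SQL query above
display(sqlContext.table("songs_pkTable").limit(15))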
table("songs_pkTable").printSchema()
%sql select count(*) from songs_pkTable
A natural question is: how do various attributes of songs change over time? For instance, how did the average song duration change over the years?
We begin by importing ggplot, which makes plotting data easy in Python. Next, we put together the SQL query that pulls the required data from the table. We convert the result to a pandas DataFrame with toPandas(), then use the display method to render the graph.
from ggplot import *

baseQuery = sqlContext.sql("select avg(duration) as duration, year from songs_pkTable group by year")
df_filtered = baseQuery.filter(baseQuery.year > 0).filter(baseQuery.year < 2010).toPandas()
plot = ggplot(df_filtered, aes('year', 'duration')) + geom_point() + geom_line(color='blue')
display(plot)