How to deal with missing data in Koalas

Koalas is a Python package for doing data science on big data. Its mechanism is simple: it offers a familiar DataFrame API and runs the work on a distributed engine.

Koalas implements the pandas DataFrame API on top of Apache Spark. This makes life easier for data scientists who regularly work with big data. pandas itself is widely used in data science. The key difference between the two is that pandas provides a single-node DataFrame implementation, whereas Spark is the standard for distributed big data processing.

The Koalas package lets anyone with pandas experience start working with Spark right away. It also makes it possible to keep a single codebase that works with both Spark and pandas.
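
As a rough illustration of the single-codebase idea, the sketch below defines a hypothetical helper, count_missing, that relies only on isnull() and sum(). It is run here on a plain pandas DataFrame so that it works without a Spark installation; the assumption is that a Koalas DataFrame passed to the same function would behave the same way.

```python
import numpy as np
import pandas as pd


def count_missing(df):
    """Return the number of missing values in each column.

    Hypothetical helper: it only calls isnull() and sum(), which are
    assumed to behave the same on a Koalas DataFrame as on a pandas one.
    """
    return df.isnull().sum()


# Small pandas DataFrame with one missing value in 'hundred'
pandas_df = pd.DataFrame({"unit": [1, 2, 3], "hundred": [100, np.nan, 300]})

missing = count_missing(pandas_df)
print(missing)
```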

Dealing with missing data

Just like in pandas, there are two ways to deal with NaN values in a Koalas DataFrame:

Drop the rows
Fill in missing values

When a DataFrame contains missing values, they are represented as NaN:

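import numpy as np
import databricks.koalas as ks

# 'hundred' has two missing entries, marked with np.nan
koalas_df = ks.DataFrame(
    {'unit': [1, 2, 3, 4, 5, 6],
     'hundred': [100, 200, 300, 400, np.nan, np.nan],
     'english': ["one", "two", "three", "four", "five", "six"]},
    index=[1, 2, 3, 4, 5, 6])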

>> koalas_df

   unit  hundred english
1     1    100.0     one
2     2    200.0     two
3     3    300.0   three
4     4    400.0    four
5     5      NaN    five
6     6      NaN     six


# Drop any rows that have missing data.

>> koalas_df.dropna(how='any')

   unit  hundred english
1     1    100.0     one
2     2    200.0     two
3     3    300.0   three
4     4    400.0    four
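
The how argument controls when a row counts as missing, and subset restricts the check to chosen columns. The sketch below compares these options on a small, made-up DataFrame. It uses plain pandas so that it runs locally; the assumption is that the same dropna calls behave identically on a Koalas DataFrame.

```python
import numpy as np
import pandas as pd

# Illustrative data (not the article's table): row 3 has two missing fields
df = pd.DataFrame(
    {"unit": [1, 2, 3],
     "hundred": [100, np.nan, np.nan],
     "english": ["one", "two", None]},
    index=[1, 2, 3])

# how='any': drop a row if at least one field is missing
any_missing = df.dropna(how="any")

# how='all': drop a row only if every field is missing
all_missing = df.dropna(how="all")

# subset: only look at the 'english' column when deciding
by_column = df.dropna(subset=["english"])

print(list(any_missing.index), list(all_missing.index), list(by_column.index))
```

Here how='all' removes nothing, because no row is entirely empty.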

# Fill missing data with a fixed value.

>> koalas_df.fillna(value=500)

   unit  hundred english
1     1    100.0     one
2     2    200.0     two
3     3    300.0   three
4     4    400.0    four
5     5    500.0    five
6     6    500.0     six
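
A single fill value is not always appropriate when columns hold different kinds of data. As a minimal sketch, fillna also accepts a dictionary that maps column names to replacement values. The data and the "unknown" placeholder below are made up for illustration, and plain pandas is used so that the example runs without Spark; the same call is assumed to work on a Koalas DataFrame.

```python
import numpy as np
import pandas as pd

# Illustrative data: a missing number and a missing string
df = pd.DataFrame(
    {"unit": [1, 2, 3],
     "hundred": [100, 200, np.nan],
     "english": ["one", None, "three"]},
    index=[1, 2, 3])

# Give each column its own replacement value
filled = df.fillna(value={"hundred": 500, "english": "unknown"})

print(filled)
```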
