Koalas is a useful package for data science and big data work in Python. The idea behind it is simple.
Koalas implements the pandas DataFrame API on top of Apache Spark. This makes life easier for data scientists who regularly work with big data. pandas itself is widely used in data science. The key difference between the two is that a pandas DataFrame lives on a single node, whereas Spark distributes the data across a cluster and is the de facto standard for big data processing.
The Koalas package lets anyone with pandas experience start working with Spark right away. It also makes it possible to keep a single codebase that works with both Spark and pandas.
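To make the single-codebase idea concrete, here is a minimal sketch. The helper name `count_missing` is hypothetical, and the example runs on plain pandas so that it needs no Spark installation. Because the helper only calls `isnull()` and `sum()`, it is assumed that a Koalas DataFrame could be passed in the same way.

```python
import pandas as pd


def count_missing(df):
    """Count missing values per column.

    Uses only isnull() and sum(), which pandas provides and which Koalas
    is assumed to mirror, so no library-specific branch is needed.
    """
    return df.isnull().sum()


# Demonstrated with pandas; swapping in a Koalas DataFrame is an assumption.
pdf = pd.DataFrame({"unit": [1, 2, 3], "hundred": [100.0, None, None]})
missing = count_missing(pdf)
print(missing)
```

Keeping such helpers free of library-specific calls is what allows the same function to be reused on small local data and on Spark-backed data.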
Just like in pandas, there are two ways to deal with NaN values in a Koalas DataFrame: drop the rows that contain them with dropna(), or replace them with a chosen value using fillna(). Suppose you are given a DataFrame in which some values are missing and therefore appear as NaN:
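import numpy as np
import databricks.koalas as ks

# 'hundred' has no value for the last two rows, so mark them as NaN explicitly
koalas_df = ks.DataFrame(
    {'unit': [1, 2, 3, 4, 5, 6],
     'hundred': [100, 200, 300, 400, np.nan, np.nan],
     'english': ["one", "two", "three", "four", "five", "six"]},
    index=[1, 2, 3, 4, 5, 6])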
>>> koalas_df
   unit  hundred english
1     1    100.0     one
2     2    200.0     two
3     3    300.0   three
4     4    400.0    four
5     5      NaN    five
6     6      NaN     six
# Drop any rows that have missing data.
>>> koalas_df.dropna(how='any')
   unit  hundred english
1     1    100.0     one
2     2    200.0     two
3     3    300.0   three
4     4    400.0    four
# Fill missing data with a constant value.
>>> koalas_df.fillna(value=500)
   unit  hundred english
1     1    100.0     one
2     2    200.0     two
3     3    300.0   three
4     4    400.0    four
5     5    500.0    five
6     6    500.0     six
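Both calls can also be narrowed to particular columns. The sketch below uses plain pandas so that it runs without Spark. `subset` and a per-column dictionary for `value` are standard pandas parameters; that your Koalas version accepts them in the same way is an assumption worth checking against its documentation. The extra missing entry in 'english' is added here purely for illustration.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(
    {"unit": [1, 2, 3, 4, 5, 6],
     "hundred": [100, 200, 300, 400, np.nan, np.nan],
     "english": ["one", "two", "three", "four", "five", None]},
    index=[1, 2, 3, 4, 5, 6])

# Drop a row only when 'hundred' is missing; gaps in other columns are ignored.
dropped = df.dropna(subset=["hundred"])

# Fill each column with its own replacement value.
filled = df.fillna(value={"hundred": 0, "english": "unknown"})

print(dropped)
print(filled)
```

Using `subset` avoids discarding rows over columns you do not care about, and a dictionary for `value` keeps numeric and text columns from sharing one placeholder.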