Koalas is an important package for data science and big data work in Python. It implements the pandas DataFrame API on top of Apache Spark, which makes life easier for data scientists who constantly work with big data.
pandas itself is widely used in data science. The key difference between the two is that pandas is a single-node DataFrame implementation, whereas Spark distributes work across a cluster and is the standard for big data processing.
The Koalas package lets a user with pandas experience start working with Spark immediately. Additionally, it provides a single codebase that works with both Spark and pandas.
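As a minimal sketch of the single-codebase idea, the helper below uses only boolean indexing and sum(), two operations from the pandas API. The function name total_above and the sample data are hypothetical, chosen for illustration. The example runs on a plain pandas DataFrame; passing a Koalas DataFrame instead is assumed to work the same way, since Koalas mirrors these pandas calls, but that path is not exercised here.

```python
import pandas as pd


def total_above(df, min_unit):
    """Sum the 'hundred' column for rows whose 'unit' exceeds min_unit.

    Hypothetical helper: it relies only on boolean indexing and sum(),
    which pandas provides and which Koalas is assumed to mirror.
    """
    filtered = df[df["unit"] > min_unit]
    return int(filtered["hundred"].sum())


# Demonstrated with pandas only; no Spark session is needed for this path.
pandas_df = pd.DataFrame(
    {"unit": [1, 2, 3, 4, 5, 6],
     "hundred": [100, 200, 300, 400, 500, 600]})

total = total_above(pandas_df, 2)
print(total)
```

Because the function never checks which library built the DataFrame, the same call site can be kept when the data moves from a single machine to Spark.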
There are several ways to create a Koalas object. Let’s explore them below.
Pass a list of values to create a Koalas series, as shown below:
# import the relevant libraries
import pandas as pd
import numpy as np
import databricks.koalas as ks
from pyspark.sql import SparkSession
series = ks.Series([1,2,3,4,5,6,7,8])
Create a DataFrame from a dictionary that maps column names to lists of values:
koalas_df = ks.DataFrame(
    {'unit': [1, 2, 3, 4, 5, 6],
     'hundred': [100, 200, 300, 400, 500, 600],
     'english': ["one", "two", "three", "four", "five", "six"]})
This creates a Koalas DataFrame named koalas_df.
We can convert a pandas DataFrame to a Koalas DataFrame as shown:
# build a regular pandas DataFrame
df = pd.DataFrame(
    {'unit': [1, 2, 3, 4, 5, 6],
     'hundred': [100, 200, 300, 400, 500, 600],
     'english': ["one", "two", "three", "four", "five", "six"]})
# convert the pandas DataFrame to a Koalas DataFrame
koalas_df = ks.from_pandas(df)