Data preprocessing transforms raw data into a format that computers can understand and analyze. It is one of the crucial steps when working with data: it converts raw data into a form suitable for data analysis or into the shape a specific machine learning model requires.
In this Answer, we’ll learn to perform data scaling and normalization using Polars so that the data can be scaled and normalized to produce desired outcomes.
Scaling refers to transforming data to a specified range to ensure that all features are on the same scale or equal footing for further analysis.
Let’s perform min-max scaling on randomly generated data using Polars in Python. Min-max scaling is the simplest method of rescaling feature values to the [0, 1] range. The general formula for min-max scaling is given as:

x_scaled = (x - x_min) / (x_max - x_min)
```python
import numpy as np
import polars as pl

# Creating a Polars DataFrame of random numbers
pl_df = pl.DataFrame(np.random.randint(0, 100, size=25))
print(pl_df)

# Using Polars to scale the data to the range [0, 1]
scaled_pl_df = pl_df.select((pl.all() - pl.all().min()) / (pl.all().max() - pl.all().min()))
print(scaled_pl_df)
```
In the given code:
Lines 1–2: We import the numpy library as np and the polars library as pl.
Line 5: We create a Polars DataFrame called pl_df by using NumPy’s random.randint function to generate an array of 25 random whole numbers between 0 and 100 (excluding 100).
Line 6: We print pl_df to display the randomly generated data.
Line 9: We perform a scaling operation on pl_df using Polars. The operation calculates a scaled value for each element in the DataFrame to transform the data to the [0, 1] range:
- pl.all() selects every column in the DataFrame (here, the single column of random numbers).
- pl.all().min() finds the minimum value among all the elements in the DataFrame.
- pl.all().max() finds the maximum value among all the elements in the DataFrame.
- (pl.all() - pl.all().min()) / (pl.all().max() - pl.all().min()) scales each element by subtracting the minimum value from it and dividing the result by the difference between the maximum and minimum values.
Line 10: We print scaled_pl_df, which displays the scaled values of the original random data, ensuring that they fall within the [0, 1] range.
Normalization refers to reshaping the data distribution onto a common scale, making it easier to compare and analyze different features on the same grounds.
Let’s normalize data by employing the z-score technique on the randomly generated data using Polars in Python as follows:
```python
import numpy as np
import polars as pl

# Creating a Polars DataFrame of random numbers
pl_df = pl.DataFrame(np.random.randint(0, 100, size=25))
print(pl_df)

# Using Polars to normalize the data to a mean of 0 and a standard deviation of 1
normalized_pl_df = pl_df.select((pl.all() - pl.all().mean()) / pl.all().std())
print(normalized_pl_df)
```
In the given code:
Line 9: We perform a normalization operation on pl_df using Polars. The operation calculates a normalized value for each element in the DataFrame to transform the data so that it has a mean of 0 and a standard deviation of 1:
- pl.all() selects every column in the DataFrame (here, the single column of random numbers).
- pl.all().mean() calculates the mean of all the elements in the DataFrame.
- pl.all().std() calculates the standard deviation of all the elements in the DataFrame.
- (pl.all() - pl.all().mean()) / pl.all().std() normalizes each element by subtracting the mean from it and dividing the result by the standard deviation.
Line 10: We print normalized_pl_df, which displays the normalized values of the original random data, ensuring that they have a mean of 0 and a standard deviation of 1.
In conclusion, scaling and normalizing data using Polars in Python is simple and allows us to prepare data for machine learning or statistical analysis. Polars provides practical solutions for data professionals working on data preprocessing for predictive modeling or exploratory data analysis.