Data preprocessing transforms raw data into a format that computers can understand and analyze. It is one of the crucial steps when working with data: it converts raw data into a form suitable for data analysis or into the shape a specific machine learning model requires.
In this Answer, we’ll learn to perform data scaling and normalization using Polars so that the data can be scaled and normalized to produce desired outcomes.
Scaling refers to transforming data to a specified range to ensure that all features are on the same scale or equal footing for further analysis.
Let’s perform min-max scaling on randomly generated data using Polars in Python. Min-max scaling is the simplest method of rescaling feature values to the [0, 1] range. The general formula for min-max scaling is given as:

x_scaled = (x - x_min) / (x_max - x_min)
```python
import numpy as np
import polars as pl

# Creating a Polars DataFrame of random numbers
pl_df = pl.DataFrame(np.random.randint(0, 100, size=25))
print(pl_df)

# Using Polars to scale the data to the range [0, 1]
scaled_pl_df = pl_df.select((pl.all() - pl.all().min()) / (pl.all().max() - pl.all().min()))
print(scaled_pl_df)
```
In the given code:
Lines 1–2: We import the numpy library as np and the polars library as pl.
Line 5: We create a Polars DataFrame called pl_df by using NumPy’s random.randint function to generate an array of 25 random whole numbers between 0 and 100 (excluding 100).
Line 6: We print pl_df to display the randomly generated data.
Line 9: We perform a scaling operation on pl_df using Polars. The operation calculates a scaled value for each element in the DataFrame to transform the data to the [0, 1] range:
- pl.all() selects every column in the DataFrame (here, the single column of random numbers).
- pl.all().min() finds the minimum value among all the elements in the DataFrame.
- pl.all().max() finds the maximum value among all the elements in the DataFrame.
- (pl.all() - pl.all().min()) / (pl.all().max() - pl.all().min()) scales each element by subtracting the minimum value from it and dividing the result by the difference between the maximum and minimum values.
Line 10: We print scaled_pl_df, which displays the scaled values of the original random data, ensuring that they fall within the [0, 1] range.
Normalization refers to reshaping the data distribution onto a common scale, making it easier to compare and analyze different features on the same grounds.
Let’s normalize data by employing the z-score technique on the randomly generated data using Polars in Python as follows:
```python
import numpy as np
import polars as pl

# Creating a Polars DataFrame of random numbers
pl_df = pl.DataFrame(np.random.randint(0, 100, size=25))
print(pl_df)

# Using Polars to normalize the data to a mean of 0 and a standard deviation of 1
normalized_pl_df = pl_df.select((pl.all() - pl.all().mean()) / pl.all().std())
print(normalized_pl_df)
```
In the given code:
Line 9: We perform a normalization operation on pl_df using Polars. The operation calculates a normalized value for each element in the DataFrame to transform the data so that it has a mean of 0 and a standard deviation of 1:
- pl.all() selects every column in the DataFrame (here, the single column of random numbers).
- pl.all().mean() calculates the mean of all the elements in the DataFrame.
- pl.all().std() calculates the standard deviation of all the elements in the DataFrame.
- (pl.all() - pl.all().mean()) / pl.all().std() normalizes each element by subtracting the mean from it and dividing the result by the standard deviation.
Line 10: We print normalized_pl_df, which displays the normalized values of the original random data, ensuring that they have a mean of 0 and a standard deviation of 1.
In conclusion, scaling and normalizing data using Polars in Python is simple and allows us to prepare data for machine learning or statistical analysis. Polars provides practical solutions for data professionals working on data preprocessing for predictive modeling or exploratory data analysis.