How to scale and normalize data in Python using Polars

Data preprocessing transforms raw data into a form suitable for analysis or for the requirements of a specific machine learning model. It’s one of the crucial steps to perform while working with data.

In this Answer, we’ll learn to scale and normalize data using Polars.

Scaling

Scaling refers to transforming data to a specified range to ensure that all features are on the same scale or equal footing for further analysis.

Let’s perform min-max scaling on randomly generated data using Polars in Python. Min-max scaling is the simplest method of rescaling feature values to the [0, 1] range. The general formula for min-max scaling is $x' = (x - \min(x)) / (\max(x) - \min(x))$, where $\min(x)$ and $\max(x)$ are the smallest and largest values of the feature.

import numpy as np
import polars as pl

# Creating a Polars DataFrame of random numbers
pl_df = pl.DataFrame(np.random.randint(0, 100, size=25))
print(pl_df)

# Using Polars to scale the data to range [0, 1]
scaled_pl_df = pl_df.select((pl.all() - pl.all().min()) / (pl.all().max() - pl.all().min()))
print(scaled_pl_df)

In the given code:

  • Lines 1–2: We import the numpy library as np and the polars library as pl.

  • Line 5: We make a Polars DataFrame called pl_df by using NumPy’s random.randint function to create an array of 25 random whole numbers between 0 and 100 (excluding 100).

  • Line 6: We print pl_df to display the randomly generated data.

  • Line 9: We perform a scaling operation on pl_df using Polars. The operation calculates a scaled value for each element in the DataFrame to transform the data to the [0, 1] range. Here’s the breakdown of this operation:

    • pl.all() selects all the columns in the DataFrame.

    • pl.all().min() finds the minimum value of each column.

    • pl.all().max() finds the maximum value of each column.

    • (pl.all() - pl.all().min()) / (pl.all().max() - pl.all().min()) scales each element by subtracting its column’s minimum from it and dividing the result by the difference between that column’s maximum and minimum. Because pl_df has a single column here, these are the minimum and maximum of all the data.

  • Line 10: We print scaled_pl_df, which will display the scaled values of the original random data, ensuring that they fall within the [0, 1] range.

Normalization

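Normalization, in the z-score sense used here, refers to recentering and rescaling data so that each feature has a mean of 0 and a standard deviation of 1. Unlike min-max scaling, it doesn’t confine the values to a fixed range, but it preserves the shape of the distribution, which makes features measured on different scales easier to compare and analyze on the same grounds.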
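For reference, the z-score of a value $x$ in a column with mean $\mu$ and standard deviation $\sigma$ is commonly written as follows (the symbols $z$, $\mu$, and $\sigma$ are our notation and don't appear in the code):

```latex
z = \frac{x - \mu}{\sigma}
```

For example, in a column with $\mu = 50$ and $\sigma = 10$, the value $x = 65$ maps to $z = (65 - 50) / 10 = 1.5$.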

Let’s normalize data by employing the z-score technique on the randomly generated data using Polars in Python as follows:

import numpy as np
import polars as pl

# Creating a Polars DataFrame of random numbers
pl_df = pl.DataFrame(np.random.randint(0, 100, size=25))
print(pl_df)

# Using Polars to normalize the data, having a mean of 0 and a standard deviation of 1
normalized_pl_df = pl_df.select((pl.all() - pl.all().mean()) / pl.all().std())
print(normalized_pl_df)

In the given code:

  • Line 9: We perform a normalization operation on pl_df using Polars. The operation calculates a normalized value for each element in the DataFrame to transform the data so that it has a mean of 0 and a standard deviation of 1. Here’s the breakdown of this operation:

    • pl.all() selects all the columns in the DataFrame.

    • pl.all().mean() calculates the mean value of each column.

    • pl.all().std() calculates the standard deviation of each column. By default, Polars computes the sample standard deviation (ddof=1).

    • (pl.all() - pl.all().mean()) / pl.all().std() normalizes each element by subtracting its column’s mean from it and dividing the result by that column’s standard deviation.

  • Line 10: We print normalized_pl_df, which will display the normalized values of the original random data, ensuring that they have a mean of 0 and a standard deviation of 1.

Conclusion

In conclusion, Polars lets us scale and normalize data with a single expression each: min-max scaling maps the values to the [0, 1] range, and z-score normalization gives them a mean of 0 and a standard deviation of 1. Both are common preprocessing steps when preparing data for machine learning or statistical analysis.

Copyright ©2024 Educative, Inc. All rights reserved