Label encoding in Python

Label encoding is a data preprocessing technique used in machine learning projects that converts the values of a categorical column into numerical values. It is useful when we need to fit our data to a machine learning model that only takes numerical values.

In this Answer, we will explore how to convert categorical string data into numerical values using scikit-learn's LabelEncoder class.

Example

Before getting into the coding part and using scikit-learn, let us first understand the result of performing label encoding on a dataset. For that, let us consider an example dataset of fruits along with their prices. The dataset is shown below:

Dataset

Fruit     Price ($)
Apple     2
Banana    3
Orange    4
Banana    3
Apple     2

As we can see, the dataset contains two columns: "Fruit" and "Price ($)". If we want to fit this dataset to a machine learning model that only takes numerical values, we need to apply label encoding to the "Fruit" column. The result of applying label encoding will be:

Label encoded dataset

Fruit     Price ($)
0         2
1         3
2         4
1         3
0         2

The output shows that the values of the "Fruit" column have been converted into integers starting from 0. The values are not assigned at random: the unique categories are sorted (alphabetically for strings), and each category receives its position in that sorted order.
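To make the rule concrete, here is a minimal plain-Python sketch of the same idea, written without scikit-learn. The variable names (`categories`, `mapping`, `encoded`) are illustrative and not part of any library.

```python
# Plain-Python sketch of what label encoding does (no scikit-learn involved).
fruits = ["Apple", "Banana", "Orange", "Banana", "Apple"]

# Sort the unique categories, then number them starting from 0.
categories = sorted(set(fruits))
mapping = {category: code for code, category in enumerate(categories)}

# Replace every value with the code of its category.
encoded = [mapping[fruit] for fruit in fruits]

print(mapping)  # {'Apple': 0, 'Banana': 1, 'Orange': 2}
print(encoded)  # [0, 1, 2, 1, 0]
```

The encoded list matches the "Fruit" column of the label encoded dataset shown above.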

Encoding a column

Now, we will look at how to encode a column of a dataset. We will create a DataFrame from the data given above and encode its fruit column, which is named Fruits in the code. The code is shown below:

import pandas as pd
from sklearn.preprocessing import LabelEncoder

fruits = pd.DataFrame({
    'Fruits': ["Apple", "Banana", "Orange", "Banana", "Apple"],
    'Price ($)': [2, 3, 4, 3, 2]
})
encoder = LabelEncoder()
encoded_col = encoder.fit_transform(fruits["Fruits"])
fruits['Fruits'] = encoded_col
print(fruits)
Code example for label encoding in Python

When we run the code, the Fruits column is replaced with its encoded values, so the dataset can now be fitted to a machine learning model that only takes numerical values.

Note: scikit-learn's LabelEncoder encodes only a single column (a 1-dimensional array) at a time. The scikit-learn documentation describes it as intended for target labels and points to OrdinalEncoder for encoding input features.

The code above is explained below:

  • Line 1: We import the pandas library, which is used to create the DataFrame.

  • Line 2: We import the LabelEncoder class from the sklearn.preprocessing package.

  • Lines 4–7: We create a DataFrame with the example data from the previous section.

  • Line 8: We create an instance of the LabelEncoder class and store it in the encoder variable.

  • Line 9: We call the fit_transform method of the encoder object and pass it the 1-dimensional array to be encoded. We store the encoded array in the encoded_col variable.

  • Line 10: We replace the Fruits column data with the encoded_col data.

  • Line 11: We display the updated DataFrame with the label-encoded column.
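Since LabelEncoder works on one column at a time, a dataset with several categorical columns needs one encoder per column. The sketch below shows one way to do that with a loop. The "Color" column and the `encoders` dictionary are hypothetical additions for illustration and are not part of the example dataset. Keeping each fitted encoder also lets us inspect the learned categories through `classes_` and map the codes back to the original strings with `inverse_transform`.

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# "Color" is a hypothetical second categorical column added for illustration.
df = pd.DataFrame({
    "Fruits": ["Apple", "Banana", "Orange", "Banana", "Apple"],
    "Color": ["Red", "Yellow", "Orange", "Yellow", "Green"],
    "Price ($)": [2, 3, 4, 3, 2],
})

# Fit a separate encoder for each categorical column and keep it for later.
encoders = {}
for column in ["Fruits", "Color"]:
    encoder = LabelEncoder()
    df[column] = encoder.fit_transform(df[column])
    encoders[column] = encoder

print(df)

# The sorted categories learned for the Fruits column.
print(list(encoders["Fruits"].classes_))

# Convert the integer codes back to the original labels.
decoded = encoders["Fruits"].inverse_transform(df["Fruits"].to_numpy())
print(list(decoded))
```

Storing the encoders in a dictionary is a design choice for this sketch: the same fitted encoder must be reused if new data is to be encoded with the same mapping.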

Limitation

The main limitation of label encoding is that the integers it assigns imply an order. Because the categories are numbered starting from 0, a model may treat a category with a larger number as having a higher priority than a category with a smaller number, even when no such ranking exists.

For example, in our dataset, Apple is encoded as 0 and Orange is encoded as 2, but there is no priority relation between the two fruits.
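One common way to avoid this implied ordering, not covered further in this Answer, is one-hot encoding, which gives each category its own 0/1 column instead of a single ranked number. The following is a minimal sketch using pandas' `get_dummies` function on the same example data. It is offered as an illustration of the alternative, not as a recommendation for every dataset.

```python
import pandas as pd

fruits = pd.DataFrame({
    "Fruits": ["Apple", "Banana", "Orange", "Banana", "Apple"],
    "Price ($)": [2, 3, 4, 3, 2],
})

# Create one indicator column per fruit, so no fruit ranks above another.
one_hot = pd.get_dummies(fruits, columns=["Fruits"], dtype=int)
print(one_hot)
```

Each row has a 1 in exactly one of the Fruits_Apple, Fruits_Banana, and Fruits_Orange columns, at the cost of adding one column per category.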

Conclusion

In conclusion, label encoding has limitations, but it remains a useful tool for preprocessing data so that it can be fitted to a machine learning model that only takes numerical values.

Copyright ©2025 Educative, Inc. All rights reserved