Label encoding is a data preprocessing technique used in machine learning projects that converts categorical columns into numerical values. It plays a significant role when we need to fit data on a machine learning model that only takes numerical values.
In this Answer, we will explore how to convert categorical data stored as strings into numerical values using scikit-learn's LabelEncoder class.
Before getting into the coding part and using scikit-learn, let us first understand the result of performing label encoding on a dataset. For that, let us consider an example dataset of fruits along with their prices. The dataset is shown below:
Fruit | Price ($) |
Apple | 2 |
Banana | 3 |
Orange | 4 |
Banana | 3 |
Apple | 2 |
As we can see, the dataset contains two columns: "Fruit" and "Price ($)". If we want to fit a machine learning model that only takes numerical values on this dataset, we need to apply label encoding to the "Fruit" column. The result of applying label encoding will be:
Fruit | Price ($) |
0 | 2 |
1 | 3 |
2 | 4 |
1 | 3 |
0 | 2 |
The output shows that the values of the "Fruit" column have been converted into numerical values starting from 0. The assigned values are not random. Rather, each unique value receives a number according to its position in sorted (alphabetical) order: Apple becomes 0, Banana becomes 1, and Orange becomes 2.
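The alphabetical ordering can be checked directly. The following is a minimal sketch, assuming scikit-learn is installed; the variable names are illustrative only. After fitting, the classes_ attribute of the encoder holds the unique labels in sorted order, and the position of each label in that array is its code.

```python
from sklearn.preprocessing import LabelEncoder

# Illustrative data: the "Fruit" column from the table above.
fruit_values = ["Apple", "Banana", "Orange", "Banana", "Apple"]

encoder = LabelEncoder()
codes = encoder.fit_transform(fruit_values)

# classes_ holds the unique labels in sorted order.
sorted_labels = encoder.classes_.tolist()

# The index of each label in classes_ is the code assigned to it.
label_to_code = {label: code for code, label in enumerate(sorted_labels)}

print(sorted_labels)
print(codes.tolist())
print(label_to_code)
```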
Now, we will look at how to encode a column of a dataset. We will create a data frame from the data given above and encode its fruit column, which is named Fruits in the code. We can see the code below:
import pandas as pd
from sklearn.preprocessing import LabelEncoder

fruits = pd.DataFrame({
    'Fruits': ["Apple", "Banana", "Orange", "Banana", "Apple"],
    'Price ($)': [2, 3, 4, 3, 2]  # prices stored as numbers, as in the table above
})
encoder = LabelEncoder()
encoded_col = encoder.fit_transform(fruits["Fruits"])
fruits['Fruits'] = encoded_col
print(fruits)
Once we run the code, the Fruits column is encoded into integers, which makes it suitable for a machine learning model that only takes numerical values.
Note: scikit-learn's LabelEncoder encodes only a single column (a 1-dimensional array) at a time.
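To encode several categorical columns, one common approach is to fit a separate LabelEncoder for each column. The sketch below assumes a hypothetical "Color" column that is not part of the original dataset, and it keeps each fitted encoder in a dictionary so that the codes of each column can be looked up later.

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# "Color" is a hypothetical second categorical column added for illustration.
df = pd.DataFrame({
    "Fruits": ["Apple", "Banana", "Orange", "Banana", "Apple"],
    "Color": ["Red", "Yellow", "Orange", "Yellow", "Green"],
    "Price ($)": [2, 3, 4, 3, 2],
})

categorical_cols = ["Fruits", "Color"]
encoders = {}

for col in categorical_cols:
    enc = LabelEncoder()
    # Each column gets its own encoder, fitted on that column only.
    df[col] = enc.fit_transform(df[col])
    encoders[col] = enc

print(df)
```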
The code above is explained below:
Line 1: We import the pandas library, which is used to create the DataFrame.
Line 2: We import LabelEncoder from the sklearn.preprocessing package.
Lines 4–7: We create a DataFrame with the example data from the sections above.
Line 8: We create an instance of the LabelEncoder class and store it in the encoder variable.
Line 9: We call the fit_transform method of the encoder object and pass it the 1-dimensional array to be encoded. We store the encoded array in the encoded_col variable.
Line 10: We replace the data of the Fruits column with the encoded_col data.
Line 11: We display the updated data frame with the label-encoded column.
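A fitted encoder can also map codes back to the original labels. The following is a minimal sketch, assuming the same fruit values as above; it uses the transform and inverse_transform methods of LabelEncoder.

```python
from sklearn.preprocessing import LabelEncoder

encoder = LabelEncoder()
encoded_col = encoder.fit_transform(["Apple", "Banana", "Orange", "Banana", "Apple"])

# inverse_transform maps the integer codes back to the original labels.
decoded = encoder.inverse_transform(encoded_col).tolist()

# transform reuses the fitted mapping on values that were seen during fitting.
new_codes = encoder.transform(["Orange", "Apple"]).tolist()

print(decoded)
print(new_codes)
```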
The limitation of label encoding is that, because it converts categories into numbers starting from 0, it can introduce an unintended ordering: a model may treat a category with a higher number as having a higher priority than a category with a lower number.
For example, in our dataset, Apple is encoded as 0 and Orange is encoded as 2, but there is no priority relation between the two fruits.
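The following sketch illustrates the issue with plain arithmetic on the encoded values. It only shows how the codes behave when treated as ordinary numbers; it is not a statement about how any particular model uses them.

```python
from sklearn.preprocessing import LabelEncoder

encoder = LabelEncoder()
encoder.fit(["Apple", "Banana", "Orange"])

apple, banana, orange = encoder.transform(["Apple", "Banana", "Orange"]).tolist()

# Numerically, Orange "outranks" Apple, although no such order exists.
orange_greater_than_apple = orange > apple

# The average of the Apple and Orange codes equals Banana's code,
# which has no meaning for the fruits themselves.
average_code = (apple + orange) / 2

print(orange_greater_than_apple)
print(average_code == banana)
```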
In conclusion, label encoding does have limitations. Still, it is a useful tool for preprocessing data so that it can be fitted on a machine learning model that only takes numerical values.