Mean encoding is a technique for transforming categorical variables into numerical values based on the mean of the target variable. It’s particularly useful in classification problems, where the goal is to predict a target variable based on input features.
Consider a dataset like the one below with a categorical feature “City” and a binary target variable “Churn” indicating whether a customer churned.
CustomerID | City | Churn |
1 | New York | 0 |
2 | Paris | 1 |
3 | New York | 1 |
4 | Tokyo | 1 |
5 | Paris | 1 |
Step 1: Group by category (City)
Group by City
to get mean values for Churn
.
Step 2: Calculate mean
Calculate the mean of Churn
for each city.
Mean(Churn
) for New York = (0 + 1) / 2 = 0.5
Mean(Churn
) for Paris = (1 + 1) / 2 = 1
Mean(Churn
) for Tokyo = 1
Step 3: Replace categories
Replace the City
values with their mean Churn
values.
CustomerID | City (Mean Encoded) | Churn |
1 | 0.5 | 0 |
2 | 1 | 1 |
3 | 0.5 | 1 |
4 | 1 | 1 |
5 | 1 | 1 |
In this way, the City
variable is encoded with the mean Churn
values, providing a numerical representation of the categorical feature for machine learning models.
Now, we will look at implementing mean encoding in Python.
The first step is to import the required libraries.
import pandas as pd
In this step, we will create a simple DataFrame. We can also import our dataset.
data = {'City': ['New York', 'Paris', 'New York', 'Tokyo', 'Paris'],'Churn': [0, 1, 1, 1, 1]}df = pd.DataFrame(data)
We will define a function (mean_encode
) to perform mean encoding on a given data frame for a specified categorical and target feature.
def mean_encode(df, cat_feature, target_feature):mean_encoding = df.groupby(cat_feature)[target_feature].mean()df[cat_feature + '_mean_encoded'] = df[cat_feature].map(mean_encoding)
Now we will apply the mean_encode
function to the data frame, specifying the categorical and target features.
mean_encode(df, 'City', 'Churn')
The following code shows how we can implement mean encoding in Python:
import pandas as pd# Given DataFramedata = {'City': ['New York', 'Paris', 'New York', 'Tokyo', 'Paris'],'Churn': [0, 1, 1, 1, 1]}df = pd.DataFrame(data)def mean_encode(df, cat_feature, target_feature):mean_encoding = df.groupby(cat_feature)[target_feature].mean()df[cat_feature + '_mean_encoded'] = df[cat_feature].map(mean_encoding)mean_encode(df, 'City', 'Churn')print(df.head())
Line 1: Import the pandas
library for data manipulation.
Lines 4–6: Create a sample data frame (df
). The data frame has two columns: City
and Churn
.
Lines 8–10: This code computes the mean of the target_feature
grouped by unique values in the cat_feature
column and creates a new column by mapping these mean values to corresponding categories in the data frame df
.
Line 12: Here we call the mean_encode
function with the data frame df
, specifying City
as the categorical feature and Churn
as the target variable.
Mean encoding in Python is a technique used to represent categorical variables numerically by replacing their values with the mean of the target variable. This helps capture relationships between categories and the target variable, making it useful for machine learning models.
Free Resources