One hot encoding is a powerful technique for handling categorical data, but it can also increase dimensionality, sparsity, and the risk of overfitting.
If you’re in the field of data science, you’ve probably heard the term “one hot encoding”. Even the Sklearn documentation tells you to “encode categorical integer features using a one-hot scheme”. But, what is one hot encoding, and why do we use it?
Most machine learning tutorials and tools require you to prepare data before it can be fit to a particular ML model. One hot encoding is a process of converting categorical data variables so they can be provided to machine learning algorithms to improve predictions. One hot encoding is a crucial part of feature engineering for machine learning.
In this guide, we will introduce you to one hot encoding and show you when to use it in your ML models. We’ll provide some real-world examples with Sklearn and Pandas.
Start mastering feature engineering for ML with our hands-on course today.
Feature engineering is a crucial stage in any machine learning project. It allows you to use data to define features that enable machine learning algorithms to work properly. In this course, you will learn the techniques that will help you create new features from existing features. You’ll start by diving into label encoding, which is crucial for converting categorical features into numerical ones. You’ll also learn about various other types of encoding, such as one-hot, count, and mean, all of which are important for feature engineering. In the remaining chapters, you’ll learn about feature interaction and datetime features. In all, this course will show you the many different ways you can create features from existing ones.
Categorical data refers to variables that are made up of label values, for example, a “color” variable could have the values “red,” “blue,” and “green.” Think of values like different categories that sometimes have a natural ordering to them.
Some machine learning algorithms, such as decision trees, can work directly with categorical data depending on the implementation, but most require input and output variables to be numeric. This means that any categorical data must be mapped to integers.
One hot encoding is one method of converting data to prepare it for an algorithm and get a better prediction. With one-hot, we convert each categorical value into a new column and assign a binary value of 1 or 0 to that column. Each category is represented as a binary vector: all positions are zero except the position corresponding to that category, which is marked with a 1.
Let’s apply this to an example. Say we have the values red and blue. With one-hot, we would assign red with a numeric value of 0 and blue with a numeric value of 1.
It’s crucial to be consistent when we use these values. This makes it possible to invert our encoding at a later point and recover the original categorical values.
Once we assign numeric values, we create a binary vector that represents them. In this case, our vector will have a length of 2 since we have 2 values. Thus, the red value can be represented with the binary vector [1, 0], and the blue value with [0, 1].
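The mapping above can be sketched in plain Python. This is a minimal illustration of the idea, not a library API:

```python
# Build a one-hot vector for a value, given a fixed ordering of categories.
categories = ["red", "blue"]

def one_hot(value, categories):
    vector = [0] * len(categories)        # start with all zeros
    vector[categories.index(value)] = 1   # mark the value's position
    return vector

print(one_hot("red", categories))   # [1, 0]
print(one_hot("blue", categories))  # [0, 1]
```

Keeping the `categories` list fixed is what makes the encoding consistent and therefore invertible.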
One hot encoding is useful for categorical data whose values have no ordinal relationship to each other. Machine learning algorithms treat the order of numbers as significant: they will read a higher number as better or more important than a lower number.
While this is helpful for some ordinal situations, some input data does not have any ranking for category values, and this can lead to issues with predictions and poor performance. That’s when one hot encoding saves the day.
One hot encoding makes our training data more useful and expressive, and it can be rescaled easily. By using numeric values, we can more easily determine a probability for our values. In particular, one hot encoding is often applied to output values, since predicting a probability per class is more nuanced than predicting a single label.
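To make the point about probabilities concrete, here is a small illustration (with made-up numbers) of how per-class model outputs line up with one-hot positions:

```python
import numpy as np

classes = ["red", "blue"]
one_hot_target = np.array([1, 0])      # true class: "red"
model_output = np.array([0.92, 0.08])  # hypothetical per-class probabilities

# The probabilities share the one-hot vector's ordering, so we can read
# the model's confidence per class and pick the most likely one by index.
predicted_class = classes[model_output.argmax()]
print(predicted_class)  # red
```

Because the probability vector and the one-hot target share an index per class, loss functions can compare them position by position.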
Manually converting our data to numerical values includes two basic steps:
For the first step, we need to assign each category value with an integer, or numeric, value. If we had the values red, yellow, and blue, we could assign them 1, 2, and 3 respectively.
When dealing with categorical variables that have no order or relationship, we need to take this one step further. Step two involves applying one-hot encoding to the integers we just assigned. To do this, we remove the integer-encoded variable and add one binary variable for each unique category value.
Above, we had three categories, or colors, so we use three binary variables. We place a 1 in the binary variable for the matching color and a 0 for the other two colors.
| red | yellow | blue |
|---|---|---|
| 1 | 0 | 0 |
| 0 | 1 | 0 |
| 0 | 0 | 1 |
Note: In many other fields, binary variables are referred to as dummy variables.
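The two manual steps above can be sketched in plain Python. This is a minimal illustration using the colors and integers from the example:

```python
# Step 1: assign each category value an integer.
int_encoding = {"red": 1, "yellow": 2, "blue": 3}

# Step 2: replace each integer with a binary (dummy) variable per category.
def to_binary_variables(color):
    vector = [0, 0, 0]
    vector[int_encoding[color] - 1] = 1   # integer 1 -> position 0, etc.
    return vector

for color in int_encoding:
    print(color, to_binary_variables(color))
```

Each row of the output reproduces one row of the table above.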
We don’t have to one hot encode manually. Many data science tools offer easy ways to encode your data. The Python library Pandas provides a function called get_dummies
to enable one-hot encoding.
df_new = pd.get_dummies(df, columns=["col1"], prefix="Planet")
Let’s see this in action.
import pandas as pd

df = pd.DataFrame({"col1": ["Sun", "Sun", "Moon", "Earth", "Moon", "Venus"]})
print("The original data")
print(df)
print("*" * 30)

df_new = pd.get_dummies(df, columns=["col1"], prefix="Planet")
print("The transformed data using get_dummies")
print(df_new)
We call get_dummies to do one-hot encoding on a pandas DataFrame object. The parameter prefix sets the prefix of the new column names. Let’s apply this to a practical example. Say we have the following dataset.
import pandas as pd

ids = [11, 22, 33, 44, 55]
cities = ['Seattle', 'London', 'Lahore', 'Berlin', 'Abuja']
df = pd.DataFrame(list(zip(ids, cities)),
                  columns=['Ids', 'Cities'])
Here we have a Pandas dataframe called df with two columns: Ids and Cities. Let’s call head() to get this result:
|   | Ids | Cities |
|---|---|---|
| 0 | 11 | Seattle |
| 1 | 22 | London |
| 2 | 33 | Lahore |
| 3 | 44 | Berlin |
| 4 | 55 | Abuja |
We see here that the Cities column contains our categorical values: the names of our cities. We can convert them into new binary columns using the get_dummies() function we discussed above.
y = pd.get_dummies(df.Cities, prefix='City')
print(y.head())
Here, we are passing the value City
for the prefix
attribute of the method get_dummies()
. If we run the code now, we will print our encoded values:
import pandas as pd

df = pd.DataFrame({"col1": ["Seattle", "London", "Lahore", "Berlin", "Abuja"]})
print("The original data")
print(df)
print("*" * 30)

df_new = pd.get_dummies(df, columns=["col1"], prefix="Cities")
print("The transformed data using get_dummies")
print(df_new)
import sklearn.preprocessing as preprocessing
import numpy as np

targets = np.array(["red", "green", "blue", "yellow", "pink", "white"])

labelEnc = preprocessing.LabelEncoder()
new_target = labelEnc.fit_transform(targets)

onehotEnc = preprocessing.OneHotEncoder()
onehotEnc.fit(new_target.reshape(-1, 1))
targets_trans = onehotEnc.transform(new_target.reshape(-1, 1))

print("The original data")
print(targets)
print("The transformed data using OneHotEncoder")
print(targets_trans.toarray())
We use LabelEncoder to convert the strings to integers. Then, we create a OneHotEncoder object, fit it to the integer-encoded targets with fit(), and call transform() to produce the binary vectors.
Note: In newer versions of sklearn, you don’t need to convert the strings to integers first, as OneHotEncoder handles strings automatically.
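As a quick sketch of that note (assuming a recent scikit-learn version), OneHotEncoder can encode string categories directly, with no LabelEncoder step:

```python
from sklearn.preprocessing import OneHotEncoder
import numpy as np

# Newer scikit-learn versions accept strings directly:
# no LabelEncoder step is needed first.
targets = np.array([["red"], ["green"], ["blue"]])

encoder = OneHotEncoder()
encoded = encoder.fit_transform(targets)  # sparse matrix by default

# Categories are sorted alphabetically: blue, green, red
print(encoder.categories_)
print(encoded.toarray())
```

Here, "red" encodes to [0. 0. 1.] because the encoder sorts categories alphabetically.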
Let’s see the OneHotEncoder
class in action with another example. First, here’s how to import the class.
from sklearn.preprocessing import OneHotEncoder
Like before, we first populate our list of unique values for the encoder.
x = [[11, "Seattle"], [22, "London"], [33, "Lahore"], [44, "Berlin"], [55, "Abuja"]]
y = OneHotEncoder().fit_transform(x).toarray()
print(y)
When we print this, we get the following encoded values. Note that both columns are encoded: the first five positions represent the id, and the last five represent the city (in alphabetical order):
[[1. 0. 0. 0. 0. 0. 0. 0. 0. 1.]
 [0. 1. 0. 0. 0. 0. 0. 0. 1. 0.]
 [0. 0. 1. 0. 0. 0. 0. 1. 0. 0.]
 [0. 0. 0. 1. 0. 0. 1. 0. 0. 0.]
 [0. 0. 0. 0. 1. 1. 0. 0. 0. 0.]]
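Earlier we noted that consistent encoding lets us invert the transformation to recover the original categorical values. With OneHotEncoder, a sketch of this (using the same data as above) looks like:

```python
from sklearn.preprocessing import OneHotEncoder

x = [[11, "Seattle"], [22, "London"], [33, "Lahore"],
     [44, "Berlin"], [55, "Abuja"]]

encoder = OneHotEncoder()
y = encoder.fit_transform(x).toarray()

# inverse_transform maps each binary vector back to the original row.
# Note: NumPy coerces the mixed int/string input to strings, so the
# recovered ids come back as strings like "11".
recovered = encoder.inverse_transform(y)
print(recovered)
```

Keeping a reference to the fitted encoder is what makes this round trip possible.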
Congrats on making it to the end! You should now have a good idea of what one hot encoding does and how to implement it in Python. There is still a lot to learn to master machine learning feature engineering. Your next steps are:
To get introduced to these, check out Educative’s mini course Feature Engineering for Machine Learning. You’ll learn the techniques to create new ML features from existing features. You’ll start by diving into label encoding, which is crucial for converting categorical features into numerical ones. In the remaining chapters, you’ll learn about feature interaction and datetime features.
Happy learning!