Deep Dive: Categorical Features

Learn about the transformation of categorical features and their significance.

Understanding categorical features

Most machine learning algorithms work only with numerical inputs. If your data contains text features, for example, these must be transformed into numbers in some way. We learned in the previous lesson that the data for our case study is, in fact, entirely numerical. However, it’s worth thinking about how it got to be that way. In particular, consider the EDUCATION feature.

This is an example of what is called a categorical feature: you can imagine that as raw data, this column consisted of the text labels “graduate school,” “university,” “high school,” and “others.” These are called the levels of the categorical feature; here, there are four levels. It is only through a mapping, which has already been chosen for us, that this data exists as the numbers 1, 2, 3, and 4 in our dataset. This particular assignment of categories to numbers creates what is known as an ordinal feature, because the levels are mapped to numbers in order. As a data scientist, at a minimum, you need to be aware of such mappings, if you are not choosing them yourself.
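To make the idea of a mapping concrete, here is a minimal sketch of how such an encoding could be applied with pandas. The raw text column, its name EDUCATION_LABEL, and the sample rows are hypothetical; the assignment of labels to the numbers 1 through 4 is an assumption based on the order in which the levels are listed above, not a confirmed description of how the case study data was prepared.

```python
import pandas as pd

# Hypothetical raw data: text labels as they might have looked before encoding.
raw = pd.DataFrame({
    "EDUCATION_LABEL": [
        "university", "graduate school", "high school", "others", "university",
    ]
})

# Assumed mapping from text levels to ordered integers (1 = highest level).
education_map = {
    "graduate school": 1,
    "university": 2,
    "high school": 3,
    "others": 4,
}

# Series.map looks up each label in the dictionary; labels that are
# missing from the dictionary would become NaN.
raw["EDUCATION"] = raw["EDUCATION_LABEL"].map(education_map)
print(raw)
```

Keeping the mapping in an explicit dictionary means the choice of order is visible and easy to change, which matters because the numbers will later be read by a model as quantities.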

What are the implications of this mapping?

It makes some sense that the education levels are ranked, with 1 corresponding to the highest level of education in our dataset, 2 to the next highest, 3 to the next, and 4 presumably including the lowest levels. However, when you use this encoding as a numerical feature in a machine learning model, it will be treated just like any other numerical feature. That is, a model may treat the size of the numbers as meaningful, as if the step from 1 to 2 were equivalent to the step from 3 to 4. For some models, this effect may not be desired.

What if a model seeks to find a straight-line relationship between the features and response?

This may seem like an arbitrary question, but later in the course you will learn the importance of distinguishing between linear and non-linear models. In this section, we briefly introduce the idea that some models look for linear relationships between the features and the response variable. Whether this would work well for the education feature depends on the actual relationship between the different levels of education and the outcome we are trying to predict.

Here, we examine two hypothetical cases of synthetic data with ordinal categorical variables, each with 10 levels. The levels measure the self-reported satisfaction of customers visiting a website. The average number of minutes spent on the website for customers reporting each level is plotted on the y-axis. We’ve also plotted the line of best fit in each case to illustrate how a linear model would deal with this data, as shown in the following figure:
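Because the two cases are synthetic, a comparable setup can be sketched in a few lines of NumPy. The numbers below are illustrative assumptions, not the values behind the lesson's figure: the first case assumes time on site rises steadily with satisfaction, and the second assumes it peaks at the middle levels. The helper fit_line is a hypothetical name introduced for this example.

```python
import numpy as np

# Ten ordinal satisfaction levels, encoded as the integers 1..10.
levels = np.arange(1, 11)

# Assumed case 1: average minutes increase steadily with the level.
minutes_linear = 2.0 * levels + 1.0
# Assumed case 2: average minutes peak at the middle levels.
minutes_curved = -0.5 * (levels - 5.5) ** 2 + 12.0


def fit_line(x, y):
    """Fit a least-squares line and return (slope, intercept, R^2)."""
    slope, intercept = np.polyfit(x, y, deg=1)
    predicted = slope * x + intercept
    ss_res = np.sum((y - predicted) ** 2)
    ss_tot = np.sum((y - y.mean()) ** 2)
    return slope, intercept, 1.0 - ss_res / ss_tot


results = {
    "linear": fit_line(levels, minutes_linear),
    "curved": fit_line(levels, minutes_curved),
}

for name, (slope, intercept, r2) in results.items():
    print(f"{name}: slope={slope:.3f}, intercept={intercept:.3f}, R^2={r2:.3f}")
```

In this sketch, the line of best fit describes the first case almost perfectly, while in the second case it is essentially flat and explains almost none of the variation, even though the levels clearly carry information about time spent. Real data would be noisier, but the contrast illustrates why an ordinal encoding suits a linear model only when the response changes roughly in step with the levels.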
