Linear Regression for One Categorical Explanatory Variable

Perform linear regression for a categorical variable in R and learn the principles behind it.

We previously introduced simple linear regression, which models the relationship between a numerical outcome variable 𝑦 and a numerical explanatory variable 𝑥. In our life expectancy example, we now instead have a categorical explanatory variable, continent. Our model won’t yield a best-fitting regression line as it did previously, but rather offsets relative to a baseline for comparison.

As we did before when studying the relationship between teaching scores and beauty scores, let’s output the regression table for this model. Recall that this is done in two steps:

  1. We first fit the linear regression model using the lm(y ~ x, data) function and save it in lifeExp_model.

  2. We get the regression table as the code output by applying the get_regression_table() function from the moderndive package to lifeExp_model.

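library(moderndive)  # provides get_regression_table()
# Fit the model, then print its regression table
lifeExp_model <- lm(lifeExp ~ continent, data = gapminder2007)
get_regression_table(lifeExp_model)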

Let’s once again focus on the values in the term and estimate columns. Why are there now five rows? Let’s break them down one by one:

  • The intercept row corresponds to the mean life expectancy of countries in Africa, i.e., 54.8 years.

  • The continentAmericas row corresponds to countries in the Americas, and the value +18.8 is the difference in mean life expectancy relative to Africa. In other words, the mean life expectancy of countries in the Americas is 54.8 + 18.8 = 73.6 years.

  • The continentAsia row corresponds to countries in Asia, and the value +15.9 is the difference in mean life expectancy relative to Africa. In other words, the mean life expectancy of countries in Asia is 54.8 + 15.9 = 70.7 years.

  • The continentEurope row corresponds to countries in Europe, and the value +22.8 is the difference in mean life expectancy relative to Africa. In other words, the mean life expectancy of countries in Europe is 54.8 + 22.8 = 77.6 years.

  • The continentOceania row corresponds to countries in Oceania, and the value +25.9 is the difference in mean life expectancy relative to Africa. In other words, the mean life expectancy of countries in Oceania is 54.8 + 25.9 = 80.7 years.

To summarize, the five values in the estimate column correspond to the baseline for comparison, Africa (the intercept), and four offsets from this baseline, one for each of the remaining four continents: the Americas, Asia, Europe, and Oceania.
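The "baseline plus offsets" reading can be checked numerically. The following is a minimal sketch in base R on a hypothetical toy data frame (the name `toy` and its values are made up for illustration and are not the gapminder figures): the intercept should equal the baseline group's mean, and the intercept plus each offset should equal the other groups' means.

```r
# Hypothetical toy data: made-up life expectancies, NOT the gapminder values
toy <- data.frame(
  continent = factor(c("Africa", "Africa", "Asia", "Asia", "Europe", "Europe")),
  lifeExp   = c(50, 60, 68, 72, 76, 80)
)

# Same model form as in the lesson: numerical outcome ~ categorical variable
toy_model <- lm(lifeExp ~ continent, data = toy)
b <- coef(toy_model)

# Mean of the outcome within each continent, computed directly from the data
group_means <- tapply(toy$lifeExp, toy$continent, mean)

# Baseline mean (intercept) plus each offset; the baseline's own offset is 0
from_model <- b[1] + c(0, b[-1])
names(from_model) <- levels(toy$continent)

print(from_model)
print(all.equal(unname(from_model), as.vector(group_means)))
```

The same comparison should carry over to lifeExp_model and gapminder2007, assuming that data frame is available in the session.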

We might be asking at this point why Africa has been chosen as the baseline group for comparison. This is because, of the five continents, Africa comes first alphabetically. By default, R orders the levels of a factor (categorical variable) alphanumerically, and lm() uses the first level as the baseline. We can change this baseline group to be another continent if we manipulate the factor levels of the variable continent using the forcats package.
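As a rough illustration of changing the baseline, here is a sketch that uses base R's relevel() rather than forcats, so it runs without extra packages; forcats offers fct_relevel() for the same purpose. The data frame `toy`, the column `continent_eu`, and all values are hypothetical.

```r
# Hypothetical toy data: made-up life expectancies, NOT the gapminder values
toy <- data.frame(
  continent = factor(c("Africa", "Africa", "Asia", "Asia", "Europe", "Europe")),
  lifeExp   = c(50, 60, 68, 72, 76, 80)
)

# Default: levels are sorted alphanumerically, so "Africa" is the baseline
print(levels(toy$continent))

# Move "Europe" to the front so that it becomes the baseline level
toy$continent_eu <- relevel(toy$continent, ref = "Europe")
print(levels(toy$continent_eu))

# Refit: the intercept is now Europe's mean, and the offsets are relative to Europe
eu_model <- lm(lifeExp ~ continent_eu, data = toy)
print(coef(eu_model))
```

Only the reference point changes: the fitted group means are the same as before, but every offset is now a difference from Europe instead of Africa.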

Let’s now write the equation for our fitted values:
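Reading the five estimates off the regression table, and writing $b_0$ for the intercept and $b_{\text{Amer}}$, $b_{\text{Asia}}$, $b_{\text{Euro}}$, $b_{\text{Ocean}}$ for the four offsets (these symbol names are an assumption made here for illustration), the fitted values can be written as:

```latex
\begin{aligned}
\widehat{y} = \widehat{\text{lifeExp}}
  &= b_0 + b_{\text{Amer}} \cdot 1_{\text{Amer}}(x) + b_{\text{Asia}} \cdot 1_{\text{Asia}}(x)
     + b_{\text{Euro}} \cdot 1_{\text{Euro}}(x) + b_{\text{Ocean}} \cdot 1_{\text{Ocean}}(x) \\
  &= 54.8 + 18.8 \cdot 1_{\text{Amer}}(x) + 15.9 \cdot 1_{\text{Asia}}(x)
     + 22.8 \cdot 1_{\text{Euro}}(x) + 25.9 \cdot 1_{\text{Ocean}}(x)
\end{aligned}
```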

Don’t worry! Once we understand what all the elements mean, things will simplify greatly. First, $1_{\texttt{A}}(x)$ is what’s known in mathematics as an indicator function. It returns only one of two possible values, 0 or 1, depending on whether $x$ belongs to the category $\texttt{A}$.
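Stated formally, for a generic category $\texttt{A}$ (here, one of the continents) the standard definition of an indicator function is:

```latex
1_{\texttt{A}}(x) =
\begin{cases}
1 & \text{if } x \text{ is in } \texttt{A} \\
0 & \text{otherwise}
\end{cases}
```

For a country in Asia, for example, only the Asia indicator equals 1, so the fitted value from the regression table is 54.8 + 15.9 = 70.7. For a country in Africa, every indicator equals 0 and the fitted value is just the intercept, 54.8.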

In a statistical modeling context, this is also known as a ...
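The 0/1 encoding described above can be inspected directly in R. This is a small sketch on a hypothetical four-row data frame (`toy` and its values are made up for illustration): model.matrix() shows the columns that lm() builds from a factor, and the same column can be reproduced by hand with a comparison.

```r
# Hypothetical toy data: four made-up observations
toy <- data.frame(continent = factor(c("Africa", "Asia", "Europe", "Asia")))

# Design matrix for a model with one categorical explanatory variable:
# an intercept column plus one 0/1 column per non-baseline level
X <- model.matrix(~ continent, data = toy)
print(X)

# The indicator for "Asia", written by hand: 1 when the row is in Asia, else 0
asia_indicator <- as.integer(toy$continent == "Asia")
print(asia_indicator)
```

Note that the baseline level (Africa) gets no column of its own: a row from the baseline is the one in which every indicator column is 0.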