Linear Regression for One Categorical Explanatory Variable
Perform linear regression for a categorical variable in R and learn the principles behind it.
We introduced simple linear regression, which models the relationship between a numerical outcome variable and a single explanatory variable. Here the outcome variable is life expectancy, lifeExp, and the explanatory variable is categorical: continent. Our model won’t yield a best-fitting regression line like it did previously, but rather offsets relative to a baseline for comparison.
As we did before when studying the relationship between teaching scores and beauty scores, let’s output the regression table for this model. Recall that this is done in two steps:
We first fit the linear regression model using the lm(y ~ x, data) function and save it in lifeExp_model.
We then get the regression table as the code output by applying the get_regression_table() function from the moderndive package to lifeExp_model.
library(moderndive)  # provides get_regression_table()
lifeExp_model <- lm(lifeExp ~ continent, data = gapminder2007)
get_regression_table(lifeExp_model)
Let’s once again focus on the values in the term and estimate columns. Why are there now five rows? Let’s break them down one by one:
The intercept row corresponds to the mean life expectancy of countries in Africa, i.e., 54.8 years.
The continentAmericas row corresponds to countries in the Americas, and the value +18.8 is the difference in mean life expectancy relative to Africa. In other words, the mean life expectancy of countries in the Americas is 54.8 + 18.8 = 73.6 years.
The continentAsia row corresponds to countries in Asia, and the value +15.9 is the difference in mean life expectancy relative to Africa. In other words, the mean life expectancy of countries in Asia is 54.8 + 15.9 = 70.7 years.
The continentEurope row corresponds to countries in Europe, and the value +22.8 is the difference in mean life expectancy relative to Africa. In other words, the mean life expectancy of countries in Europe is 54.8 + 22.8 = 77.6 years.
The continentOceania row corresponds to countries in Oceania, and the value +25.9 is the difference in mean life expectancy relative to Africa. In other words, the mean life expectancy of countries in Oceania is 54.8 + 25.9 = 80.7 years.
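The "baseline plus offset" reading can be checked directly by comparing the model's coefficients with per-continent means. The following is a minimal sketch, assuming gapminder2007 is simply the gapminder data (from the gapminder package) restricted to the year 2007; the names group_means and reconstructed are ours, introduced only for illustration.

```r
library(gapminder)

# Assumption: gapminder2007 is the gapminder data filtered to the year 2007
gapminder2007 <- subset(gapminder, year == 2007)

# Fit the same model as in the lesson
lifeExp_model <- lm(lifeExp ~ continent, data = gapminder2007)

# Mean life expectancy per continent, computed directly from the data
group_means <- tapply(gapminder2007$lifeExp, gapminder2007$continent, mean)

# Rebuild each continent's mean as baseline (intercept) + offset
b <- coef(lifeExp_model)
reconstructed <- c(b[["(Intercept)"]], b[["(Intercept)"]] + b[-1])
names(reconstructed) <- levels(gapminder2007$continent)

# Side-by-side comparison, rounded to one decimal place
print(round(rbind(group_means, reconstructed), 1))
```

If the assumption about the data holds, the two rows of the printed comparison should agree, which is just another way of saying that each non-intercept estimate is a difference in means relative to the baseline continent.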
To summarize, the five values in the estimate column correspond to the baseline for comparison, Africa (the intercept), as well as four offsets from this baseline for the remaining four continents: the Americas, Asia, Europe, and Oceania.
We might be asking at this point why Africa was chosen as the baseline for comparison. This is because Africa comes first alphabetically among the five continents. By default, R arranges the levels of factors/categorical variables in alphanumeric order. We can change the baseline group to another continent by manipulating the factor levels of the variable continent using the forcats package.
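As a rough illustration of changing the baseline, the sketch below moves the Americas to the front of the factor levels with fct_relevel() from forcats and refits the model. It assumes gapminder2007 is the gapminder data filtered to 2007; the choice of the Americas and the names gapminder2007_americas and americas_model are hypothetical, picked only for this example.

```r
library(gapminder)
library(forcats)

# Assumption: gapminder2007 is the gapminder data filtered to the year 2007
gapminder2007 <- subset(gapminder, year == 2007)

# Copy the data and move "Americas" to the first factor level,
# which makes it the baseline for comparison in lm()
gapminder2007_americas <- gapminder2007
gapminder2007_americas$continent <-
  fct_relevel(gapminder2007_americas$continent, "Americas")

# Refit: the intercept is now the mean for the Americas, and the
# remaining rows are offsets relative to the Americas
americas_model <- lm(lifeExp ~ continent, data = gapminder2007_americas)
print(coef(americas_model))
```

Base R's relevel(x, ref = "Americas") would achieve the same reordering; fct_relevel() is used here only because the text points to forcats. The fitted means per continent do not change, only the group they are expressed relative to.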
Let’s now write the equation for our fitted values:
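One way to write it, as a sketch in indicator-function notation, uses the estimates from the regression table. The symbols here are our own labels, assumed for illustration: $b_0$ is the intercept, $b_{\text{Amer}}$ through $b_{\text{Ocean}}$ are the four offsets, and $1_{A}(x)$ equals 1 when country $x$ is in continent $A$ and 0 otherwise.

```latex
\begin{aligned}
\widehat{y} = \widehat{\text{life exp}}
  &= b_0 + b_{\text{Amer}} \cdot 1_{\text{Amer}}(x) + b_{\text{Asia}} \cdot 1_{\text{Asia}}(x)
     + b_{\text{Euro}} \cdot 1_{\text{Euro}}(x) + b_{\text{Ocean}} \cdot 1_{\text{Ocean}}(x) \\
  &= 54.8 + 18.8 \cdot 1_{\text{Amer}}(x) + 15.9 \cdot 1_{\text{Asia}}(x)
     + 22.8 \cdot 1_{\text{Euro}}(x) + 25.9 \cdot 1_{\text{Ocean}}(x)
\end{aligned}
```

For a country in Africa every indicator is 0, so the fitted value is 54.8; for a country in Asia only $1_{\text{Asia}}(x)$ is 1, giving 54.8 + 15.9 = 70.7.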
Don’t worry! Once we understand what all the elements mean, things will simplify greatly. First,
In a statistical modeling context, this is also known as a ...