Regression Refresher

Revisit the teaching evaluations analysis and the sampling scenarios that underpin inference for regression.

Needed packages

Let’s load all the packages needed for this chapter. Loading the tidyverse package by running library(tidyverse) loads the following commonly used data science packages all at once:

  • ggplot2: This is for data visualization.

  • dplyr: This is for data wrangling.

  • tidyr: This is for converting data to the tidy format.

  • readr: This is for importing spreadsheet data into R.

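  • purrr, tibble, stringr, and forcats: These are more advanced packages for functional programming, modern data frames, string manipulation, and working with factors, respectively.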

library(tidyverse)
library(moderndive)
library(infer)

Before jumping into inference for regression, let’s remind ourselves of the University of Texas at Austin teaching evaluations analysis.

Teaching evaluations analysis

Using simple linear regression, we modeled the relationship between:

  • A numerical outcome variable y (the instructor’s teaching score)

  • A single numerical explanatory variable x (the instructor’s beauty score)

We first created an evals_ch5 data frame that selected a subset of variables from the evals data frame included in the moderndive package. This evals_ch5 data frame contains only the variables of interest for our analysis, in particular the instructor’s teaching score and the beauty rating bty_avg:

evals_ch5 <- evals %>%
  select(ID, score, bty_avg, age)
glimpse(evals_ch5)

We performed an exploratory data analysis of the relationship between the two variables score and bty_avg. We saw that a weakly positive correlation of 0.187 exists between them.
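As a quick check on that figure, here is a minimal sketch of how the correlation could be recomputed. It assumes the moderndive package is installed and uses base R's cor() inside dplyr's summarize(); the name evals_cor is only illustrative and does not come from the original analysis.

```r
library(dplyr)
library(moderndive)

# Pearson correlation between the teaching score and the beauty score.
# evals is the data frame shipped with moderndive; evals_cor is a
# hypothetical name used only for this sketch.
evals_cor <- evals %>%
  summarize(correlation = cor(score, bty_avg))

evals_cor
```

The moderndive package also provides a get_correlation() helper that takes a formula such as score ~ bty_avg, which should return the same value.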

This is evident in the scatterplot below, which also shows the best-fitting regression line summarizing the linear relationship between score and bty_avg. We defined the best-fitting line as the line that minimizes the sum of squared residuals.
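To make the "minimizes the sum of squared residuals" definition concrete, the sketch below compares the least-squares line against two slightly different lines. The helper ssr() and the two alternative lines are hypothetical, introduced only for illustration; the fit itself uses lm() on the evals data from moderndive.

```r
library(moderndive)

# Fit the least-squares line and pull out its intercept and slope
fit <- lm(score ~ bty_avg, data = evals)
b0 <- coef(fit)[["(Intercept)"]]
b1 <- coef(fit)[["bty_avg"]]

# Hypothetical helper: sum of squared residuals for any candidate line
ssr <- function(intercept, slope) {
  fitted_values <- intercept + slope * evals$bty_avg
  sum((evals$score - fitted_values)^2)
}

ssr_best    <- ssr(b0, b1)          # the least-squares line
ssr_steeper <- ssr(b0, b1 + 0.05)   # illustrative alternative: steeper slope
ssr_shifted <- ssr(b0 + 0.1, b1)    # illustrative alternative: higher intercept

c(best = ssr_best, steeper = ssr_steeper, shifted = ssr_shifted)
```

Any line other than the fitted one should give a larger sum, which is what "best-fitting" means here.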

ggplot(evals_ch5, aes(x = bty_avg, y = score)) +
  geom_point() +
  labs(x = "Beauty Score",
       y = "Teaching Score",
       title = "Relationship between teaching and beauty scores") +
  geom_smooth(method = "lm", se = FALSE)

Looking at this plot again, we might ask: Is the slope of that line really all that positive? The line does rise from left to right as bty_avg increases, but by how much?

To get to this information, recall that we followed a two-step procedure:

  1. We first fit the linear regression model using the lm() function with the formula score ~ bty_avg and saved this model in score_model.

  2. We then obtained the regression table by applying the get_regression_table() function from the moderndive package to score_model.

# Fit regression model:
score_model <- lm(score ~ bty_avg, data = evals_ch5)
# Get regression table:
get_regression_table(score_model)
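As a cross-check on the regression table, the same fitted values can also be read with base R. This is a minimal sketch, assuming the moderndive package is installed for the evals data; it refits the model on evals directly so the snippet stands on its own.

```r
library(moderndive)

# Refit the same model so this snippet is self-contained
score_model <- lm(score ~ bty_avg, data = evals)

# Fitted intercept and slope as a named numeric vector
coef(score_model)

# Estimates with standard errors, t statistics, and p-values
summary(score_model)$coefficients

# 95% confidence intervals for the intercept and slope
confint(score_model, level = 0.95)
```

The values from coef() should agree with the estimate column of get_regression_table(), up to the rounding that function applies.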

Using the values in the estimate column of the resulting regression table, we can then obtain the equation of the best-fitting regression line in the figure above:

$$\hat{y} = b_0 + b_1 \cdot x$$

$$\widehat{\text{score}} = 3.880 + 0.067 \cdot \text{bty\_avg}$$

Here, $b_0$ is the fitted intercept and $b_1$ ...