Regression Refresher
Learn about evaluation analysis and sampling scenarios in inference for regression.
Needed packages
Let’s load all the packages needed for this chapter. Loading the tidyverse package by running library(tidyverse) loads the following commonly used data science packages all at once:
ggplot2: This is for data visualization.
dplyr: This is for data wrangling.
tidyr: This is for converting data to the tidy format.
readr: This is for importing spreadsheet data into R.
purrr, tibble, stringr, and forcats: These are the more advanced packages.
library(tidyverse)
library(moderndive)
library(infer)
Before jumping into inference for regression, let’s remind ourselves of the University of Texas at Austin teaching evaluations analysis.
Teaching evaluations analysis
Using simple linear regression, we modeled the relationship between:
A numerical outcome variable y (the instructor’s teaching score)
A single numerical explanatory variable x (the instructor’s beauty score)
We first created an evals_ch5 data frame that selected a subset of variables from the evals data frame included in the moderndive package. This evals_ch5 data frame contains only the variables of interest for our analysis, in particular the instructor’s teaching score and the beauty rating bty_avg:
evals_ch5 <- evals %>%
  select(ID, score, bty_avg, age)
glimpse(evals_ch5)
We performed an exploratory data analysis of the relationship between the two variables score and bty_avg. We saw there that a weakly positive correlation of 0.187 existed between the two variables.
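As a quick check of the correlation quoted above, the following sketch recomputes it in two ways. It assumes the moderndive and dplyr packages are installed and that the evals data frame shipped with moderndive matches the one used in this lesson; the names r_base and r_tbl are introduced here only for illustration.

```r
library(moderndive)  # provides the evals data frame and get_correlation()
library(dplyr)       # part of the tidyverse; provides %>% and select()

# Rebuild the lesson's data frame so this block runs on its own
evals_ch5 <- evals %>%
  select(ID, score, bty_avg, age)

# Base R: Pearson correlation between the two numerical columns
r_base <- cor(evals_ch5$score, evals_ch5$bty_avg)

# moderndive wrapper: returns a one-row data frame with a column named cor
r_tbl <- evals_ch5 %>%
  get_correlation(formula = score ~ bty_avg)

round(r_base, 3)
r_tbl
```

Both approaches compute the same Pearson correlation coefficient; the wrapper is convenient inside a pipeline, while cor() returns a bare number.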
This is evidenced in the figure below of the scatterplot along with the best-fitting regression line that summarizes the linear relationship between the two variables score and bty_avg. We defined a best-fitting line as the line that minimizes the sum of the squared residuals.
ggplot(evals_ch5, aes(x = bty_avg, y = score)) +
  geom_point() +
  labs(
    x = "Beauty Score",
    y = "Teaching Score",
    title = "Relationship between teaching and beauty scores"
  ) +
  geom_smooth(method = "lm", se = FALSE)
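The least-squares definition of the best-fitting line can be checked numerically. The sketch below uses a small toy data set invented purely for illustration (it is not the evals data), and the helper ssr() is a hypothetical name introduced here. It compares the sum of squared residuals of the line fitted by lm() with that of two slightly perturbed lines.

```r
# Hypothetical toy data, invented purely for illustration
x <- c(1, 2, 3, 4, 5, 6)
y <- c(2.1, 2.9, 3.2, 4.8, 5.1, 5.9)

# Sum of squared residuals for a candidate line y = b0 + b1 * x
ssr <- function(b0, b1) sum((y - (b0 + b1 * x))^2)

# Least-squares fit
fit <- lm(y ~ x)
b0_hat <- unname(coef(fit)[1])  # fitted intercept
b1_hat <- unname(coef(fit)[2])  # fitted slope

# Compare the fitted line against two nearby candidate lines
ssr_fit     <- ssr(b0_hat, b1_hat)
ssr_steeper <- ssr(b0_hat, b1_hat + 0.1)
ssr_shifted <- ssr(b0_hat + 0.1, b1_hat)

c(fitted = ssr_fit, steeper = ssr_steeper, shifted = ssr_shifted)
```

Any change to the fitted intercept or slope increases the sum of squared residuals, which is what "best-fitting" means under this criterion.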
Looking at this plot again, the following questions might be asked: Is the slope of that line really all that positive? The line does rise from left to right as the bty_avg variable increases, but by how much?
To get to this information, recall that we followed a two-step procedure:
We first fit the linear regression model using the lm() function with the formula score ~ bty_avg and save this model in score_model.
We get the regression table by applying the get_regression_table() function from the moderndive package to score_model.
# Fit regression model:
score_model <- lm(score ~ bty_avg, data = evals_ch5)
# Get regression table:
get_regression_table(score_model)
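To work with the fitted intercept and slope programmatically rather than reading them off the printed table, something like the following sketch may help. It assumes moderndive and dplyr are installed; the names reg_table, b0, and b1 and the example beauty score of 5 are introduced here only for illustration.

```r
library(moderndive)  # provides evals and get_regression_table()
library(dplyr)       # part of the tidyverse; provides %>%, select(), filter(), pull()

# Rebuild the data frame and model so this block runs on its own
evals_ch5 <- evals %>%
  select(ID, score, bty_avg, age)
score_model <- lm(score ~ bty_avg, data = evals_ch5)
reg_table <- get_regression_table(score_model)

# Slope: the row of the table whose term is bty_avg
b1 <- reg_table %>% filter(term == "bty_avg") %>% pull(estimate)

# Intercept: the only other row of this two-row table
b0 <- reg_table %>% filter(term != "bty_avg") %>% pull(estimate)

# Fitted teaching score for a hypothetical instructor with bty_avg = 5
b0 + b1 * 5
```

The values in the estimate column are rounded by get_regression_table(), so for full precision coef(score_model) can be used instead.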
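Using the values in the estimate column of the resulting regression table, we can then obtain the equation of the best-fitting regression line in the figure above:
$$\hat{y} = b_0 + b_1 \cdot x, \quad \text{that is,} \quad \widehat{\text{score}} = b_0 + b_1 \cdot \text{bty\_avg}$$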
Here,