Conducting Hypothesis Tests
Learn how to conduct hypothesis tests using the infer workflow.
Let’s learn how to seamlessly modify our previously seen infer
code for constructing confidence intervals to conduct hypothesis tests. The basic outline of the workflow is almost identical, except for an additional hypothesize()
step between the specify()
and generate()
steps, as seen in the figure below.
The infer
package workflow
The infer
package workflow uses intuitive function names that are verbs.
1. Specify variables
Recall that we use the specify()
verb to specify the response variable and, if needed, any explanatory variables for our study. In this case, because we’re interested in any potential effects of gender on promotion decisions, we set decision
as the response variable and gender
as the explanatory variable. We do this using formula = response ~ explanatory
, where response
is the name of the response variable in the data frame and explanatory
is the name of the explanatory variable. So in our case, it’s decision ~ gender
.
Furthermore, since we’re interested in the proportion of résumés "promoted"
, and not the proportion of résumés not promoted, we set the argument success
to "promoted"
.
promotions %>%
  specify(formula = decision ~ gender, success = "promoted")
Again, notice how the promotions data itself doesn’t change, but the Response: decision (factor)
and Explanatory: gender (factor)
metadata do. This is similar to how the group_by()
verb from dplyr
doesn’t change the data, but only adds grouping metadata.
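To make the comparison with group_by() concrete, here is a minimal sketch. It assumes the dplyr and infer packages are installed, and it uses a small made-up data frame, promotions_toy, as a hypothetical stand-in for the real promotions data.

```r
library(dplyr)
library(infer)

# Hypothetical stand-in for the promotions data (made-up values, not the study data)
promotions_toy <- data.frame(
  decision = factor(c("promoted", "not", "promoted", "promoted", "not", "not")),
  gender   = factor(c("male", "male", "male", "female", "female", "female"))
)

# specify() records the response/explanatory roles but keeps every row
specified <- promotions_toy %>%
  specify(formula = decision ~ gender, success = "promoted")

# group_by() likewise only attaches grouping metadata
grouped <- promotions_toy %>%
  group_by(gender)

nrow(specified) == nrow(promotions_toy)  # same number of rows
nrow(grouped) == nrow(promotions_toy)    # same number of rows
group_vars(grouped)                      # the grouping metadata that was added
```

In both cases the rows are left as they were; only the information attached to the data frame changes.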
2. Hypothesize the null hypothesis
In order to conduct hypothesis tests using the infer
workflow, we need a new step not present for confidence intervals: hypothesize(). Recall that our null hypothesis stated that there is no difference in promotion rates between résumés with male and female names. We set this null hypothesis in the infer workflow using the null
argument of the hypothesize()
function to either "point"
for hypotheses involving a single sample or "independence"
for hypotheses involving two samples.
In our case, because we have two samples (the résumés with male and female names), we set null = "independence"
.
promotions %>%
  specify(formula = decision ~ gender, success = "promoted") %>%
  hypothesize(null = "independence")
Again, the data hasn’t changed yet. This will occur at the upcoming generate()
step. Right now, we’re merely setting the metadata.
Where do the terms "point"
and "independence"
come from? These are two technical statistical terms. The term “point” relates to the fact that for a single group of observations, we test the value of a single point. Going back to the pennies example, say we wanted to test whether the mean year of all US pennies was equal to 1993. We would be testing the value of a single “point”, the population mean year μ: H0: μ = 1993 versus HA: μ ≠ 1993.
The term “independence” relates to the fact that for two groups of observations, we’re testing whether or not the response variable is independent of the explanatory variable that assigns the groups. In our case, we’re testing whether the decision
response variable is independent of the explanatory variable gender
that assigns each résumé to either of the two groups.
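For contrast with the "independence" null used in this lesson, the following sketch shows roughly what a "point" null for the pennies example might look like. It assumes the dplyr and infer packages are installed; pennies_toy is a hypothetical data frame with made-up years, not the actual pennies sample.

```r
library(dplyr)
library(infer)

# Hypothetical stand-in for a sample of pennies (made-up years)
pennies_toy <- data.frame(
  year = c(1988, 1990, 1993, 1995, 2001, 2004, 2010, 2015)
)

# One sample, one numeric response: the null fixes the mean at a single point
pennies_null <- pennies_toy %>%
  specify(response = year) %>%
  hypothesize(null = "point", mu = 1993)

# As with "independence", only metadata is set at this stage
nrow(pennies_null) == nrow(pennies_toy)
```

Note that a "point" null needs a hypothesized value (here mu = 1993), whereas an "independence" null needs none, because it only asserts that the response does not depend on the grouping variable.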
3. Generate replicates
After we hypothesize()
the null hypothesis, we generate()
replicates of shuffled datasets assuming the null hypothesis is true. We do this by repeating, many times over, the shuffling exercise we performed earlier. Instead of doing it only 16 times as our groups of friends did, let’s use the computer to repeat it 1,000 times by setting reps = 1000
in the generate()
function. We’ll now perform shuffles by setting type = "permute"
. This differs from confidence intervals, where we generated replicates using type = "bootstrap"
resampling with replacement. Recall that shuffles are a kind of resampling, but unlike the bootstrap method, they involve resampling without replacement.
promotions_generate <- promotions %>%
  specify(formula = decision ~ gender, success = "promoted") %>%
  hypothesize(null = "independence") %>%
  generate(reps = 1000, type = "permute")
nrow(promotions_generate)
Observe that the resulting data frame has 48,000 rows. This is because we shuffled all 48 rows 1,000 times, and 1,000 * 48 = 48,000. If we explore the promotions_generate
data frame with print()
, we’ll notice that the variable replicate
indicates which resample each row belongs to. So it has the value 1
48 times, the value 2
48 times, all the way through to the value 1000
48 times.
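The difference between shuffling and bootstrapping, and the reason the row count multiplies, can be sketched in base R without infer. This is only an illustration of the idea, not how generate() is implemented; the gender vector, the seed, and reps = 3 are made-up values chosen to keep the output small.

```r
set.seed(76)  # arbitrary seed so the sketch is reproducible

# Hypothetical group labels: 4 "male" and 4 "female"
gender <- rep(c("male", "female"), each = 4)

# Permutation: resampling WITHOUT replacement, so every label is used exactly once
permuted <- sample(gender)

# Bootstrap: resampling WITH replacement, so labels may repeat or be left out
bootstrapped <- sample(gender, replace = TRUE)

table(permuted)      # group sizes are always preserved
table(bootstrapped)  # group sizes can differ from the original

# Stacking several shuffles gives one block of rows per replicate
reps <- 3
shuffles <- do.call(rbind, lapply(seq_len(reps), function(i) {
  data.frame(replicate = i, gender = sample(gender))
}))

nrow(shuffles)             # length(gender) * reps
table(shuffles$replicate)  # each replicate id appears length(gender) times
```

The same bookkeeping explains the 48,000 rows above: each of the 1,000 replicates contributes one full shuffled copy of the 48 rows.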
4. Calculate summary statistics
Now that we’ve generated 1,000 replicates of shuffles, assuming the null hypothesis is true, let’s calculate()
the appropriate summary statistic for each of our 1,000 shuffles. Point estimates used in hypothesis testing have a specific name: test statistics. The test statistic here is the difference in sample proportions