Case Study: Are Action or Romance Movies Rated Higher?
Work through a case study that compares the IMDb ratings of action and romance movies.
Let’s apply our knowledge of hypothesis testing to answer the question, “Are action or romance movies rated higher on IMDb?” IMDb is a database on the internet providing information on movie and television show casts, plot summaries, trivia, and ratings. We’ll investigate if, on average, action or romance movies get higher ratings on IMDb.
IMDb ratings data
The movies dataset in the ggplot2movies package contains information on 58,788 movies that have been rated by users of IMDb.
movies
We’ll focus on a random sample of 68 movies that are classified as either action or romance movies but not both. We disregard movies that are classified as both so that each of the 68 movies belongs to exactly one category. Furthermore, because the original movies dataset was a little messy, we provide a pre-wrangled version of our data in the movies_sample data frame included in the moderndive package.
movies_sample
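To make the "action or romance, but not both" rule concrete, here is a minimal sketch. It assumes a data frame with 0/1 indicator columns named Action and Romance, as the movies data frame in ggplot2movies has. The helper name label_genre, the toy data, and the seed in the commented usage are hypothetical, and this is not necessarily how movies_sample was actually built.

```r
library(dplyr)

# Hypothetical helper: keep movies flagged as exactly one of Action/Romance
# and collapse the two 0/1 indicator columns into a single genre column.
label_genre <- function(df) {
  df %>%
    filter(Action + Romance == 1) %>%
    mutate(genre = if_else(Action == 1, "Action", "Romance")) %>%
    select(title, year, rating, genre)
}

# Tiny made-up data frame, only to show how the helper behaves.
toy_movies <- data.frame(
  title   = c("A", "B", "C", "D"),
  year    = c(1990, 1995, 2000, 2005),
  rating  = c(5.1, 6.4, 7.0, 4.2),
  Action  = c(1, 0, 1, 0),
  Romance = c(0, 1, 1, 0)
)
toy_labeled <- label_genre(toy_movies)
toy_labeled

# With the real data (assuming ggplot2movies is installed), one might run:
# set.seed(2024)  # arbitrary seed, for reproducibility only
# label_genre(ggplot2movies::movies) %>% slice_sample(n = 68)
```

Movie "C" (both genres) and movie "D" (neither genre) are dropped, so only movies that fall into exactly one category remain.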
The variables include the title and the year the movie was filmed. Furthermore, we have a numerical variable rating, which is the IMDb rating out of 10 stars, and a binary categorical variable genre indicating if the movie was an Action or Romance movie. We’re interested in whether Action or Romance movies got a higher rating on average.
Let’s perform an exploratory data analysis of this data. Recall that a boxplot is one way to visualize the relationship between a numerical and a categorical variable. Another option would be a faceted histogram. However, in the interest of brevity, let’s only present the boxplot in the figure below.
ggplot(data = movies_sample, aes(x = genre, y = rating)) +
  geom_boxplot() +
  labs(y = "IMDb rating")
The boxplot shows that the median sample rating is higher for romance movies. Do we have reason to believe, however, that there’s a significant difference between the mean rating for action movies and the mean rating for romance movies? It’s hard to say based on this plot alone, because there’s a large amount of overlap between the boxes. Recall also that the median isn’t necessarily the same as the mean; the two differ when the distribution is skewed.
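The faceted histogram mentioned earlier as an alternative could be sketched as follows. This assumes the ggplot2 and moderndive packages are installed. The object name rating_hist, the one-star bin width, and the stacked single-column layout are illustrative choices rather than anything prescribed by the lesson.

```r
library(ggplot2)
library(moderndive)  # provides the movies_sample data frame

# One histogram panel per genre, stacked vertically so the two rating
# distributions share an x-axis and are easy to compare.
# binwidth = 1 (one star per bin) is an arbitrary choice.
rating_hist <- ggplot(data = movies_sample, aes(x = rating)) +
  geom_histogram(binwidth = 1, color = "white") +
  facet_wrap(~ genre, ncol = 1) +
  labs(x = "IMDb rating", y = "Number of movies")

rating_hist
```

Unlike the boxplot, the histograms show the shape of each distribution directly, which helps in judging whether skew might pull the mean away from the median.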
Let’s calculate some summary statistics split by the binary categorical variable genre: the number of movies, the mean rating, and the standard deviation. We’ll do this using dplyr data wrangling verbs. Notice, in particular, how we count the number of each type of movie using the n() summary function.
movies_sample %>%
  group_by(genre) %>%
  summarize(n = n(), mean_rating = mean(rating), std_dev = sd(rating))
Observe that we have 36 romance movies with an average rating of 6.322 stars and 32 action movies with an average rating of 5.275 stars. The difference in these average ratings is 6.322 - 5.275 = 1.047. So there appears to be an edge of 1.047 stars in favor of romance movies. However, the question is, are these results indicative of a true difference for all romance and action movies? Or can we attribute this difference to chance sampling variation?
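The difference in sample means can also be computed directly rather than by hand. The sketch below assumes a data frame with a genre column (values "Action" and "Romance") and a numeric rating column, as described for movies_sample. The helper name diff_in_means and the toy data are hypothetical and serve only to check the arithmetic.

```r
library(dplyr)

# Hypothetical helper: mean rating of Romance minus mean rating of Action.
diff_in_means <- function(df) {
  means <- df %>%
    group_by(genre) %>%
    summarize(mean_rating = mean(rating), .groups = "drop")
  romance <- means$mean_rating[means$genre == "Romance"]
  action  <- means$mean_rating[means$genre == "Action"]
  romance - action
}

# Made-up ratings, only to check the arithmetic:
# (7 + 6) / 2 - (5 + 4) / 2 = 6.5 - 4.5 = 2
toy <- data.frame(
  genre  = c("Romance", "Romance", "Action", "Action"),
  rating = c(7, 6, 5, 4)
)
diff_in_means(toy)

# With the case-study data (assuming moderndive is installed):
# diff_in_means(moderndive::movies_sample)
```

Applied to movies_sample, this should give the same romance-minus-action difference as the summary table, up to rounding.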
Sampling scenario
Let’s now revisit this study in terms of terminology and notation related to sampling. The study population is all movies in the IMDb database that are either action or romance (but not both). The sample from this population is the 68 movies included in the movies_sample dataset.
Because this sample was taken at random from the population in movies, it should be representative of all romance and action movies on IMDb. Therefore, any analysis and results based on movies_sample can be generalized to the entire population. What are the relevant population parameters and point estimates? We introduce four sampling scenarios in the table below:
Scenarios of Sampling for Inference
Scenario | Population Parameter | Notation | Point Estimate | Symbol(s) |
1 | Population proportion | p | Sample proportion | p̂ |
2 | Population mean | μ | Sample mean | x̄ or μ̂ |
3 | Difference in population proportions | p1 - p2 | Difference in sample proportions | p̂1 - p̂2 |
4 | Difference in population means | μ1 - μ2 | Difference in sample means | x̄1 - x̄2 or μ̂1 - μ̂2 |
So, whereas the sampling bowl exercise was concerned with proportions, the pennies exercise with means, and both the yawning case study and the promotions activity with differences in proportions, we’re now concerned with differences in means.
In other words, the population parameter of interest is the difference in population mean ratings