Case Study: Are Action or Romance Movies Rated Higher?

This case study asks whether action or romance movies receive higher IMDb ratings on average.

We'll cover the following...

IMDb ratings data
Sampling scenario
Conducting the hypothesis test

The variables in the movies_sample dataset include the title and year the movie was filmed. We also have a numerical variable rating, which is the IMDb rating out of 10 stars, and a binary categorical variable genre indicating whether the movie is an Action or a Romance movie. We’re interested in whether Action or Romance movies receive a higher rating on average.
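To inspect these variables directly, here is a minimal sketch. It assumes that movies_sample is the data frame shipped with the moderndive package and that dplyr is installed for glimpse(); the lesson text itself doesn't name the package.

```r
# Assumption: movies_sample comes from the moderndive package.
library(dplyr)
library(moderndive)

# One row per movie; columns should include title, year, rating, and genre.
glimpse(movies_sample)
```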

Let’s perform an exploratory data analysis of this data. Recall that a boxplot is one way to visualize the relationship between a numerical and a categorical variable. A faceted histogram is another option, but in the interest of brevity, we’ll use only the boxplot.
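A boxplot of this kind could be sketched as follows. This is not necessarily the lesson's original code: it assumes ggplot2 is used for plotting and that movies_sample comes from the moderndive package, and the object name rating_boxplot is purely illustrative.

```r
# Assumptions: ggplot2 for plotting, movies_sample from moderndive.
library(ggplot2)
library(moderndive)

# One box per genre, with the IMDb rating on the vertical axis.
rating_boxplot <- ggplot(data = movies_sample, aes(x = genre, y = rating)) +
  geom_boxplot() +
  labs(x = "Genre", y = "IMDb rating")

rating_boxplot
```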

In the boxplot, romance movies have a higher median sample rating than action movies. Do we have reason to believe, however, that there’s a significant difference between the mean ratings of action and romance movies? It’s hard to say based on this plot alone.

Although the median is higher for romance movies, there’s a large amount of overlap between the boxes. Recall also that the median isn’t necessarily the same as the mean, particularly when the distribution is skewed.

Let’s calculate some summary statistics split by the binary categorical variable genre: the number of movies, the mean rating, and the standard deviation of the ratings. We’ll do this using dplyr data wrangling verbs. Notice, in particular, how we count the number of each type of movie using the n() summary function.
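One way to write this summary is sketched below. It assumes movies_sample comes from the moderndive package; the result name genre_summary and the column names n, mean_rating, and std_dev are illustrative choices, not names taken from the lesson.

```r
# Assumption: movies_sample comes from the moderndive package.
library(dplyr)
library(moderndive)

genre_summary <- movies_sample %>%
  group_by(genre) %>%
  summarize(
    n = n(),                     # number of movies in each genre
    mean_rating = mean(rating),  # sample mean rating
    std_dev = sd(rating)         # sample standard deviation of ratings
  )

genre_summary
```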

Observe that we have 36 romance movies with an average rating of 6.322 stars and 32 action movies with an average rating of 5.275 stars. The difference in these average ratings is 6.322 - 5.275 = 1.047, so romance movies appear to have an edge of 1.047 stars. However, are these results indicative of a true difference across all romance and action movies? Or can we attribute this difference to chance sampling variation?

Sampling scenario

Let’s now revisit this study in terms of terminology and notation related to sampling. The study population is all movies in the IMDb database that are either action or romance (but not both). The sample from this population is the 68 movies included in the movies_sample dataset.

This sample was randomly drawn from the population, so it’s representative of all romance and action movies on IMDb. Therefore, any analysis and results based on movies_sample can be generalized to the entire population. What are the relevant population parameters and point estimates? We introduce four sampling scenarios in the table below:

So, whereas the sampling bowl exercise was concerned with proportions, the pennies exercise with means, and both the yawning case study and the promotions activity with differences in proportions, we’re now concerned with differences in means.

In other words, the population parameter of interest is the difference in population mean ratings $\mu_a - \mu_r$, where $\mu_a$ is the mean rating of all action movies on IMDb, and similarly, $\mu_r$ is the mean rating of all romance movies. Additionally, the point estimate/sample statistic of interest is the difference in sample means $\bar{x}_a - \bar{x}_r$, where $\bar{x}_a$ is the mean rating of the $n_a = 32$ ...

Scenario	Population parameter	Notation	Point estimate	Symbol(s)
1	Population proportion	$p$	Sample proportion	$\hat{p}$
2	Population mean	$\mu$	Sample mean	$\bar{x}$ or $\hat{\mu}$
3	Difference in population proportions	$p_1 - p_2$	Difference in sample proportions	$\hat{p}_1 - \hat{p}_2$
4	Difference in population means	$\mu_1 - \mu_2$	Difference in sample means	$\bar{x}_1 - \bar{x}_2$ or $\hat{\mu}_1 - \hat{\mu}_2$
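
For scenario 4, the point estimate for this case study can be computed directly from the sample. The sketch below assumes the infer package is used and that movies_sample comes from the moderndive package; the name obs_diff_means is illustrative. The order argument is set to match $\mu_a - \mu_r$ (action minus romance), so the result is the negative of the 1.047-star romance edge quoted earlier.

```r
# Assumptions: infer for the calculation, movies_sample from moderndive.
library(dplyr)
library(infer)
library(moderndive)

obs_diff_means <- movies_sample %>%
  specify(formula = rating ~ genre) %>%  # response ~ explanatory
  calculate(
    stat = "diff in means",
    order = c("Action", "Romance")       # action mean minus romance mean
  )

obs_diff_means
```

Reversing order to c("Romance", "Action") would flip the sign of the statistic without changing its magnitude.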
