Case Study: Are Action or Romance Movies Rated Higher?

Work through a case study that compares the IMDb ratings of action and romance movies.

Let’s apply our knowledge of hypothesis testing to answer the question, “Are action or romance movies rated higher on IMDb?” IMDb is a database on the internet providing information on movie and television show casts, plot summaries, trivia, and ratings. We’ll investigate if, on average, action or romance movies get higher ratings on IMDb.

IMDb ratings data

The movies dataset in the ggplot2movies package contains information on 58,788 movies that have been rated by users of IMDb.

movies

We’ll focus on a random sample of 68 movies that are classified as either action or romance movies but not both. We disregard movies that are classified as both so that each of the 68 movies falls into exactly one category. Furthermore, because the original movies dataset was a little messy, we provide a pre-wrangled version of our data in the movies_sample data frame included in the moderndive package.

movies_sample

The variables include the title and year the movie was filmed. Furthermore, we have a numerical variable rating, which is the IMDb rating out of 10 stars, and a binary categorical variable genre indicating if the movie was an Action or Romance movie. We’re interested in whether Action or Romance movies got a higher rating on average.
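The selection rule described above (action or romance, but not both) can be illustrated with a short sketch. This is not the wrangling that produced movies_sample; it is only one plausible way to express the rule, assuming the 0/1 indicator columns Action and Romance in the movies data frame. The names action_or_romance and my_sample and the seed value are hypothetical, and the resulting rows will generally differ from those in movies_sample.

```r
library(dplyr)
library(ggplot2movies)

# Keep movies flagged as exactly one of Action or Romance (exclusive or),
# then collapse the two indicator columns into a single genre variable.
action_or_romance <- movies %>%
  filter(xor(Action == 1, Romance == 1)) %>%
  mutate(genre = if_else(Action == 1, "Action", "Romance")) %>%
  select(title, year, rating, genre)

# Draw a random sample of 68 movies. The seed is arbitrary (an assumption),
# so this will NOT reproduce the rows in movies_sample.
set.seed(2024)
my_sample <- action_or_romance %>%
  slice_sample(n = 68)

my_sample
```

Because xor() is TRUE only when exactly one of its arguments is TRUE, movies flagged as both genres (or neither) are dropped before sampling.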

Let’s perform an exploratory data analysis of this data. Recall that a boxplot is one way to visualize the relationship between a numerical and a categorical variable. Another option is a faceted histogram. In the interest of brevity, however, we present only the boxplot in the figure below.

ggplot(data = movies_sample, aes(x = genre, y = rating)) +
  geom_boxplot() +
  labs(y = "IMDb rating")

In the boxplot produced by the code above, romance movies have a higher median sample rating. However, there’s a large amount of overlap between the boxes. Recall also that the median isn’t necessarily the same as the mean, particularly when the distribution is skewed.

So do we have reason to believe that there’s a significant difference between the mean rating for action movies and the mean rating for romance movies? It’s hard to say based on this plot alone.
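The faceted histogram mentioned earlier as an alternative to the boxplot could be sketched as follows. It can help judge whether either distribution is skewed, which is what determines how far the mean sits from the median. The bin width of 1 star and the axis labels are assumptions chosen for illustration, and rating_hist is a hypothetical name.

```r
library(ggplot2)
library(moderndive)

# One histogram of ratings per genre, shown side by side.
# binwidth = 1 is an assumed choice; other widths may reveal more detail.
rating_hist <- ggplot(data = movies_sample, aes(x = rating)) +
  geom_histogram(binwidth = 1, color = "white") +
  facet_wrap(~ genre) +
  labs(x = "IMDb rating", y = "Number of movies")

rating_hist
```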

Let’s calculate some summary statistics split by the binary categorical variable genre: the number of movies, the mean rating, and the standard deviation. We’ll do this using dplyr data wrangling verbs. Notice, in particular, how we count the number of each type of movie using the n() summary function.

movies_sample %>%
  group_by(genre) %>%
  summarize(n = n(), mean_rating = mean(rating), std_dev = sd(rating))

Observe that we have 36 romance movies with an average rating of 6.322 stars and 32 action movies with an average rating of 5.275 stars. The difference in these average ratings is 6.322 - 5.275 = 1.047. So there appears to be an edge of 1.047 stars in favor of romance movies. However, the question is, are these results indicative of a true difference for all romance and action movies? Or can we attribute this difference to chance sampling variation?
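As a quick check of the arithmetic, the sketch below recomputes the observed difference from the summary values quoted above rather than from the raw data. The pairing of the counts and means with each genre follows from the stated edge in favor of romance movies. The object names are hypothetical. Note that the sign of the difference depends on the order of subtraction, which matters once a parameter such as "action minus romance" is fixed.

```r
# Summary values copied from the text above (not recomputed from the data).
genre_summary <- data.frame(
  genre = c("Action", "Romance"),
  n = c(32, 36),
  mean_rating = c(5.275, 6.322)
)

mean_action <- genre_summary$mean_rating[genre_summary$genre == "Action"]
mean_romance <- genre_summary$mean_rating[genre_summary$genre == "Romance"]

# Same magnitude, opposite sign, depending on the order of subtraction.
diff_romance_minus_action <- mean_romance - mean_action
diff_action_minus_romance <- mean_action - mean_romance

# The two group sizes should add up to the full sample of 68 movies.
total_n <- sum(genre_summary$n)

diff_romance_minus_action
diff_action_minus_romance
total_n
```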

Sampling scenario

Let’s now revisit this study in terms of terminology and notation related to sampling. The study population is all movies in the IMDb database that are either action or romance (but not both). The sample from this population is the 68 movies included in the movies_sample dataset.

This sample was randomly taken from the population movies, so it’s representative of all romance and action movies on IMDb. Therefore, any analysis and results based on movies_sample can be generalized to the entire population. What are the relevant population parameters and point estimates? We introduce four sampling scenarios in the table below:

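Scenarios of Sampling for Inference

| Scenario | Population Parameter | Notation | Point Estimate | Symbol(s) |
|---|---|---|---|---|
| 1 | Population proportion | $p$ | Sample proportion | $\hat{p}$ |
| 2 | Population mean | $\mu$ | Sample mean | $\bar{x}$ or $\hat{\mu}$ |
| 3 | Difference in population proportions | $p_1 - p_2$ | Difference in sample proportions | $\hat{p}_1 - \hat{p}_2$ |
| 4 | Difference in population means | $\mu_1 - \mu_2$ | Difference in sample means | $\bar{x}_1 - \bar{x}_2$ or $\hat{\mu}_1 - \hat{\mu}_2$ |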

So, whereas the sampling bowl exercise was concerned with proportions, the pennies exercise was concerned with means, and the case study on whether yawning is contagious and the promotions activity were concerned with differences in proportions, we’re now concerned with differences in means.

In other words, the population parameter of interest is the difference in population mean ratings $\mu_a - \mu_r$, where $\mu_a$ is the mean rating of all action movies on IMDb, and similarly, $\mu_r$ is the mean rating of all romance movies. Additionally, the point estimate/sample statistic of interest is the difference in sample means $\bar{x}_a - \bar{x}_r$, where $\bar{x}_a$ ...