Task: Netflix Shows and Movies
In this project, you will practice all the concepts you learned in this course.
Our data and objective
We have data about Netflix movies and shows. The dataset covers various attributes of each title, such as ID, title, production location, release year, genre, TMDB score, and IMDb score.
Our objective is to create visualizations about the overall quality of movies and shows made in different countries.
Note: Determining the objective is crucial because we will filter and clean the data according to our goal.
Here is a preview of the CSV file:
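If you want to generate a similar preview yourself, head() from base R prints the first few rows of a data frame. This is a minimal, optional sketch; the file path simply mirrors the one used in the next step:

preview <- read.csv("Documents/movies_shows.csv") # Same path as in the exploration step below.
head(preview) # Print the first six rows of the data frame as a quick preview.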
Explore the data
We need to get to know the data before we work on it. This helps us create our strategy and use our time and energy efficiently.
Our initial data exploration involves checking the number of rows and columns, the column names, and the number of null values in the dataset.
data <- read.csv("Documents/movies_shows.csv") # Read the CSV file.
print(" ----- The names of the columns in the dataset. -----")
colnames(data) # The names of the columns in the dataset.
print("----- The number of columns in the dataset. ----- ")
ncol(data) # The number of columns in the dataset.
print("----- The number of rows in the dataset. ----- ")
nrow(data) # The number of rows in the dataset.
print("----- The number of null values in the data frame. -----")
sum(is.na(data)) # The number of null values in the data frame
- Line 1: We read the data from our CSV file.
- Line 3: We check the column names in the data frame so that we can select the columns that are useful for us.
- Lines 5-7: We check the number of rows and columns, which gives us an idea of the size of the data we are dealing with. The data frame has 5,850 rows and 10 columns.
- Line 9: We check the number of null values in the data frame. There are 793 null values in the data frame.
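As an optional shortcut that is not part of the project code, base R can produce a similar overview with fewer calls; the exact output depends on the file you loaded:

dim(data) # Both dimensions in one call (number of rows, then number of columns).
str(data) # Column names, column types, and a few sample values per column.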
Next, we will drop all rows containing null values and check how many rows remain.
no_null <- na.omit(data) # Remove the null values from the data frame.
print("----- The number of rows without null values. ----- ")
nrow(no_null) # Check the number of remaining rows.
print("----- Find the ratio of the rows with null values to all. ----- ")
1 - (nrow(no_null)/nrow(data)) # Find the ratio of rows with null values.
We found that 12% of the rows include null values. Simply removing them all would cost us a sizeable chunk of valuable data, so our solution is to replace the nulls with the most likely values instead. Let's check how many null values each column contains.
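As an aside, here is a minimal sketch of what "replacing nulls with the most likely values" can look like in practice, using per-column median imputation in base R. This is illustrative only; the imputation strategy actually used later in the project may differ.

imputed <- data # Work on a copy; this sketch is not the project's final approach.
imputed$imdb_score[is.na(imputed$imdb_score)] <- median(data$imdb_score, na.rm = TRUE) # Fill missing IMDb scores with the column median.
imputed$tmdb_score[is.na(imputed$tmdb_score)] <- median(data$tmdb_score, na.rm = TRUE) # Fill missing TMDB scores with the column median.
sum(is.na(imputed$imdb_score)) # No nulls remain in this column after imputation.

Before imputing anything, though, we need to know which columns actually contain the nulls, which is what the loop below checks.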
# Define a for loop to see the number of null values in each column
for (x in colnames(data)) {
  print(paste(x, "column includes", sum(is.na(data[x])), "null values"))
}
There are only two columns that include null values, tmdb_score and imdb_score. We'll try to impute the values in these columns as much as possible. Let's check the number of rows that include null values in both tmdb_score and imdb_score.
library(dplyr) # Call the dplyr package to work on the data frame.
# Filter the rows that include null values for both `imdb_score` and `tmdb_score` columns
double_nulls <- data %>% filter(is.na(imdb_score) & is.na(tmdb_score))
print(' Number of rows that include nulls for both `imdb_score` and `tmdb_score`:')
nrow(double_nulls) # Calculate the number of rows.
print(' The ratio of the deletable rows to all:')
paste(round((nrow(double_nulls) / nrow(data) * 100), 2), '%')
- Line 3: We filter the rows that include nulls for both ratings.
- Line 5: We find the number of rows in the filtered data.
We have 88 rows that include nulls for both imdb_score and tmdb_score. We will simply filter out these rows since they constitute ...
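A minimal sketch of that filtering step, keeping only the rows where at least one of the two scores is present (the project's actual code may differ slightly), reusing the dplyr pipeline loaded above:

# Keep rows where at least one of `imdb_score` and `tmdb_score` is present.
data_filtered <- data %>% filter(!(is.na(imdb_score) & is.na(tmdb_score)))
nrow(data_filtered) # 5,850 - 88 = 5,762 rows remain, given the counts above.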