Task: Netflix Shows and Movies

In this project, you'll practice all the concepts you learned in this course.

Our data and objective

We have data about Netflix movies and shows. The dataset covers various attributes of each movie and show, such as ID, title, production location, release year, genre, TMDB score, and IMDb score.

Our objective is to create visualizations about the overall quality of movies and shows made in different countries.

Note: Determining the objective is crucial because we will filter and clean the data according to our goal.

A preview of the CSV file is shown at this point in the lesson.
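
To get a similar first look locally, here is a minimal sketch (our own snippet, reusing the Documents/movies_shows.csv path from the exploration code below):

data <- read.csv("Documents/movies_shows.csv") # Read the CSV file.
head(data) # Print the first six rows as a quick preview.
str(data) # Print each column's name, type, and a few sample values.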

Explore the data

We need to get to know the data before we work on it. This helps us create our strategy and use our time and energy efficiently.

Our initial data exploration involves checking the number of rows and columns, the column names, and the number of null values in the dataset.

data <- read.csv("Documents/movies_shows.csv") # Read the CSV file.
print(" ----- The names of the columns in the dataset. -----")
colnames(data) # The names of the columns in the dataset.
print("----- The number of columns in the dataset. ----- ")
ncol(data) # The number of columns in the dataset.
print("----- The number of rows in the dataset. ----- ")
nrow(data) # The number of rows in the dataset.
print("----- The number of null values in the data frame. -----")
sum(is.na(data)) # The number of null values in the data frame
  • Line 1: We read the data from our CSV file.

  • Line 3: We check the column names in the data frame so that we can select the columns that are useful for us.

  • Lines 5 and 7: We check the number of columns and rows, which gives us an idea of the size of the data we are dealing with. The data frame has 10 columns and 5,850 rows.

  • Line 9: We check the number of null values in the data frame. There are 793 null values in the data frame.

Now, we will check the number of rows without null values by dropping all null values.

no_null <- na.omit(data) # Remove the null values from the data frame
print("----- The number of rows without null values. ----- ")
nrow(no_null) # Check the number of remaining rows.
print("----- Find the ratio of the rows with null values to all. ----- ")
1 - (nrow(no_null)/nrow(data)) # Find the ratio of rows with null values.

We found that about 12% of the rows include null values. Simply removing all of them would cost us a big chunk of valuable data, so our solution is to replace the nulls with the most likely values. Let’s check how many null values each column contains.

# Define a for loop to see the number of null values in each column
for (x in colnames(data)) {
  print(paste(x, "column includes", sum(is.na(data[x])), "null values"))
}
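
As a side note, the same per-column counts can be produced with a single base R call (this compact alternative is our addition, not part of the lesson's code):

colSums(is.na(data)) # Count the null values in each column at once.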

Only two columns contain null values: tmdb_score and imdb_score. We’ll try to impute the missing values in these columns as much as possible. Let’s check the number of rows that have null values in both tmdb_score and imdb_score.

library(dplyr) # Call the dplyr package to work on the data frame
# Filter the rows that include null values for both `imdb_score` and `tmdb_score` columns
double_nulls <- data %>% filter(is.na(imdb_score) & is.na(tmdb_score))
print(' Number of rows that include nulls for both `imdb_score` and `tmdb_score`:')
nrow(double_nulls) # Calculate the number of rows
print(' The ratio of the deletable rows to all:')
paste(round((nrow(double_nulls) / nrow(data) * 100),2),'%')
  • Line 3: We filter the rows that include nulls for both ratings.

  • Line 5: We find the number of rows in the filtered data.

We have 88 rows that include nulls for both imdb_score and tmdb_score. We will simply filter out these rows since they constitute only about 1.5% of the data.
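
As a rough sketch of where this is heading (our own illustration, not the lesson's exact code; the median-based fill is an assumption about what "most likely values" means), we could drop the double-null rows and then impute the remaining missing scores:

library(dplyr) # Already loaded above; repeated here so the snippet is self-contained.
# Keep only the rows where at least one of the two scores is present.
cleaned <- data %>% filter(!(is.na(imdb_score) & is.na(tmdb_score)))
# Fill the remaining missing scores with each column's median (an assumed strategy).
cleaned <- cleaned %>% mutate(
  imdb_score = ifelse(is.na(imdb_score), median(imdb_score, na.rm = TRUE), imdb_score),
  tmdb_score = ifelse(is.na(tmdb_score), median(tmdb_score, na.rm = TRUE), tmdb_score)
)
sum(is.na(cleaned)) # Should now be 0.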