Exploratory Data Analysis: Visualization
Explore the Titanic training data further using ggplot2 visualizations.
Tables are data visualizations
When most people think of data visualization, they picture histograms, bar charts, box plots, and the like. However, tables of data are also visualizations. In particular, tables are one of the best visualizations for examining individual values.
There are some open questions regarding the Titanic training data. The best way to explore these questions is to display the sample data as a table.
The code below uses the tibble package's print() method to display the data. Run it and examine the output.
#================================================================================================
# Load libraries - suppress messages
#
suppressMessages(library(tidyverse))

#================================================================================================
# Load Titanic training data
#
titanic_train <- read_csv("titanic_train.csv", show_col_types = FALSE)

#================================================================================================
# Look at the first 10 rows of data as a table
#
titanic_train %>%
  print(width = 85, n = 10)
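The n and width arguments passed to print() above control how many rows and how many characters of width are displayed. As a minimal sketch of their effect, the code below uses a small made-up tibble (toy_passengers is a hypothetical stand-in, not the real Titanic data) so that it runs without the CSV file.

```r
suppressMessages(library(tidyverse))

# Hypothetical stand-in data, made up for illustration only.
# It is NOT taken from titanic_train.csv.
toy_passengers <- tibble(
  PassengerId = 1:5,
  Name        = c("Passenger A", "Passenger B", "Passenger C",
                  "Passenger D", "Passenger E"),
  Ticket      = c("T100", "T100", "T200", "T300", "T400"),
  Age         = c(22, 38, NA, 35, 54)
)

# n limits the number of rows shown; the remaining rows are summarized in a footer
toy_passengers %>%
  print(n = 2)

# n = Inf shows every row, and width = Inf shows every column
toy_passengers %>%
  print(n = Inf, width = Inf)
```

With the real data, a fixed width such as 85 keeps the table within the page, while any columns that do not fit are listed below the table rather than printed.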
Looking at this output, several things become clear:
Look at the PassengerId data. The first ten rows of data show, combined with the previous lesson's profiling, that it's an identifier with a range of [1, 891]. Unique identifiers like this are not predictive features. Therefore, PassengerId shouldn't be used to train a machine learning model.
The Name feature is similar to PassengerId. While not specifically an identifier, the fact that all Name values are unique means the feature cannot be used as is. However, there is valuable information contained in passenger names. Name will be used later in the course to create new predictive features.
The last lesson showed that some values of Ticket were shared among passengers. The output above shows many different Ticket numbers being used. The profiling of the last lesson counted 681 unique values for Ticket. This is too many values for the feature to be used directly.
While Age could be potentially helpful in a model, it is missing values for some passengers, so it cannot be used directly. Imputation techniques for replacing ...
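The observations above about unique identifiers, shared tickets, and missing ages can also be checked with a few dplyr summaries. The following is a minimal sketch, assuming a small made-up tibble (toy_train is hypothetical and its values are invented for illustration). With the real data, the same pipeline would be applied to titanic_train.

```r
suppressMessages(library(tidyverse))

# Hypothetical stand-in data, made up for illustration only.
# It is NOT taken from titanic_train.csv.
toy_train <- tibble(
  PassengerId = 1:6,
  Name        = c("Passenger A", "Passenger B", "Passenger C",
                  "Passenger D", "Passenger E", "Passenger F"),
  Ticket      = c("T100", "T100", "T200", "T300", "T300", "T400"),
  Age         = c(22, 38, NA, 35, NA, 54)
)

# Count rows, distinct values, and missing values for the columns discussed
toy_summary <- toy_train %>%
  summarize(
    rows           = n(),
    unique_ids     = n_distinct(PassengerId),
    unique_names   = n_distinct(Name),
    unique_tickets = n_distinct(Ticket),
    missing_age    = sum(is.na(Age))
  )

print(toy_summary)

# List the ticket values that appear on more than one row
shared_tickets <- toy_train %>%
  count(Ticket, sort = TRUE) %>%
  filter(n > 1)

print(shared_tickets)
```

A column whose distinct count equals the row count behaves like an identifier, which is the reasoning applied to PassengerId and Name above.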