Exploratory Data Analysis: Visualization
Explore the Titanic training data further using ggplot2 visualizations.
Tables are data visualizations
When most people visualize data, they think of histograms, bar charts, box plots, etc. However, tables of data are also visualizations. In particular, tables are one of the best visualizations when examining individual values.
There are some open questions regarding the Titanic training data. The best way to explore these questions is to display the sample data as a table.
The code below uses the print() function from the tibble package to display the data. Run the code below and examine the output.
    #================================================================================================
    # Load libraries - suppress messages
    #
    suppressMessages(library(tidyverse))

    #================================================================================================
    # Load Titanic training data
    #
    titanic_train <- read_csv("titanic_train.csv", show_col_types = FALSE)

    #================================================================================================
    # Look at the first 10 rows of data as a table
    #
    titanic_train %>%
      print(width = 85, n = 10)
Looking at this output, several things become clear:
Look at the PassengerId data. The first ten rows of data show, combined with the previous lesson’s profiling, that it’s an identifier with a range of [1, 891]. Unique identifiers like this are not predictive features. Therefore, PassengerId shouldn’t be used to train a machine learning model.
The Name feature is similar to PassengerId. While not specifically an identifier, the fact that all Name values are unique means the feature cannot be used as is. However, there is valuable information contained in passenger names. Name will be used later in the course to create new predictive features.
The last lesson showed that some values of Ticket were shared among passengers. The output above shows many different Ticket numbers being used. The profiling of the last lesson counted 681 unique values for ...