Exploratory Data Analysis: Visualization

Explore the Titanic training data further using ggplot2 visualizations.

Tables are data visualizations

When most people visualize data, they think of histograms, bar charts, box plots, etc. However, tables of data are also visualizations. In particular, tables are one of the best visualizations when examining individual values.

There are some open questions regarding the Titanic training data. The best way to explore these questions is to display the sample data as a table.

The code below uses the print() method that the tibble package provides to display the data. Run it and examine the output.

#================================================================================================
# Load libraries - suppress messages
#
suppressMessages(library(tidyverse))
#================================================================================================
# Load Titanic training data
#
titanic_train <- read_csv("titanic_train.csv", show_col_types = FALSE)
#================================================================================================
# Look at the first 10 rows of data as a table
#
titanic_train %>%
  print(width = 85, n = 10)
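
The width and n arguments used above control how much of a tibble is printed. As a minimal sketch of their effect, the example below builds a small made-up tibble (the name toy and its columns are hypothetical and not part of the Titanic data) and prints it in two ways. It assumes only that the tibble package is installed.

```r
# Load only the tibble package; the lesson itself loads the full tidyverse.
library(tibble)

# Hypothetical toy data for illustration, not taken from titanic_train.csv.
toy <- tibble(
  id    = 1:15,
  label = paste0("row_", 1:15),
  value = seq(0.5, 7.5, by = 0.5)
)

# Show only the first 5 rows; the footer reports how many rows were omitted.
print(toy, n = 5)

# n = Inf prints every row, and width = Inf prints every column
# regardless of how wide the console is.
print(toy, n = Inf, width = Inf)
```

A fixed width such as 85 keeps the output readable in a narrow display, while width = Inf is useful when every column needs to be visible at once.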

Looking at this output, several things become clear:

  • Consider PassengerId. The first ten rows, combined with the previous lesson’s profiling, show that it’s a unique identifier with a range of [1, 891]. Unique identifiers like this are not predictive features, so PassengerId shouldn’t be used to train a machine learning model.

  • The Name feature is similar to PassengerId. While not specifically an identifier, the fact that all Name values are unique means the feature cannot be used as is. However, there is valuable information contained in passenger names. Name will be used later in the course to create new predictive features.

  • The last lesson showed that some values of Ticket were shared among passengers. The output above shows many different Ticket numbers being used. The profiling of the last lesson counted 681 unique values for ...