The Dataset and Exploratory Data Analysis
Learn how to read the dataset and perform exploratory data analysis.
Let's explore the Titanic dataset, one of the best-known benchmark datasets in machine learning. It is often used as a first step toward learning classification.
Dataset
The Titanic dataset has the features listed below. We want to predict whether a passenger survived, so the target is the Survived column.
Data dictionary
PassengerId: Passenger ID
Pclass: Ticket class, where 1 = 1st, 2 = 2nd, and 3 = 3rd
Name: Passenger name
Sex: Male/female
Age: Age in years
SibSp: Number of siblings and/or spouses aboard the Titanic
Parch: Number of parents and/or children aboard the Titanic
Ticket: Ticket number
Fare: Passenger fare
Cabin: Cabin number
Embarked: Port of embarkation, where C = Cherbourg, Q = Queenstown, and S = Southampton
Survived: 0 = No and 1 = Yes
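The coded columns in the data dictionary (Pclass, Embarked, and Survived) are easier to read during EDA once they are mapped to labels. Here is a minimal sketch of one way to do that with pandas, assuming the column names above. The helper name decode_codes and the sample rows are hypothetical and are not taken from the dataset.

```python
import pandas as pd

# Code-to-label mappings copied from the data dictionary.
PCLASS_LABELS = {1: "1st", 2: "2nd", 3: "3rd"}
EMBARKED_LABELS = {"C": "Cherbourg", "Q": "Queenstown", "S": "Southampton"}
SURVIVED_LABELS = {0: "No", 1: "Yes"}


def decode_codes(df: pd.DataFrame) -> pd.DataFrame:
    """Return a copy of df with extra *_label columns holding readable values."""
    out = df.copy()
    out["Pclass_label"] = out["Pclass"].map(PCLASS_LABELS)
    out["Embarked_label"] = out["Embarked"].map(EMBARKED_LABELS)
    out["Survived_label"] = out["Survived"].map(SURVIVED_LABELS)
    return out


# Made-up rows for illustration only; these are not records from the dataset.
sample = pd.DataFrame(
    {"Pclass": [1, 3, 2], "Embarked": ["C", "S", "Q"], "Survived": [1, 0, 1]}
)
decoded = decode_codes(sample)
print(decoded)
```

The original coded columns are left untouched, so they can still be fed to a model later. The label columns are only meant for tables and plots.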
The goal is to predict whether a passenger survived the sinking of the Titanic. First, we’ll do some exploratory data analysis (EDA), and then we’ll use our understanding of logistic regression to train a classification model on the training part of the dataset. We’ll then use the trained model to make predictions for the test dataset, which the model has not seen, and check how well it performs on that unseen data. The data comes in three files:
train_titanic_Xy.csv: Train set with features and the target
test_titanic_X.csv: Test set with features only
test_titanic_y.csv: Test set with target only
Let’s import some libraries and read the training part of the dataset into train.
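A minimal sketch of this step, assuming pandas is the library used and that train_titanic_Xy.csv sits in the working directory. The helper names read_split and quick_summary are hypothetical. The fallback rows are made up so that the sketch still runs when the file is absent.

```python
import io
import os

import pandas as pd


def read_split(source) -> pd.DataFrame:
    """Read one CSV split (a file path or file-like object) into a DataFrame."""
    return pd.read_csv(source)


def quick_summary(df: pd.DataFrame) -> pd.DataFrame:
    """Per-column dtype and missing-value count, a common first EDA step."""
    return pd.DataFrame({"dtype": df.dtypes.astype(str), "missing": df.isna().sum()})


TRAIN_PATH = "train_titanic_Xy.csv"  # assumed location of the training file

if os.path.exists(TRAIN_PATH):
    train = read_split(TRAIN_PATH)
else:
    # Fallback so the sketch runs anywhere: two made-up rows, not real data.
    demo_csv = io.StringIO(
        "PassengerId,Pclass,Sex,Age,Embarked,Survived\n"
        "1,3,male,,S,0\n"
        "2,1,female,30,C,1\n"
    )
    train = read_split(demo_csv)

print(train.head())
print(quick_summary(train))
```

The missing-value counts show which columns need imputation or dropping before a model can be trained.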