The Dataset and Exploratory Data Analysis

Learn how to read the dataset and perform exploratory data analysis.

Let's explore the Titanic dataset, one of the best-known benchmark datasets in machine learning. It's often used as a first step into classification.

Dataset

In the Titanic dataset, we have the following features. We want to predict if the passenger survived or not. Therefore, the target will be the Survived column.

Data dictionary

  • PassengerId: Passenger ID

  • Pclass: Ticket class, where 1 = 1st, 2 = 2nd, and 3 = 3rd

  • Name: Passenger name

  • Sex: Male/female

  • Age: Age in years

  • SibSp: Number of siblings and/or spouses aboard the Titanic

  • Parch: Number of parents and/or children aboard the Titanic

  • Ticket: Ticket number

  • Fare: Passenger fare

  • Cabin: Cabin number

  • Embarked: Port of embarkation, where C = Cherbourg, Q = Queenstown, and S = Southampton

  • Survived: 0 = No, and 1 = Yes

The goal here is to predict whether a passenger survived the sinking of the Titanic. First, we’ll do some exploratory data analysis (EDA). Then we’ll use our understanding of logistic regression to train a classification model on the training part of the dataset. Finally, we’ll use the trained model to make predictions for the test dataset, which the model has not seen during training, and measure how well it performs on that unseen data.

The data is split across three files:

  • train_titanic_Xy.csv: Train set with features and the target

  • test_titanic_X.csv: Test set with features only

  • test_titanic_y.csv: Test set with target only
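
The train-then-predict workflow described above can be sketched roughly as follows. This is only an illustration, assuming pandas and scikit-learn are installed. The helper name fit_and_evaluate and the choice of four numeric columns are hypothetical, picked to keep the example short, and are not the lesson's own code.

```python
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score

# Assumption: a small set of numeric columns, chosen only to keep the sketch short.
NUMERIC_FEATURES = ["Pclass", "SibSp", "Parch", "Fare"]


def fit_and_evaluate(train: pd.DataFrame, test_X: pd.DataFrame,
                     test_y: pd.Series, features=NUMERIC_FEATURES):
    """Fit logistic regression on the train set and score it on the test set.

    Hypothetical helper: returns (model, predictions, accuracy).
    """
    cols = list(features)

    # Fill missing values with medians computed on the training data only,
    # so no information from the test set leaks into the fit.
    medians = train[cols].median()
    X_train = train[cols].fillna(medians)
    y_train = train["Survived"]

    model = LogisticRegression(max_iter=1000)
    model.fit(X_train, y_train)

    # The test features are not used during fitting.
    predictions = model.predict(test_X[cols].fillna(medians))
    return model, predictions, accuracy_score(test_y, predictions)
```

With the three files loaded through pd.read_csv, the call would look like fit_and_evaluate(train, test_X, test_y), where test_y is the target column from test_titanic_y.csv (assumed here to be named Survived, like the target in the train set).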

Let’s import some libraries and read the training part of the dataset into a variable named train.
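
A minimal sketch of this step, assuming pandas is installed and train_titanic_Xy.csv sits in the working directory. The helper names load_train and summarize and the default path are assumptions made for illustration.

```python
import pandas as pd

# Columns listed in the data dictionary above.
EXPECTED_COLUMNS = [
    "PassengerId", "Pclass", "Name", "Sex", "Age", "SibSp",
    "Parch", "Ticket", "Fare", "Cabin", "Embarked", "Survived",
]


def load_train(source="train_titanic_Xy.csv"):
    """Read the training set into a DataFrame (hypothetical helper).

    `source` may be a file path or any file-like object accepted by
    pandas.read_csv. The default path is an assumption.
    """
    return pd.read_csv(source)


def summarize(train):
    """Return a few basic EDA summaries as a dict."""
    return {
        # (number of rows, number of columns)
        "shape": train.shape,
        # Count of missing values per column
        "missing": train.isnull().sum(),
        # Share of rows with Survived == 1
        "survival_rate": train["Survived"].mean(),
        # Columns present in the file but absent from the data dictionary
        "unexpected_columns": [c for c in train.columns if c not in EXPECTED_COLUMNS],
    }
```

Calling train = load_train() and then train.head() shows the first few rows, while summarize(train) gives a quick view of the dataset's size, its missing values, and the overall survival rate before any modeling.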
