The Course Datasets

Explore the main datasets used in this course, including the Adult Census Income and Titanic data. Understand their characteristics, how they are structured, and their role in building predictive machine learning models with R. Gain familiarity with data types and common terminology essential for successful data analysis and model development.

We'll cover the following...

The Adult Census Income dataset
The Titanic dataset
Data basics

The Adult Census Income dataset

This dataset is used throughout the course for lesson examples. It was extracted from the 1994 US Census Bureau database and was created to explore the following question:

What characteristics are associated with each income level?
- Less than or equal to $50,000 USD per year
- More than $50,000 USD per year

Each row of the Adult Census Income dataset represents a US resident, and the columns represent the characteristics of that US resident. Here are some example characteristics of the dataset:

age: Age measured in years.
education: The highest level of education attained.
sex: Gender denoted as female or male.
hours_per_week: The number of hours worked at a job each week.
income: Yearly income denoted as <=50K or >50K.

Lessons throughout the course will use samples from the Adult Census Income dataset. In all cases, the goal is to use characteristics in the dataset (e.g., age) to predict income level (i.e., income).
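
As a minimal sketch of that setup, the R code below builds a small data frame from the five rows listed in this lesson's Adult Census Income sample data and separates the predictor columns from the income label. The object names (adult_sample, predictors, label) are illustrative assumptions, not part of the course materials.

```r
# Five sample rows from the Adult Census Income data (illustrative subset).
adult_sample <- data.frame(
  age            = c(39, 50, 38, 53, 28),
  education      = c("Bachelors", "Bachelors", "HS-grad", "11th", "Bachelors"),
  sex            = c("Male", "Male", "Male", "Male", "Female"),
  hours_per_week = c(40, 13, 40, 40, 40),
  income         = c("<=50K", "<=50K", "<=50K", "<=50K", "<=50K"),
  stringsAsFactors = FALSE
)

# The label is the column to predict; it has two possible levels.
adult_sample$income <- factor(adult_sample$income, levels = c("<=50K", ">50K"))

# Every column except the label can serve as a predictor.
predictors <- adult_sample[, setdiff(names(adult_sample), "income")]
label <- adult_sample$income

str(predictors)  # structure of the four predictor columns
table(label)     # count of rows per income level
```

All five sample rows happen to fall in the <=50K group, so a sample this small cannot support a model on its own. It only shows how predictors and label are laid out.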

More information on this dataset is available at the UCI Machine Learning Repository.

The Titanic dataset

This dataset will be used for the interactive coding aspects of the course. The dataset was created to explore the following question:

What Titanic passenger characteristics are associated with survival?

Each row of the Titanic dataset represents a passenger, and the columns represent the characteristics of the passenger. The following are the characteristics of the dataset:

Survived: Passenger survival. Values: 0 = no and 1 = yes.
Pclass: Class of a passenger ticket. Values: 1 = 1st class, 2 = 2nd class, and 3 = 3rd class.
Sex: Gender of the passenger. Values: female and male.
Age: Passenger age in years.
SibSp: Count of the passenger’s siblings/spouses aboard the Titanic.
Parch: Count of the passenger’s parents/children aboard the Titanic.
Ticket: Passenger’s ticket number.
Fare: The amount paid for the passenger’s ticket.
Cabin: Passenger’s cabin number.
Embarked: Passenger’s port of embarkation. Values: C = Cherbourg, Q = Queenstown, S = Southampton.
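
The coded values above can be turned into readable categories in R. The sketch below uses three made-up passenger rows (not real Titanic records) and base R's factor() to attach labels to the Survived, Pclass, and Embarked codes. The data frame name titanic_demo is hypothetical.

```r
# Made-up rows for illustration only; these are not real passenger records.
titanic_demo <- data.frame(
  Survived = c(0, 1, 1),
  Pclass   = c(3, 1, 2),
  Embarked = c("S", "C", "Q"),
  stringsAsFactors = FALSE
)

# Map each code to the label given in the column descriptions.
titanic_demo$Survived <- factor(titanic_demo$Survived,
                                levels = c(0, 1),
                                labels = c("no", "yes"))
titanic_demo$Pclass <- factor(titanic_demo$Pclass,
                              levels = c(1, 2, 3),
                              labels = c("1st", "2nd", "3rd"))
titanic_demo$Embarked <- factor(titanic_demo$Embarked,
                                levels = c("C", "Q", "S"),
                                labels = c("Cherbourg", "Queenstown", "Southampton"))

print(titanic_demo)
```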

The Titanic dataset is used in this course for the following reasons:

The dataset is widely known.
The dataset is not perfectly clean (e.g., some Age and Cabin values are missing).
There are many opportunities to enrich the dataset (i.e., feature engineering).
Crafting a useful machine learning model (i.e., one with high prediction accuracy) is not easy.
For interested students, there is an opportunity to apply learning via the Kaggle Titanic machine learning competition.
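
Two of these reasons, imperfect data and room for feature engineering, can be illustrated with a short base R sketch. The rows below are invented for illustration (they are not real passenger records), and FamilySize is a hypothetical engineered feature: one common choice, not necessarily the approach the course takes. The column names SibSp and Parch follow the Kaggle version of the data.

```r
# Invented rows for illustration only; NA marks a missing value.
titanic_demo <- data.frame(
  Age   = c(22, NA, 35, NA),
  SibSp = c(1, 0, 1, 0),
  Parch = c(0, 0, 2, 0),
  Cabin = c(NA, "C85", NA, NA),
  stringsAsFactors = FALSE
)

# Data cleanliness: count the missing values in each column.
missing_counts <- colSums(is.na(titanic_demo))
print(missing_counts)

# Feature engineering: siblings/spouses + parents/children + the passenger.
titanic_demo$FamilySize <- titanic_demo$SibSp + titanic_demo$Parch + 1
print(titanic_demo$FamilySize)
```

A missing-value count like this is often the first check on a new dataset, since many modeling functions need an explicit decision about how to treat NA values.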

More information on this dataset is available on the Kaggle website.

Data basics

When entering the world of machine learning, it’s common to find different names used for the same aspects of data tables. Each line below lists terms that are used as synonyms when discussing tables of data:

Table / dataset / data frame / matrix
Row / observation / example
Column / feature / predictor / independent variable / characteristic
Label / prediction / output / dependent variable

Machine learning practitioners must also consider the types of data being used. This is similar to data formats and types in technologies like Microsoft Excel and relational databases. The following are the types of data used in machine learning:

Numeric: Data that can be measured (e.g., height, weight, price).
Categorical: Data that can be divided into distinct groups or classes (e.g., US states, Olympic medals, brands of automobiles).

Numeric data can be further divided into interval and ratio data. Categorical data can be further divided into nominal and ordinal data.

The differences between interval and ratio data will be covered later in the course. The machine learning techniques used in this course do not differentiate between nominal and ordinal data.
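
In R, these data types map roughly onto numeric vectors and factors. The following sketch uses made-up values to show a numeric variable, a nominal factor, and an ordinal (ordered) factor. The variable names are illustrative assumptions.

```r
# Numeric data: values that can be measured and averaged.
price <- c(19.99, 5.50, 120.00)

# Nominal categorical data: groups with no inherent order.
state <- factor(c("Ohio", "Texas", "Ohio"))

# Ordinal categorical data: groups with a meaningful order.
medal <- factor(c("Silver", "Gold", "Bronze"),
                levels = c("Bronze", "Silver", "Gold"),
                ordered = TRUE)

is.numeric(price)    # numeric vector
mean(price)          # arithmetic is meaningful for numeric data
is.factor(state)     # categorical data stored as a factor
levels(state)        # the distinct groups
is.ordered(state)    # nominal factors carry no order
is.ordered(medal)    # ordinal factors do
medal[1] < medal[2]  # comparisons work on ordered factors (Silver vs. Gold)
```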

Age	Education	Sex	Hours Per Week	Income
39	Bachelors	Male	40	<=50K
50	Bachelors	Male	13	<=50K
38	HS-grad	Male	40	<=50K
53	11th	Male	40	<=50K
28	Bachelors	Female	40	<=50K

Adult Census Income Sample Data