Exercise: A Decision Tree in scikit-learn
Learn how to model a decision tree on our case study data and visualize it using graphviz.
Modeling a decision tree on the case study data
In this exercise, we will use the case study data to grow a decision tree, where we specify the maximum depth. We'll also use some handy functionality to visualize the decision tree, in the form of the graphviz package. Perform the following steps to complete the exercise:
1. Load several of the packages that we've been using, and an additional one, graphviz, so that we can visualize decision trees:

import numpy as np               # numerical computation
import pandas as pd              # data wrangling
import matplotlib.pyplot as plt  # plotting package
# Next line helps with rendering plots
%matplotlib inline
import matplotlib as mpl         # additional plotting functionality
mpl.rcParams['figure.dpi'] = 400 # high res figures
import graphviz                  # to visualize decision trees
2. Load the cleaned case study data:

df = pd.read_csv('Chapter_1_cleaned_data.csv')
3. Get a list of column names of the DataFrame:

features_response = df.columns.tolist()
4. Make a list of columns to remove that aren't features or the response variable:

items_to_remove = ['ID', 'SEX', 'PAY_2', 'PAY_3', 'PAY_4', 'PAY_5', 'PAY_6',
                   'EDUCATION_CAT', 'graduate school', 'high school', 'none',
                   'others', 'university']
5. Use a list comprehension to remove these column names from our list of features and the response variable:

features_response = [item for item in features_response if item not in items_to_remove]
features_response

This should output the list of features and the response variable:

['LIMIT_BAL', 'EDUCATION', 'MARRIAGE', 'AGE', 'PAY_1', 'BILL_AMT1', 'BILL_AMT2',
 'BILL_AMT3', 'BILL_AMT4', 'BILL_AMT5', 'BILL_AMT6', 'PAY_AMT1', 'PAY_AMT2',
 'PAY_AMT3', 'PAY_AMT4', 'PAY_AMT5', 'PAY_AMT6', 'default payment next month']
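As a quick optional check (an addition here, not one of the exercise steps), you can confirm that 18 names remain and that the response variable is the last element:

# Optional sanity check (illustration only): 17 features plus the response
print(len(features_response))  # expected: 18
print(features_response[-1])   # expected: 'default payment next month'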
Now the list of features is prepared. Next, we will make some imports from scikit-learn. We want to make a train/test split, which we are already familiar with. We also want to import the decision tree functionality.
6. Run this code to make imports from scikit-learn:

from sklearn.model_selection import train_test_split
from sklearn import tree
The tree module of scikit-learn contains decision tree-related classes.
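For instance (an illustration added here, not part of the exercise), the two main decision tree estimators can also be imported directly:

# Illustration only: the main decision tree estimators in sklearn.tree
from sklearn.tree import DecisionTreeClassifier  # for classification targets
from sklearn.tree import DecisionTreeRegressor   # for continuous targets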
7. Split the data into training and testing sets using the same random seed that we have used throughout the course:

X_train, X_test, y_train, y_test = train_test_split(
    df[features_response[:-1]].values,
    df['default payment next month'].values,
    test_size=0.2, random_state=24)
Here, we use all but the last element of the list, features_response[:-1], to get the names of the features but not the response variable. We use this to select columns from the DataFrame and then retrieve their values using the .values attribute. We do something similar for the response variable, but specify the column name directly. In making the train/test split, we've used the same random seed as in previous work, as well as the same split size. This way, we can directly compare the work we do in this section with previous results. Also, we continue to reserve the same “unseen test set” from the model development process.

Now we are ready to instantiate the decision tree class.
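Before doing so, a quick optional check (an addition here, not one of the exercise steps) can confirm the split came out as expected:

# Optional check (illustration only): shapes and class balance of the split
print(X_train.shape, X_test.shape)    # expect roughly an 80/20 row split
print(y_train.mean(), y_test.mean())  # fraction of samples that defaulted in each split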
8. Instantiate the decision tree class by setting the max_depth parameter to 2:

dt = tree.DecisionTreeClassifier(max_depth=2)
We have used the DecisionTreeClassifier class because we have a classification problem. Because we specified max_depth=2, when we grow the decision tree using the case study data, the tree will grow to a depth of at most 2. Let's now train this model.
Use this code to fit the decision tree model and grow the tree:
dt.fit(X_train, y_train)
This should display the following output:
DecisionTreeClassifier(max_depth=2)
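If you'd like to confirm that the fitted tree respected the depth limit, scikit-learn's tree estimators expose get_depth() and get_n_leaves(); this is an optional check, not one of the exercise steps:

# Optional check (illustration only): depth and leaf count of the fitted tree
print(dt.get_depth())     # should be at most 2
print(dt.get_n_leaves())  # a depth-2 binary tree has at most 4 leaves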
Now that we have fit this decision tree model, we can use the graphviz package to display a graphical representation of the tree.
10. Export the trained model in a format that can be read by the graphviz package using this code:

dot_data = tree.export_graphviz(dt, out_file=None, filled=True, rounded=True,
                                feature_names=features_response[:-1],
                                proportion=True,
                                class_names=['Not defaulted', 'Defaulted'])
Here, we’ve provided a number of options for the
.export_graphviz
method. First, we need to say which trained model we’d like to graph, which isdt
. Next, we say we don’t want an output file:out_file=None
. Instead, we provide thedot_data
variable to hold the output of this method.The rest of the options are set as follows:
- filled=True: Each node will be filled with a color.
- rounded=True: The nodes will appear with rounded edges as opposed to rectangles.
- feature_names=features_response[:-1]: The names of the features from our list will be used, as opposed to generic names such as X[0].
- proportion=True: The proportion of training samples in each node will be displayed (we'll discuss this more later).
- class_names=['Not defaulted', 'Defaulted']: The name of the predicted class will be displayed for each node.
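As the alternative mentioned above (an addition to the exercise, not one of its steps), recent versions of scikit-learn also include a matplotlib-based plotter, sklearn.tree.plot_tree, which accepts similar options and doesn't require the graphviz package; a minimal sketch:

# Illustration only: a matplotlib-based alternative to graphviz
from sklearn.tree import plot_tree

plt.figure(figsize=(8, 5))
plot_tree(dt, filled=True, rounded=True,
          feature_names=features_response[:-1],
          proportion=True,
          class_names=['Not defaulted', 'Defaulted'])
plt.show()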
What is the output of export_graphviz?

If you examine the contents of dot_data, you will see that it is a long text string. The graphviz package can interpret this text string to create a visualization.
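For instance, a quick peek at the beginning of the string (illustration only):

# Illustration only: the first part of the DOT-format text produced by export_graphviz
print(dot_data[:200])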
11. Use the Source class of the graphviz package to create an image from dot_data and display it:

graph = graphviz.Source(dot_data)
graph
The output should be a diagram of the trained tree: a root node with up to two levels of splits below it, drawn as filled, rounded boxes showing the splitting feature and threshold, the proportion of training samples, and the predicted class name at each node.
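If you'd also like to save the image to a file, the graphviz Source object provides a render method (an optional extra, not one of the exercise steps; the filename here is arbitrary):

# Optional (illustration only): save the tree diagram as a PNG file
graph.render('decision_tree_depth_2', format='png', cleanup=True)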