Decision Tree

In this lesson, we introduce a non-parametric supervised learning model: the Decision Tree.

What is Decision Tree

A Decision Tree is a non-parametric supervised learning method used for classification and regression. The goal is to create a model that predicts the value of a target variable by learning simple decision rules inferred from the data features. Decision trees are commonly used in operations research, specifically in decision analysis, to help identify a strategy most likely to reach a goal. They are also a popular tool in Machine Learning.

Decision Trees are a very intuitive approach: the model asks a series of questions about a sample's features and follows a branch according to each answer. In the common binary form, every internal node splits into exactly two branches.

Unlike many other models, a decision tree has very good interpretability: you can always see why the left branch was chosen instead of the right one. Decision trees are therefore widely used in finance, insurance, and medicine, where practitioners need to know the reason behind each decision.
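This interpretability can be seen directly in scikit-learn: the export_text function prints the learned decision rules as plain text. Below is a minimal sketch (the max_depth=2 setting and the iris dataset are choices made here for illustration, not part of the lesson's example):

```python
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier, export_text

iris = load_iris()

# A shallow tree keeps the printed rules short and readable.
tree = DecisionTreeClassifier(max_depth=2, random_state=0)
tree.fit(iris.data, iris.target)

# export_text renders every split as a human-readable if/else rule.
rules = export_text(tree, feature_names=iris.feature_names)
print(rules)
```

Each line of the output shows one branching question, so you can trace exactly why any sample ends up in a given leaf.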

Comparison with other models

Advantages:

  • Simple to understand and to interpret.
  • Can handle categorical data.
  • Can handle missing values.
  • Inference is fast.

Disadvantages:

  • Prone to overfitting.
  • Performance is often worse than that of some other models.
  • Some tasks cannot be learned by a single tree.
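The overfitting point is worth seeing in code. One common remedy (an assumption of this sketch, not covered yet in the lesson) is to limit the tree's depth with the max_depth parameter:

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)
train_x, test_x, train_y, test_y = train_test_split(
    X, y, test_size=0.2, random_state=42)

# An unconstrained tree keeps splitting until it memorizes the training set.
deep = DecisionTreeClassifier(random_state=42).fit(train_x, train_y)
print("deep tree train accuracy:", deep.score(train_x, train_y))

# Capping the depth is one simple way to restrain the tree.
shallow = DecisionTreeClassifier(max_depth=3, random_state=42).fit(train_x, train_y)
print("shallow tree test accuracy:", shallow.score(test_x, test_y))
```

The unconstrained tree typically reaches perfect training accuracy, which is exactly the memorization behavior that hurts generalization on harder datasets.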

Modeling on the Iris

Let’s skip the data loading and splitting; you can see the complete example later. Creating a decision tree is very simple: just create a DecisionTreeClassifier object from the tree module, then call its fit method to train the tree.

from sklearn.tree import DecisionTreeClassifier

tree = DecisionTreeClassifier()
# train_x and train_y are the training data and targets.
tree.fit(train_x, train_y)

You can evaluate this model with a confusion matrix, as shown below. Here tree is the model, and test_x and test_y are the test data. (Note: plot_confusion_matrix was removed in scikit-learn 1.2; ConfusionMatrixDisplay.from_estimator is the current equivalent.)

import sklearn.metrics as metrics

metrics.ConfusionMatrixDisplay.from_estimator(tree, test_x, test_y)
import sklearn.datasets as datasets
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import train_test_split
import sklearn.metrics as metrics

X, y = datasets.load_iris(return_X_y=True)
print("iris data size is {}".format(X.shape))
print("iris target size is {}".format(y.shape))
print("The first five samples of iris {}".format(X[:5]))

# Hold out 20% of the data for testing.
train_x, test_x, train_y, test_y = train_test_split(X, y,
                                                    test_size=0.2,
                                                    random_state=42)

tree = DecisionTreeClassifier()
tree.fit(train_x, train_y)
pred_y = tree.predict(test_x)

cr = metrics.classification_report(test_y, pred_y)
print(cr)
  • First, the iris dataset is loaded and split into a training set and a test set.

  • A decision tree object is created with DecisionTreeClassifier() and trained with fit.

  • This example uses classification_report to evaluate the performance of the model; the report is printed at the end.

As a tree model, it should look like a real tree, with branches and leaves. sklearn provides a very useful function, plot_tree, to draw your tree. Below is an example of a plotted tree (please ignore the actual data; it just shows what a tree looks like).
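A minimal sketch of how plot_tree might be called is shown below (the figure size, depth limit, and output file name are illustrative choices, not taken from the lesson):

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend, so this also runs headless
import matplotlib.pyplot as plt
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier, plot_tree

iris = load_iris()
tree = DecisionTreeClassifier(max_depth=2, random_state=0)
tree.fit(iris.data, iris.target)

# Draw the fitted tree; each node shows its split rule, impurity,
# sample count, and class distribution.
fig, ax = plt.subplots(figsize=(8, 5))
plot_tree(tree,
          feature_names=iris.feature_names,
          class_names=list(iris.target_names),
          filled=True,
          ax=ax)
fig.savefig("iris_tree.png")  # or plt.show() in an interactive session
```

Setting filled=True colors each node by its majority class, which makes the class regions easy to spot at a glance.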

How the decision tree splits

As you can see from the output above, a ...