Decision Tree

In this lesson, we introduce a non-parametric supervised learning model: the Decision Tree.

What is Decision Tree

A Decision Tree is a non-parametric supervised learning method used for classification and regression. The goal is to create a model that predicts the value of a target variable by learning simple decision rules inferred from the data features. Decision trees are commonly used in operations research, specifically in decision analysis, to help identify a strategy most likely to reach a goal. They are also a popular tool in Machine Learning.

Decision Trees are a very intuitive approach: the model asks a series of questions about a sample's features and follows a branch according to each answer. In the common binary form, every internal node splits into exactly two branches.

Unlike many other models, a decision tree has very good interpretability: you can always see why the left branch was chosen instead of the right one. Decision trees are therefore widely used in finance, insurance, and medicine, where practitioners need to know the reason behind each decision.
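This interpretability can be seen directly in scikit-learn: the export_text function prints the learned decision rules as plain text. Below is a minimal sketch (the max_depth=2 setting and the iris dataset are choices made here for illustration, not part of the lesson's example):

```python
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier, export_text

iris = load_iris()

# A shallow tree keeps the printed rules short and readable.
tree = DecisionTreeClassifier(max_depth=2, random_state=0)
tree.fit(iris.data, iris.target)

# export_text renders every split as a human-readable if/else rule.
rules = export_text(tree, feature_names=iris.feature_names)
print(rules)
```

Each line of the output shows one branching question, so you can trace exactly why any sample ends up in a given leaf.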

Comparison with other models

Advantages:

  • Simple to understand and to interpret.
  • Can handle categorical data.
  • Can handle missing values.
  • Inference is fast.

Disadvantages:

  • Prone to overfitting.
  • Performance is often worse than that of some other models.
  • Some tasks cannot be learned by a single tree.
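The overfitting point is worth seeing in code. One common remedy (an assumption of this sketch, not covered yet in the lesson) is to limit the tree's depth with the max_depth parameter:

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)
train_x, test_x, train_y, test_y = train_test_split(
    X, y, test_size=0.2, random_state=42)

# An unconstrained tree keeps splitting until it memorizes the training set.
deep = DecisionTreeClassifier(random_state=42).fit(train_x, train_y)
print("deep tree train accuracy:", deep.score(train_x, train_y))

# Capping the depth is one simple way to restrain the tree.
shallow = DecisionTreeClassifier(max_depth=3, random_state=42).fit(train_x, train_y)
print("shallow tree test accuracy:", shallow.score(test_x, test_y))
```

The unconstrained tree typically reaches perfect training accuracy, which is exactly the memorization behavior that hurts generalization on harder datasets.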

Modeling on the Iris

Let’s skip the data loading and splitting; you can see the complete example later. Creating a decision tree is very simple: just create a DecisionTreeClassifier object from the tree module, then call its fit method to train the tree.

from sklearn.tree import DecisionTreeClassifier

tree = DecisionTreeClassifier()
# train_x and train_y are the training data and targets.
tree.fit(train_x, train_y)

You can evaluate this model with a confusion matrix, as shown below. Here tree is the model, and test_x and test_y are the test data. (Note: plot_confusion_matrix was removed in scikit-learn 1.2; ConfusionMatrixDisplay.from_estimator is the current equivalent.)

import sklearn.metrics as metrics

metrics.ConfusionMatrixDisplay.from_estimator(tree, test_x, test_y)
import sklearn.datasets as datasets
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import train_test_split
import sklearn.metrics as metrics

X, y = datasets.load_iris(return_X_y=True)
print("iris data size is {}".format(X.shape))
print("iris target size is {}".format(y.shape))
print("The first five samples of iris {}".format(X[:5]))

# Hold out 20% of the data for testing.
train_x, test_x, train_y, test_y = train_test_split(X, y,
                                                    test_size=0.2,
                                                    random_state=42)

tree = DecisionTreeClassifier()
tree.fit(train_x, train_y)
pred_y = tree.predict(test_x)

cr = metrics.classification_report(test_y, pred_y)
print(cr)
  • First, the iris dataset is loaded and split into a training set and a test set.

  • A decision tree object is created with DecisionTreeClassifier() and trained with fit.

  • This example uses classification_report to evaluate the performance of the model; the report is printed at the end.

As a tree model, it should look like a real tree, with branches and leaves. sklearn provides a very useful function, plot_tree, to draw your tree. Below is an example of a plotted tree (please ignore the actual data; it just shows what a tree looks like).
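A minimal sketch of how plot_tree might be called is shown below (the figure size, depth limit, and output file name are illustrative choices, not taken from the lesson):

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend, so this also runs headless
import matplotlib.pyplot as plt
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier, plot_tree

iris = load_iris()
tree = DecisionTreeClassifier(max_depth=2, random_state=0)
tree.fit(iris.data, iris.target)

# Draw the fitted tree; each node shows its split rule, impurity,
# sample count, and class distribution.
fig, ax = plt.subplots(figsize=(8, 5))
plot_tree(tree,
          feature_names=iris.feature_names,
          class_names=list(iris.target_names),
          filled=True,
          ax=ax)
fig.savefig("iris_tree.png")  # or plt.show() in an interactive session
```

Setting filled=True colors each node by its majority class, which makes the class regions easy to spot at a glance.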

How the decision tree splits

As you can see from the output above, a ...