Decision Tree
In this lesson, we introduce a non-parametric supervised learning model: the Decision Tree.
We'll cover the following...
What is a Decision Tree?
A Decision Tree is a non-parametric supervised learning method used for classification and regression. The goal is to create a model that predicts the value of a target variable by learning simple decision rules inferred from the data features. Decision trees are commonly used in operations research, specifically in decision analysis, to help identify a strategy most likely to reach a goal. They are also a popular tool in Machine Learning.
Decision Trees are a very intuitive approach: you ask a series of questions about a sample and follow the branch that matches each answer. At every split, a node divides into two branches.
Unlike many other models, a decision tree offers very good interpretability: you can always tell why the left branch was chosen over the right one. This is why decision trees are widely used in finance, insurance, and medicine, where knowing the reason behind a decision matters.
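For a concrete look at this interpretability, scikit-learn can print the learned rules as plain text with `export_text`. The sketch below uses the iris dataset (which this lesson models later) and an arbitrary depth limit of 2 so the rules stay short:

```python
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier, export_text

# Train a shallow tree on iris so the learned rules stay readable.
X, y = load_iris(return_X_y=True)
tree = DecisionTreeClassifier(max_depth=2, random_state=0)
tree.fit(X, y)

# export_text renders the tree as nested if/else questions -- exactly the
# "why did it choose the left branch?" interpretability described above.
rules = export_text(tree, feature_names=load_iris().feature_names)
print(rules)
```

Each printed line is one question on a feature threshold, so the path to any leaf is a human-readable explanation of that prediction.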
Comparison with other models
- Simple to understand and to interpret.
- Can handle categorical data.
- Can handle missing values.
- Inference is fast.
- Prone to overfitting.
- Performance is often worse than that of some other models.
- Some concepts are hard for trees to learn, such as XOR or parity problems.
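To illustrate the overfitting point above, the sketch below compares an unconstrained tree with a depth-limited one. It assumes the iris dataset and an arbitrary `max_depth` of 3; the split sizes and random seeds are also illustrative choices:

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)
train_x, test_x, train_y, test_y = train_test_split(
    X, y, test_size=0.2, random_state=42)

# An unconstrained tree keeps splitting until its leaves are (nearly) pure,
# effectively memorizing the training set.
full_tree = DecisionTreeClassifier(random_state=42).fit(train_x, train_y)

# Limiting max_depth (or raising min_samples_leaf) prunes the tree,
# which usually reduces overfitting.
pruned_tree = DecisionTreeClassifier(max_depth=3, random_state=42).fit(
    train_x, train_y)

print("full tree, train accuracy:  ", full_tree.score(train_x, train_y))
print("pruned tree, test accuracy: ", pruned_tree.score(test_x, test_y))
```

A large gap between training and test accuracy for the unconstrained tree is the usual symptom of overfitting.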
Modeling on the Iris dataset
Let's skip the data loading and splitting, as you can see the complete example later. Creating a decision tree is very simple: just create a DecisionTreeClassifier object from the tree module. Then, use fit to train your tree.
from sklearn.tree import DecisionTreeClassifier
tree = DecisionTreeClassifier()
# train_x,train_y is the training data and targets.
tree.fit(train_x, train_y)
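Once fitted, the tree can make predictions. The sketch below repeats the iris setup (assumed here; the full example appears later in this lesson) so that it runs on its own:

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Assumed setup: an 80/20 split of the iris dataset, as used later on.
X, y = load_iris(return_X_y=True)
train_x, test_x, train_y, test_y = train_test_split(
    X, y, test_size=0.2, random_state=42)

tree = DecisionTreeClassifier()
tree.fit(train_x, train_y)

# predict returns one class label per sample; predict_proba returns the
# class distribution of the leaf each sample lands in.
print(tree.predict(test_x[:5]))
print(tree.predict_proba(test_x[:5]))
```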
You can evaluate this model with a confusion matrix as shown below. tree is the model; test_x and test_y are the test data and targets.
import sklearn.metrics as metrics
# plot_confusion_matrix was removed in scikit-learn 1.2;
# ConfusionMatrixDisplay.from_estimator is the current equivalent.
metrics.ConfusionMatrixDisplay.from_estimator(tree, test_x, test_y)
import sklearn.datasets as datasets
from sklearn.tree import DecisionTreeClassifier
import sklearn.tree as sktree
from sklearn.model_selection import train_test_split
import sklearn.metrics as metrics

X, y = datasets.load_iris(return_X_y=True)
print("iris data size is {}".format(X.shape))
print("iris target size is {}".format(y.shape))
print("The first five samples of iris {}".format(X[:5]))
train_x, test_x, train_y, test_y = train_test_split(
    X, y, test_size=0.2, random_state=42)

tree = DecisionTreeClassifier()
tree.fit(train_x, train_y)

pred_y = tree.predict(test_x)
cr = metrics.classification_report(test_y, pred_y)
print(cr)
- The iris dataset is loaded and split into a training set and a test set with train_test_split.
- A decision tree object is created from DecisionTreeClassifier() and trained with fit.
- This example uses classification_report to evaluate the performance of our model on the test set; you can check the printed output.
As a tree model, this model should look like a real tree with branches and leaves. sklearn provides a very useful function, plot_tree, to plot your tree. Below is the result for one tree (please ignore the actual data; this just shows what the tree looks like).
How the decision tree splits
As you can see from the output above, a ...