Machine Learning

Train and evaluate a single decision tree, bagged decision trees, and a random forest.

Since our focus is machine learning, let's split the data into training and test sets and move on to training the models.

# Separating the features and the target into X and y
X = df.drop('target', axis=1)  # X holds the features, so we drop the target column
y = df['target']               # y is the target
# Splitting into training and test sets
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.30, random_state=42)

We'll start by training a single decision tree and then compare the results with a random forest.

Single decision tree

Let's train a single decision tree. The default splitting criterion is Gini impurity; here we set it to entropy (information gain is computed from entropy).

# Importing the decision tree classifier
from sklearn.tree import DecisionTreeClassifier
# Creating an instance "dtree" of the classifier
dtree = DecisionTreeClassifier(criterion='entropy')
# Fitting to the training data; the default parameters are fine for now
dtree.fit(X_train, y_train)

Notice that we’re leaving everything as default, other than the criterion.
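As a quick reference, here is a minimal sketch of how these two impurity measures are computed from a node's class proportions. The helper functions below are illustrative only and are not part of scikit-learn's API.

import numpy as np

def gini_impurity(class_probs):
    # Gini = 1 - sum(p_i^2)
    p = np.asarray(class_probs)
    return 1.0 - np.sum(p ** 2)

def entropy(class_probs):
    # Entropy = -sum(p_i * log2(p_i)), ignoring zero-probability classes
    p = np.asarray(class_probs)
    p = p[p > 0]
    return -np.sum(p * np.log2(p))

# Example: a node with a 70/30 class split
print(gini_impurity([0.7, 0.3]))  # 0.42
print(entropy([0.7, 0.3]))        # ~0.881

A pure node (all samples from one class) gives 0 for both measures, and the more mixed a node is, the higher they get, which is why the tree prefers splits that drive these values down.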

Prediction and evaluation

Evaluation is important because it shows how well the model performs on unseen data.

# Making predictions on the test set
dtree_pred = dtree.predict(X_test)
# Imports for the evaluation metrics
from sklearn.metrics import classification_report, confusion_matrix
# Displaying the evaluation results
print(classification_report(y_test, dtree_pred))
print(confusion_matrix(y_test, dtree_pred))
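As a quick sanity check, the overall accuracy can be recovered from the confusion matrix itself, since correct predictions sit on its diagonal. A minimal sketch:

import numpy as np
# Correct predictions lie on the diagonal of the confusion matrix
cm = confusion_matrix(y_test, dtree_pred)
accuracy = np.trace(cm) / cm.sum()
print(f"Accuracy from the confusion matrix: {accuracy:.3f}")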

With a single decision tree, we can see that the model is mislabeling some samples. We also know that decision trees are prone to overfitting, which limits generalization and leads to poor performance on unseen data.
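One common way to keep a single tree from overfitting is to constrain its growth. Here is a minimal sketch; the max_depth and min_samples_leaf values are assumptions for illustration, not values from this lesson. Comparing training and test accuracy hints at how much the tree is overfitting.

from sklearn.tree import DecisionTreeClassifier

# A shallower, more constrained tree (assumed hyperparameter values)
shallow_tree = DecisionTreeClassifier(
    criterion='entropy',
    max_depth=4,          # limit the depth of the tree
    min_samples_leaf=5,   # require at least 5 samples in each leaf
    random_state=42)
shallow_tree.fit(X_train, y_train)
# A large gap between these two scores suggests overfitting
print('train accuracy:', shallow_tree.score(X_train, y_train))
print('test accuracy :', shallow_tree.score(X_test, y_test))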

Bagged decision trees

We learned about bagging (bootstrap aggregation) as a general-purpose procedure for reducing the high variance of a model. So if we opt for bagged decision trees, they are expected to perform better than a single decision tree. However, because the bagged trees are structurally similar, their predictions remain strongly correlated. The random forest method is therefore generally preferred over a single tree or even bagged trees. Let's try bagged trees and then move on to the random forest for comparison.

# Import required for bagging
from sklearn.ensemble import BaggingClassifier
# Creating an instance for bagging and passing the decision tree classifier along with other parameters
base_estimator = DecisionTreeClassifier(criterion='entropy')  # base estimator for BaggingClassifier
bagged_trees = BaggingClassifier(
    base_estimator=base_estimator,  # note: in scikit-learn >= 1.2 this parameter is named `estimator`
    n_estimators=5,           # number of trees we want; try different numbers
    bootstrap=True,           # default value: bootstrap the samples (rows)
    bootstrap_features=True,  # in case we want to bootstrap the features as well
    max_features=8,           # maximum number of features in each bootstrapped sample
    random_state=42)          # ensure reproducible results
bagged_trees.fit(X_train, y_train)  # fitting/training

We have trained five bagged trees, and the final prediction for any test sample comes from a vote across these bagged trees (the base estimators). Since we have set the classifier to bootstrap the features (columns), let's see which features are used to train the first two bagged trees. Please note that changing the random_state changes the bootstrapped samples and features, and therefore the results.
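A minimal sketch of that inspection, using the fitted classifier's estimators_features_ attribute, which stores the column indices drawn for each base estimator:

# Feature (column) indices drawn for the first two bagged trees
for i, feat_idx in enumerate(bagged_trees.estimators_features_[:2]):
    idx = sorted(feat_idx)
    print(f"Bagged tree {i} was trained on columns {idx}: {list(X.columns[idx])}")

For the comparison promised above, a random forest can be trained in much the same way. The hyperparameters below (e.g., n_estimators=100) are assumed illustrative values rather than settings from this lesson.

from sklearn.ensemble import RandomForestClassifier

# Training a random forest for comparison with the single and bagged trees
rfc = RandomForestClassifier(n_estimators=100, criterion='entropy', random_state=42)
rfc.fit(X_train, y_train)
rfc_pred = rfc.predict(X_test)
print(classification_report(y_test, rfc_pred))
print(confusion_matrix(y_test, rfc_pred))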
