Machine Learning
Train a single decision tree, bagged decision trees, and a random forest, and evaluate their performance.
Since our focus is machine learning, let's split the data and move on to training the models.
# Separating features and the target into X, y
X = df.drop('target', axis=1)  # X contains the features; drop the target column
y = df['target']               # y is the target

# train_test_split
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.30, random_state=42)
We'll start by training a single decision tree and then compare its results with a random forest.
Single decision tree
Let's train a single decision tree. The default splitting criterion is Gini impurity; here we set it to entropy instead (information gain is computed from entropy).
# importing the decision tree classifier
from sklearn.tree import DecisionTreeClassifier

# creating an instance "dtree" of the classifier
dtree = DecisionTreeClassifier(criterion='entropy')

# fitting to the training data; the default parameters are fine at the moment
dtree.fit(X_train, y_train)
Notice that we’re leaving everything as default, other than the criterion.
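As a quick aside on that one change, here is a small illustration (not part of the pipeline, with made-up class probabilities) of how entropy and Gini impurity score the same node:

import numpy as np

# class probabilities for an illustrative node: 80% of one class, 20% of the other
p = np.array([0.8, 0.2])

entropy = -np.sum(p * np.log2(p))    # information gain is computed from this quantity
gini = 1 - np.sum(p ** 2)            # the scikit-learn default criterion

print(f"entropy: {entropy:.3f}")     # ~0.722
print(f"gini impurity: {gini:.3f}")  # ~0.320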
Prediction and evaluation
Evaluation is important because it shows how well the model performs on data it has not seen during training.
# making predictions on the test set
dtree_pred = dtree.predict(X_test)

# imports for the evaluation metrics
from sklearn.metrics import classification_report, confusion_matrix

# display evaluations
print(classification_report(y_test, dtree_pred))
print(confusion_matrix(y_test, dtree_pred))
With a single decision tree, the model mislabels some of the test samples. We also know that decision trees are easy to overfit, which limits generalization and leads to poor performance on unseen data.
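One quick way to see that tendency is to compare accuracy on the training data with accuracy on the test data; a large gap suggests the tree has memorized the training set. This is only a sketch, reusing the dtree model and the splits created above:

from sklearn.metrics import accuracy_score

# accuracy on the data the tree was fitted on
train_acc = accuracy_score(y_train, dtree.predict(X_train))
# accuracy on the held-out test data
test_acc = accuracy_score(y_test, dtree.predict(X_test))

print(f"train accuracy: {train_acc:.3f}")
print(f"test accuracy:  {test_acc:.3f}")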
Bagged decision trees
We learned about bagging (bootstrap aggregation) as a general-purpose procedure for reducing the high variance of decision trees. So, if we opt for bagged decision trees, they are expected to perform better than a single decision tree. However, because the bagged trees share a similar structure, their predictions remain strongly correlated. The random forest method, which additionally decorrelates the trees, is generally preferred over both a single tree and bagged trees. Let's try bagged trees and then move on to the random forest for comparison.
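Before handing the work to scikit-learn, here is a minimal sketch of what bagging does under the hood: each tree is trained on a bootstrap sample of the rows, and the final prediction is a majority vote. It assumes the X_train, y_train, and X_test splits from above; the variable names are illustrative only.

import numpy as np
from scipy import stats
from sklearn.tree import DecisionTreeClassifier

n_bags = 5                           # illustrative number of bagged trees
rng = np.random.RandomState(42)
all_predictions = []

for _ in range(n_bags):
    # bootstrap sample: draw rows with replacement
    idx = rng.randint(0, len(X_train), size=len(X_train))
    tree = DecisionTreeClassifier(criterion='entropy')
    tree.fit(X_train.iloc[idx], y_train.iloc[idx])
    all_predictions.append(tree.predict(X_test))

# majority vote across the individual trees
bagged_vote = stats.mode(np.array(all_predictions), axis=0)[0].ravel()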
# import required for bagging
from sklearn.ensemble import BaggingClassifier

# creating an instance for bagging and passing the dtree classifier along with other parameters
base_estimator = DecisionTreeClassifier(criterion='entropy')  # base estimator for BaggingClassifier
bagged_trees = BaggingClassifier(base_estimator=base_estimator,
                                 n_estimators=5,           # number of trees we want; try different numbers
                                 bootstrap=True,           # default value
                                 bootstrap_features=True,  # in case we want to bootstrap features as well
                                 max_features=8,           # maximum number of features in each bootstrapped sample
                                 random_state=42)          # ensure reproducible results
bagged_trees.fit(X_train, y_train)  # fitting/training
We have trained five bagged trees, and the final prediction for any test point comes from a vote over these trees (each built from the base estimator). Since we set the module to bootstrap features (columns), let's see which features were used to train the first two bagged trees. Please note that changing the random_state
...
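Because we set bootstrap_features=True, the fitted BaggingClassifier records which feature columns were drawn for each tree in its estimators_features_ attribute. A minimal sketch of inspecting the first two trees, assuming the bagged_trees model and the feature frame X from above:

# indices of the feature columns drawn for the first two bagged trees
for i, feature_idx in enumerate(bagged_trees.estimators_features_[:2]):
    print(f"Tree {i}: {list(X.columns[feature_idx])}")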