Random forest is an extension of the decision tree algorithm: instead of relying on a single tree, it trains a collection of decision trees, each on a random subset of the data and features, and combines their predictions. This randomness reduces overfitting and improves generalization, typically yielding more accurate and robust predictions than any single tree.
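To make the idea concrete, here is a minimal sketch of bagging with majority voting, built from plain decision trees on a synthetic dataset (this toy example is our own illustration, not part of the tutorial's code):

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier

# Toy data, only to illustrate the idea of bagging.
X, y = make_classification(n_samples=200, n_features=4, random_state=0)

rng = np.random.default_rng(0)
trees = []
for _ in range(10):
    # Each tree sees a bootstrap sample (random rows, drawn with replacement)
    # and considers a random subset of features at each split (max_features).
    rows = rng.integers(0, len(X), size=len(X))
    tree = DecisionTreeClassifier(max_features='sqrt', random_state=0)
    trees.append(tree.fit(X[rows], y[rows]))

# The forest's prediction is a majority vote over the individual trees.
votes = np.array([tree.predict(X) for tree in trees])
forest_pred = (votes.mean(axis=0) >= 0.5).astype(int)
print('Training accuracy of the hand-rolled forest:', (forest_pred == y).mean())
```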
The following steps demonstrate the process of training and visualizing a random forest model using the provided dataset.
In the first step, we import the necessary libraries.
```python
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd
```
After importing the libraries, we load the dataset from a CSV file.
```python
items = pd.read_csv('Data.csv')
label1 = items.iloc[:, [2, 3]].values
label2 = items.iloc[:, 4].values
```
Here, we use pandas' `iloc` indexer to assign `label1` the values of the feature columns (columns 2 and 3) and `label2` the values of the target column (column 4).
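If you are unsure what columns 2, 3, and 4 contain, a quick inspection helps (the exact column names depend on your `Data.csv`):

```python
print(items.head())     # preview the first rows
print(items.columns)    # column names; indices 2, 3, 4 are counted from 0
```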
In this step, we split the data into training and test sets using the `train_test_split` function. The test set is set to 25% of the entire dataset, and `random_state` is fixed to ensure the split is reproducible.
```python
from sklearn.model_selection import train_test_split

label1_train, label1_test, label2_train, label2_test = train_test_split(
    label1, label2, test_size=0.25, random_state=0)
```
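A quick shape check confirms the 75/25 split (the exact numbers depend on how many rows `Data.csv` has):

```python
# For example, a 400-row dataset would give (300, 2) and (100, 2).
print(label1_train.shape, label1_test.shape)
```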
In this step, we scale the input features `label1_train` and `label1_test` so that each feature has zero mean and unit variance. Note that the scaler is fitted on the training set only and then applied to the test set.
```python
from sklearn.preprocessing import StandardScaler

scaling = StandardScaler()
label1_train = scaling.fit_transform(label1_train)
label1_test = scaling.transform(label1_test)
```
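As a sanity check, the transformed training features should now have roughly zero mean and unit variance per column:

```python
print(label1_train.mean(axis=0).round(6))  # approximately [0, 0]
print(label1_train.std(axis=0).round(6))   # approximately [1, 1]
```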
Here, we fit the random forest model on the training dataset.
```python
from sklearn.ensemble import RandomForestClassifier

model = RandomForestClassifier(n_estimators=10, criterion='gini', random_state=0)
model.fit(label1_train, label2_train)
```
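Once the model is fitted, we can optionally inspect how much each of the two features contributes to the forest's splits; this step is not required for the rest of the tutorial:

```python
# Impurity-based importances of the two feature columns; they sum to 1.
print(model.feature_importances_)
```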
Here, we predict the target variable `label2` for the test features `label1_test`.
```python
prediction = model.predict(label1_test)

from sklearn.metrics import confusion_matrix
matrix = confusion_matrix(label2_test, prediction)
print(matrix)
```
We also build a confusion matrix to count how many predictions were correct and incorrect.
The entries on the main diagonal of the matrix are the correctly classified samples; everything off the diagonal is a misclassification.
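The total number of incorrect predictions (and the overall accuracy) can then be read off `matrix` directly:

```python
incorrect = matrix.sum() - np.trace(matrix)  # total misclassified test samples
accuracy = np.trace(matrix) / matrix.sum()   # fraction classified correctly
print(incorrect, accuracy)
```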
Then we plot the decision boundaries of the trained random forest classifier on the training set, overlaid with a scatter plot of the training points.
```python
from matplotlib.colors import ListedColormap

seq1, seq2 = label1_train, label2_train

# Build a fine grid covering the feature space.
grid1, grid2 = np.meshgrid(
    np.arange(start=seq1[:, 0].min() - 1, stop=seq1[:, 0].max() + 1, step=0.01),
    np.arange(start=seq1[:, 1].min() - 1, stop=seq1[:, 1].max() + 1, step=0.01))

# Colour each grid point by the class the model predicts for it.
plt.contourf(grid1, grid2,
             model.predict(np.array([grid1.ravel(), grid2.ravel()]).T).reshape(grid1.shape),
             alpha=0.75, cmap=ListedColormap(('lightblue', 'peachpuff')))
plt.xlim(grid1.min(), grid1.max())
plt.ylim(grid2.min(), grid2.max())

# Scatter the actual training points, coloured by their true class.
for key, value in enumerate(np.unique(seq2)):
    plt.scatter(seq1[seq2 == value, 0], seq1[seq2 == value, 1],
                c=ListedColormap(('mediumturquoise', 'lightsalmon'))(key), label=value)

plt.title('Training set of random forest')
plt.xlabel('Age')
plt.ylabel('Estimated salary')
plt.legend()
plt.savefig('output/1_training.png')
```
Here we save the generated visualization as an image file named `1_training.png` in the `output` folder using `plt.savefig`.
Similarly, we create the same plot for the test set to visualize how the decision boundaries hold up on unseen data.
```python
from matplotlib.colors import ListedColormap

seq1, seq2 = label1_test, label2_test
grid1, grid2 = np.meshgrid(
    np.arange(start=seq1[:, 0].min() - 1, stop=seq1[:, 0].max() + 1, step=0.01),
    np.arange(start=seq1[:, 1].min() - 1, stop=seq1[:, 1].max() + 1, step=0.01))
plt.contourf(grid1, grid2,
             model.predict(np.array([grid1.ravel(), grid2.ravel()]).T).reshape(grid1.shape),
             alpha=0.75, cmap=ListedColormap(('lightblue', 'peachpuff')))
plt.xlim(grid1.min(), grid1.max())
plt.ylim(grid2.min(), grid2.max())
for key, value in enumerate(np.unique(seq2)):
    plt.scatter(seq1[seq2 == value, 0], seq1[seq2 == value, 1],
                c=ListedColormap(('mediumturquoise', 'lightsalmon'))(key), label=value)
plt.title('Test set of random forest')
plt.xlabel('Age')
plt.ylabel('Estimated salary')
plt.legend()
plt.savefig('output/2_testing.png')
```
Here we save the generated visualization as an image file named `2_testing.png` in the `output` folder using `plt.savefig`.
Finally, here is the complete script from all the steps above:

```python
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd

# Load the dataset: columns 2 and 3 are the features, column 4 is the target.
items = pd.read_csv('Data.csv')
label1 = items.iloc[:, [2, 3]].values
label2 = items.iloc[:, 4].values

from sklearn.model_selection import train_test_split
label1_train, label1_test, label2_train, label2_test = train_test_split(
    label1, label2, test_size=0.25, random_state=0)

from sklearn.preprocessing import StandardScaler
scaling = StandardScaler()
label1_train = scaling.fit_transform(label1_train)
label1_test = scaling.transform(label1_test)

from sklearn.ensemble import RandomForestClassifier
model = RandomForestClassifier(n_estimators=10, criterion='gini', random_state=0)
model.fit(label1_train, label2_train)

prediction = model.predict(label1_test)
from sklearn.metrics import confusion_matrix
matrix = confusion_matrix(label2_test, prediction)
print(matrix)

from matplotlib.colors import ListedColormap

# Decision boundary plot for the training set.
seq1, seq2 = label1_train, label2_train
grid1, grid2 = np.meshgrid(
    np.arange(start=seq1[:, 0].min() - 1, stop=seq1[:, 0].max() + 1, step=0.01),
    np.arange(start=seq1[:, 1].min() - 1, stop=seq1[:, 1].max() + 1, step=0.01))
plt.contourf(grid1, grid2,
             model.predict(np.array([grid1.ravel(), grid2.ravel()]).T).reshape(grid1.shape),
             alpha=0.75, cmap=ListedColormap(('lightblue', 'peachpuff')))
plt.xlim(grid1.min(), grid1.max())
plt.ylim(grid2.min(), grid2.max())
for key, value in enumerate(np.unique(seq2)):
    plt.scatter(seq1[seq2 == value, 0], seq1[seq2 == value, 1],
                c=ListedColormap(('mediumturquoise', 'lightsalmon'))(key), label=value)
plt.title('Training set of random forest')
plt.xlabel('Age')
plt.ylabel('Estimated salary')
plt.legend()
plt.savefig('output/1_training.png')
plt.show()

# Decision boundary plot for the test set.
seq1, seq2 = label1_test, label2_test
grid1, grid2 = np.meshgrid(
    np.arange(start=seq1[:, 0].min() - 1, stop=seq1[:, 0].max() + 1, step=0.01),
    np.arange(start=seq1[:, 1].min() - 1, stop=seq1[:, 1].max() + 1, step=0.01))
plt.contourf(grid1, grid2,
             model.predict(np.array([grid1.ravel(), grid2.ravel()]).T).reshape(grid1.shape),
             alpha=0.75, cmap=ListedColormap(('lightblue', 'peachpuff')))
plt.xlim(grid1.min(), grid1.max())
plt.ylim(grid2.min(), grid2.max())
for key, value in enumerate(np.unique(seq2)):
    plt.scatter(seq1[seq2 == value, 0], seq1[seq2 == value, 1],
                c=ListedColormap(('mediumturquoise', 'lightsalmon'))(key), label=value)
plt.title('Test set of random forest')
plt.xlabel('Age')
plt.ylabel('Estimated salary')
plt.legend()
plt.savefig('output/2_testing.png')
plt.show()
```
In conclusion, the random forest algorithm effectively handles complex datasets and makes accurate predictions. When using random forest, it's essential to experiment with different hyperparameters and criteria to achieve the best performance for a specific task.
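As a starting point for such experiments, a minimal hyperparameter search over the training data could look like this (the grid values are illustrative, not recommendations):

```python
from sklearn.model_selection import GridSearchCV
from sklearn.ensemble import RandomForestClassifier

param_grid = {
    'n_estimators': [10, 50, 100],
    'criterion': ['gini', 'entropy'],
    'max_depth': [None, 5, 10],
}
search = GridSearchCV(RandomForestClassifier(random_state=0),
                      param_grid, cv=5, scoring='accuracy')
search.fit(label1_train, label2_train)
print(search.best_params_, search.best_score_)
```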