Isolation Forest is an unsupervised technique used to detect anomalies in a dataset. It works by repeatedly splitting the data points until each observation is isolated.
The algorithm builds a forest of random trees: at each node, a feature is selected at random, and a random split value is then chosen from that feature's range. The number of splits required to isolate a data point corresponds to its path length in the tree; anomalies are easier to isolate, so they have shorter average path lengths.
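To make this concrete, here is a minimal sketch (not part of the original walkthrough) that fits scikit-learn's IsolationForest on a small sample and compares the scores of an obvious outlier and an inlier. score_samples returns lower values for points that the trees isolate in fewer splits.

import numpy as np
from sklearn.ensemble import IsolationForest

rng = np.random.RandomState(0)
inliers = rng.randn(100, 2)                # points clustered around the origin
data = np.vstack([inliers, [[6.0, 6.0]]])  # one obvious outlier

clf = IsolationForest(random_state=0).fit(data)

# Lower scores indicate points that are easier to isolate, i.e., anomalies
print("outlier score:", clf.score_samples([[6.0, 6.0]])[0])
print("inlier score:", clf.score_samples([[0.0, 0.0]])[0])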
Let's write code step by step that generates sample data, trains a model on it, and then creates a scatter plot with decision boundaries.
Before starting the code, let's understand the modules we must import and how they are used.
import numpy as np
import matplotlib.pyplot as plt
from sklearn.model_selection import train_test_split
from sklearn.inspection import DecisionBoundaryDisplay
from sklearn.ensemble import IsolationForest
numpy: To handle data arrays and perform numerical operations.
matplotlib.pyplot: To create and customize data visualizations, including various types of plots.
sklearn.model_selection: To split the data into training and testing datasets.
sklearn.inspection: To visualize decision boundaries and analyze model behavior, such as how the data points are separated.
sklearn.ensemble: To build an ensemble machine-learning model for detecting anomalies. In this case, we use IsolationForest.
We create an rng object from numpy.random to generate the sample data by randomly sampling the standard normal distribution.
rng.randn is used to create the clusters of inliers, which are assigned the ground truth label 1.
rng.uniform is used to create the outliers, which are assigned the ground truth label -1.
Take a look at clusterOne to understand the formula used to generate a cluster. We multiply the standard deviation, 0.4, by a randomly generated array of samplesCount rows and 2 feature columns. We then take its matrix product with covarianceMatrix and add [2, 2] to each data point to shift the cluster to a new location along the x- and y-axes.
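As a quick illustration (a sketch using the same formula, not part of the tutorial's listing), applying these steps to a handful of points shows how the matrix product reshapes the cluster and how adding [2, 2] shifts its center:

import numpy as np

rng = np.random.RandomState(0)
covarianceMatrix = np.array([[0.5, -0.1], [0.7, 0.4]])

points = 0.4 * rng.randn(5, 2)  # 5 samples with 2 features, scaled by 0.4
shifted = points @ covarianceMatrix + np.array([2, 2])

print(points.mean(axis=0))   # near [0, 0]
print(shifted.mean(axis=0))  # near [2, 2] after the shift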
In the following code, we generate the data, split it into training and test sets, and plot a simple scatter plot to display the data points.
import numpy as np
import matplotlib.pyplot as plt
from sklearn.model_selection import train_test_split

samplesCount, outliersCount = 150, 50
rng = np.random.RandomState(0)

covarianceMatrix = np.array([[0.5, -0.1], [0.7, 0.4]])

clusterOne = 0.4 * rng.randn(samplesCount, 2) @ covarianceMatrix + np.array([2, 2])
clusterTwo = 0.3 * rng.randn(samplesCount, 2) + np.array([-2, -2])
outliers = rng.uniform(low=-4, high=4, size=(outliersCount, 2))

X = np.concatenate([clusterOne, clusterTwo, outliers])
Y = np.concatenate(
    [np.ones((2 * samplesCount), dtype=int), -np.ones((outliersCount), dtype=int)]
)

X_train, X_test, Y_train, Y_test = train_test_split(X, Y, stratify=Y, random_state=42)

scatter = plt.scatter(X[:, 0], X[:, 1], c=Y, s=20, edgecolor="green")
handles, labels = scatter.legend_elements()
plt.axis("square")
plt.legend(handles=handles, labels=["outliers", "inliers"], title="true class")
plt.title("Gaussian inliers with \nuniformly distributed outliers")

plt.show()
Lines 1–3: Import the required methods and libraries.
Line 5: Create two variables, samplesCount and outliersCount, and save the count for each, respectively.
Line 6: Create and initialize a random number generator object, rng, using NumPy's random module.
Line 8: Create a covarianceMatrix using array() from NumPy, passing the value arrays as parameters.
Lines 10–12: Create two cluster arrays and one outlier array that store the data points. We create them using NumPy to generate random data.
Line 14: Create a NumPy array X and form a combined dataset using concatenate(), passing the cluster and outlier arrays as parameters.
Line 15: Create a NumPy label array Y and form a combined labels dataset using concatenate(), passing the inlier and outlier label arrays as parameters.
Note: The inliers array is an array of ones with length equal to 2 * samplesCount. The outliers array is an array of negative ones with length equal to outliersCount.
Line 19: Use train_test_split() to split the data into training and testing subsets, storing the results in X_train, X_test, Y_train, and Y_test.
Lines 21–25: Create a scatter plot for the generated dataset and specify its properties to customize it according to our requirements.
Line 27: Use show() to display the created plot.
A scatter plot is created that shows the clusters in yellow, representing the inliers, and sparse data points in purple, representing the outliers.
We import the IsolationForest class from the ensemble module and use its instance to train the model. We specify the maximum number of samples that should be used while building each decision tree in the IsolationForest. Then we pass the training data, X_train, to the model using the fit() method. The training samples are used to construct the individual isolation trees of the forest.
The code snippet below shows how this is implemented.
from sklearn.ensemble import IsolationForest

clf = IsolationForest(max_samples=100, random_state=0)
clf.fit(X_train)
Once the model is trained, we can analyze the anomaly scores and identify the anomalies as the data points with the lowest scores.
Note: Data points with a low anomaly score are considered anomalous.
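For instance, we can rank the held-out test points by their scores. This is a minimal sketch (assuming the clf, X_test, and imports defined above); score_samples and predict are standard IsolationForest methods:

# Anomaly scores: lower values indicate more anomalous points
scores = clf.score_samples(X_test)

# predict() converts the scores into labels: 1 for inliers, -1 for outliers
labels = clf.predict(X_test)

# Inspect the five most anomalous test points (lowest scores)
mostAnomalous = np.argsort(scores)[:5]
print(X_test[mostAnomalous])
print(labels[mostAnomalous])  # expected to be mostly -1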
Decision boundaries are the split points used to partition the data into different regions, isolating the anomalous data points from the normal data. To plot them, we require the trained collection of isolation trees, which have the data split into subsets. The feature values and the selected response method determine the decision boundaries.
The DecisionBoundaryDisplay class is used to visualize the decision boundaries in the plots.
Let's create decision boundaries for the generated data above using two different types:
Discrete
Path length
The discrete decision boundary uses the predict response method, where the background color represents whether a sample in that area is predicted to be an inlier or an outlier. Each class region is shown in a distinct color.
We reuse the previously generated data, add the model-training code, and then create a scatter plot with the discrete decision boundary drawn over it.
import numpy as np
import matplotlib.pyplot as plt
from sklearn.model_selection import train_test_split
from sklearn.inspection import DecisionBoundaryDisplay
from sklearn.ensemble import IsolationForest

# Data generation explained in the previous step
samplesCount, outliersCount = 150, 50
rng = np.random.RandomState(0)

covarianceMatrix = np.array([[0.5, -0.1], [0.7, 0.4]])

clusterOne = 0.4 * rng.randn(samplesCount, 2) @ covarianceMatrix + np.array([2, 2])
clusterTwo = 0.3 * rng.randn(samplesCount, 2) + np.array([-2, -2])
outliers = rng.uniform(low=-4, high=4, size=(outliersCount, 2))

X = np.concatenate([clusterOne, clusterTwo, outliers])
Y = np.concatenate(
    [np.ones((2 * samplesCount), dtype=int), -np.ones((outliersCount), dtype=int)]
)

X_train, X_test, Y_train, Y_test = train_test_split(X, Y, stratify=Y, random_state=42)

# Train model
clf = IsolationForest(max_samples=100, random_state=0)
clf.fit(X_train)

# Plot discrete decision boundaries
display = DecisionBoundaryDisplay.from_estimator(
    clf,
    X,
    response_method="predict",
    alpha=0.5,
)

scatter = plt.scatter(X[:, 0], X[:, 1], c=Y, s=20, edgecolor="green")
handles, labels = scatter.legend_elements()

display.ax_.scatter(X[:, 0], X[:, 1], c=Y, s=20, edgecolor="green")
display.ax_.set_title("Plot discrete decision boundary \nof IsolationForest")
plt.axis("square")
plt.legend(handles=handles, labels=["outliers", "inliers"], title="true class")

plt.show()
Lines 29–33: Create a DecisionBoundaryDisplay object and assign it to the display variable. Pass the following as parameters:
clf: The IsolationForest model instance that was trained above.
X: The feature data over which the decision boundary is plotted.
response_method: Specifies which method is used to get the model response. In this case, we use predict, which labels the data points as either 1 or -1 for inliers and outliers, respectively.
alpha: Sets the transparency of the boundary.
Lines 36–37: Create a scatter variable and define the plot using plt.scatter(), passing the properties as parameters:
X[:, 0]: The first column of the X array, containing the x-axis values.
X[:, 1]: The second column of the X array, containing the y-axis values.
c: The color of each scatter point, assigned according to the labels in the Y array.
s: The size of each marker in the scatter plot, i.e., 20 in this case.
edgecolor: The edge color of each marker, green in this case.
Lines 39–42: Create a scatter plot for the generated dataset on the display axes and customize it using display.ax_:
display.ax_.scatter(): Creates the scatter plot for the data points.
display.ax_.set_title(): Gives a suitable title to the scatter plot.
Line 44: Use show() to display the created plot.
A scatter plot is created that shows the clusters in yellow, representing the inliers, sparse data points in purple, representing the outliers, and a colored background indicating the predicted class of each region.
The path length decision boundary uses the decision_function response method, where the background color represents the measure of normality of an observation. The score is calculated by averaging the path lengths over the trees in the forest: when the random trees consistently isolate a sample with short path lengths, that sample can be detected as an anomaly.
Samples that have a measure of normality close to 0 are considered outliers.
Samples that have a measure of normality close to 1 are considered inliers.
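As a brief sketch of how this measure appears in scikit-learn (reusing the clf and X defined in the code above): decision_function centers the normality measure so that 0 becomes the decision threshold, with negative values indicating outliers and positive values indicating inliers.

# Scores near a cluster center should be positive (inlier region),
# while scores far from both clusters should be negative (outlier region)
print("inlier region score:", clf.decision_function([[2.0, 2.0]])[0])
print("outlier region score:", clf.decision_function([[4.0, -4.0]])[0])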
We use the previously generated data, add the model training code, and then create a scatter plot using the path length decision boundary.
import numpy as np
import matplotlib.pyplot as plt
from sklearn.model_selection import train_test_split
from sklearn.inspection import DecisionBoundaryDisplay
from sklearn.ensemble import IsolationForest

# Data generation explained in the previous step
samplesCount, outliersCount = 150, 50
rng = np.random.RandomState(0)

covarianceMatrix = np.array([[0.5, -0.1], [0.7, 0.4]])

clusterOne = 0.4 * rng.randn(samplesCount, 2) @ covarianceMatrix + np.array([2, 2])
clusterTwo = 0.3 * rng.randn(samplesCount, 2) + np.array([-2, -2])
outliers = rng.uniform(low=-4, high=4, size=(outliersCount, 2))

X = np.concatenate([clusterOne, clusterTwo, outliers])
Y = np.concatenate(
    [np.ones((2 * samplesCount), dtype=int), -np.ones((outliersCount), dtype=int)]
)

X_train, X_test, Y_train, Y_test = train_test_split(X, Y, stratify=Y, random_state=42)

# Train model
clf = IsolationForest(max_samples=100, random_state=0)
clf.fit(X_train)

# Plot path length decision boundaries
display = DecisionBoundaryDisplay.from_estimator(
    clf,
    X,
    response_method="decision_function",
    alpha=0.5,
)

scatter = plt.scatter(X[:, 0], X[:, 1], c=Y, s=20, edgecolor="green")
handles, labels = scatter.legend_elements()

display.ax_.scatter(X[:, 0], X[:, 1], c=Y, s=20, edgecolor="green")
display.ax_.set_title("Plot path length decision boundary \nof IsolationForest")
plt.axis("square")
plt.legend(handles=handles, labels=["outliers", "inliers"], title="true class")
plt.colorbar(display.ax_.collections[1])

plt.show()
Lines 29–33: Create a DecisionBoundaryDisplay object and assign it to the display variable. Pass the following as parameters:
clf: The IsolationForest model instance that was trained above.
X: The feature data over which the decision boundary is plotted.
response_method: Specifies which method is used to get the model response. In this case, we use decision_function, which provides the data points' anomaly scores used for plotting the decision boundaries.
alpha: Sets the transparency of the boundary.
Lines 36–37: Create a scatter variable and define the plot using plt.scatter(), passing the properties as parameters:
X[:, 0]: The first column of the X array, containing the x-axis values.
X[:, 1]: The second column of the X array, containing the y-axis values.
c: The color of each scatter point, assigned according to the labels in the Y array.
s: The size of each marker in the scatter plot, i.e., 20 in this case.
edgecolor: The edge color of each marker, green in this case.
Lines 39–43: Create a scatter plot for the generated dataset on the display axes and customize it using display.ax_:
display.ax_.scatter(): Creates the scatter plot for the data points.
display.ax_.set_title(): Gives a suitable title to the scatter plot.
display.ax_.collections[]: Accesses the plot elements so that the color bar can be attached to them using colorbar().
Line 45: Use show() to display the created plot.
A scatter plot is created that shows the clusters in yellow, representing the inliers, sparse data points in purple, representing the outliers, and a shaded background with a color bar indicating each region's measure of normality.
The Isolation Forest technique is widely used to detect anomalies in real-life domains that are safety-critical and have low fault tolerance.
Cybersecurity: It can be used to detect intrusions that threaten a system, identifying unusual network activity that may signal a potential security breach and raising an alert.
Healthcare: It can be used to detect abnormalities in a patient's body or reports, flagging unusual patterns in a person's vitals or behavior that indicate a problem. It can also help analyze lab reports to identify disease.
Climate sector: It can be used in environmental data monitoring to detect climate changes, identifying anomalous shifts in temperature, water, or air quality that can inform climate predictions.
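As a toy illustration of such a use case (entirely hypothetical data, sketched with the same scikit-learn API used above), an Isolation Forest can flag unusual readings in a stream of sensor measurements:

import numpy as np
from sklearn.ensemble import IsolationForest

rng = np.random.RandomState(42)

# Hypothetical hourly temperature readings: mostly normal, with a few injected spikes
readings = rng.normal(loc=21.0, scale=0.5, size=(200, 1))
readings[[20, 75, 160]] = [[35.0], [2.0], [30.0]]  # injected anomalies

detector = IsolationForest(contamination=0.02, random_state=42).fit(readings)
flags = detector.predict(readings)  # -1 marks anomalies

print("flagged indices:", np.where(flags == -1)[0])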
Isolation Forest is an effective unsupervised technique for detecting anomalies in a dataset and presenting them in a scatter plot. We can implement it using the following steps:
Generate data that can be split into test and training data.
Train the model to identify anomalies in the datasets.
Plot the decision boundaries that can be used to identify the inliers and outliers. We can use discrete and path-length decision boundaries, which differ in their response method.
|                 | Discrete             | Path length                   |
| response_method | predict              | decision_function             |
| colorbar        | Does not require it. | Pass the plot elements to it. |
Outliers in path length decision boundary: measure of normality close to 0
Inliers in path length decision boundary: measure of normality close to 1
Inliers in data generation: ground truth label is assigned as 1
Outliers in data generation: ground truth label is assigned as -1