Outlier detection with Isolation Forest

Isolation Forest is an unsupervised technique used to detect anomalies in a dataset. In this technique, the data points are repeatedly split until individual observations are isolated, and the number of splits needed to isolate each point is recorded.

How does it work?

We use a forest of random trees in which a feature is selected randomly, and a random split value is then chosen from that feature's range. The number of splits required to isolate a sample is equivalent to the length of the path from the root to the terminating node; this path length serves as the measure of normality and as our decision function. Random partitioning produces noticeably shorter paths for anomalous samples. Hence, when a short path length is noticed for a sample, it can be identified as an anomaly.
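To see this idea in action before the full walkthrough, here is a minimal sketch (the data and variable names are illustrative, not part of the main example) showing that an obvious outlier receives a noticeably lower score, reflecting a shorter average path length, than points inside a cluster:

import numpy as np
from sklearn.ensemble import IsolationForest

rng = np.random.RandomState(0)

# A tight cluster of inliers plus one obvious outlier
inliers = rng.randn(100, 2)
outlier = np.array([[8.0, 8.0]])
X = np.concatenate([inliers, outlier])

clf = IsolationForest(random_state=0).fit(X)

# score_samples is derived from the average path length across the trees:
# the lower the score, the sooner the point was isolated (more anomalous)
print(clf.score_samples(inliers[:3]))  # scores for a few inliers
print(clf.score_samples(outlier))      # noticeably lower score for the outlier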

The sample with the shortest path length is considered an outlier.

How to implement this understanding?

Let's write code step by step that generates sample data, trains a model on it, and then creates a scatter plot with decision boundaries.

Before starting the code, let's understand the modules we must import and how they are used.

Required imports

import numpy as np
import matplotlib.pyplot as plt
from sklearn.model_selection import train_test_split
from sklearn.inspection import DecisionBoundaryDisplay
from sklearn.ensemble import IsolationForest
  • numpy: To handle data arrays and perform numerical operations.

  • matplotlib.pyplot: To create and customize data visuals, including various types of plots.

  • sklearn.model_selection: To split the data into training and testing data sets.

  • sklearn.inspection: To visualize decision boundaries and analyze model behavior, such as how the model separates the data points.

  • sklearn.ensemble: To build an ensemble machine-learning model for detecting anomalies. In this case, we use IsolationForest.

Step 1: Generate data

We create an rng object using numpy.random.RandomState to generate the sample data by sampling from the standard normal distribution.

  • rng.randn is used to create the inlier clusters, and each inlier is assigned the ground truth label 1.

  • rng.uniform is used to create the outliers, and each outlier is assigned the ground truth label -1.

Take a look at clusterOne to understand the formula used to generate a cluster. We scale a randomly generated array of shape (samplesCount, 2) by the standard deviation, i.e., 0.4, take its dot product with the covarianceMatrix to correlate the two features, and then add [2, 2] to each data point to shift the cluster to a new location along the x and y axes.
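As a quick check of this formula, here is a small sketch (using the same variable names as the example below) that confirms the cluster's shape and that its mean is shifted to roughly (2, 2):

import numpy as np

rng = np.random.RandomState(0)
covarianceMatrix = np.array([[0.5, -0.1], [0.7, 0.4]])
samplesCount = 150

# Scale standard-normal samples by 0.4, correlate the two features with the
# covariance matrix, and shift the cluster center to (2, 2)
clusterOne = 0.4 * rng.randn(samplesCount, 2) @ covarianceMatrix + np.array([2, 2])

print(clusterOne.shape)          # (150, 2)
print(clusterOne.mean(axis=0))   # approximately [2, 2]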

Example code

In this code, we generate the data, split it into training and test sets, and plot a simple scatter plot to display the data points.

import numpy as np
import matplotlib.pyplot as plt
from sklearn.model_selection import train_test_split

samplesCount, outliersCount = 150, 50
rng = np.random.RandomState(0)

covarianceMatrix = np.array([[0.5, -0.1], [0.7, 0.4]])

clusterOne = 0.4 * rng.randn(samplesCount, 2) @ covarianceMatrix + np.array([2, 2])  
clusterTwo = 0.3 * rng.randn(samplesCount, 2) + np.array([-2, -2])  
outliers = rng.uniform(low=-4, high=4, size=(outliersCount, 2))

X = np.concatenate([clusterOne, clusterTwo, outliers])
Y = np.concatenate(
    [np.ones((2 * samplesCount), dtype=int), -np.ones((outliersCount), dtype=int)]
)

X_train, X_test, Y_train, Y_test = train_test_split(X, Y, stratify=Y, random_state=42)

scatter = plt.scatter(X[:, 0], X[:, 1], c=Y, s=20, edgecolor="green")
handles, labels = scatter.legend_elements()
plt.axis("square")
plt.legend(handles=handles, labels=["outliers", "inliers"], title="true class")
plt.title("Gaussian inliers with \nuniformly distributed outliers")

plt.show()

Code explanation

  • Lines 1–3: Import the required method and libraries.

  • Line 5: Create two variables, samplesCount and outliersCount, to store the number of samples and outliers, respectively.

  • Line 6: Create and initialize a random number generator object, rng, using RandomState from NumPy's random module.

  • Line 8: Create a covarianceMatrix using array() from NumPy and pass the value arrays as parameters.

  • Lines 10–12: Create two cluster arrays and one outlier array that store the data points. We create them using NumPy to generate random data.

  • Line 14: Create a NumPy array X and form a combined dataset using concatenate() and passing the cluster and outlier arrays as parameters.

  • Line 15: Create a NumPy label array Y and form the combined label dataset using concatenate(), passing the inlier and outlier label arrays as parameters.

Note: The inliers label array is an array of ones of length 2 * samplesCount, and the outliers label array is an array of negative ones of length outliersCount.

  • Line 19: Use train_test_split() to split the data into training and testing subsets and assign the results to X_train, X_test, Y_train, and Y_test.

  • Lines 21–25: Create a scatter plot for the generated dataset and specify its properties to customize it according to our requirements.

  • Line 27: Use show() to display the created plot.

Code output

A scatter plot is created that shows the clusters in yellow representing the inliers and the sparse data points in purple representing the outliers.

Scatter plot of the generated sample data.

Step 2: Train model

We import the IsolationForest class from the ensemble module and use an instance of it to train the model. We specify the maximum number of samples used while building each tree in the IsolationForest. Then we pass the training data, X_train, to the model using the fit() method. The training samples are used to construct the individual isolation trees of the forest.

Below is the code snippet that shows how this is implemented.

from sklearn.ensemble import IsolationForest
clf = IsolationForest(max_samples=100, random_state=0)
clf.fit(X_train)

Once the model is trained, we can analyze the anomaly scores and flag the data points with the lowest scores as anomalies.

Note: Data points with a low anomaly score are considered anomalous.
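As a rough sketch of that inspection (continuing from the snippet above, so it assumes clf and X_test already exist), we can look at the predicted labels and anomaly scores for the test data:

# Predicted labels: 1 for inliers, -1 for outliers
labels = clf.predict(X_test)

# decision_function scores: the lower (more negative), the more anomalous
scores = clf.decision_function(X_test)

print(labels[:10])
print(scores[:10])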

Step 3: Plot the decision boundaries

Decision boundaries partition the data into different regions and separate the anomalous data points from the normal data. To plot them, we require the trained collection of isolation trees, which have split the data into subsets; the feature values and the selected response method then determine the decision boundaries.

The DecisionBoundaryDisplay class is used to visualize the decision boundaries in the plots.

Let's create decision boundaries for the generated data above using two different types:

  • Discrete

  • Path length

Discrete decision boundary

The discrete decision boundary uses the predict response method, where the background color represents the area in which a prediction is made about whether a sample is an inlier or an outlier. Each class region is shown in a distinct color.

The colored area shows where predictions have been made.

Example code

We use the previously generated data, add the model training code to it, and then create a scatter plot with the discrete decision boundary drawn over it.

import numpy as np
import matplotlib.pyplot as plt
from sklearn.model_selection import train_test_split
from sklearn.inspection import DecisionBoundaryDisplay
from sklearn.ensemble import IsolationForest

# Data generation explained in the previous step
samplesCount, outliersCount = 150, 50
rng = np.random.RandomState(0)

covarianceMatrix = np.array([[0.5, -0.1], [0.7, 0.4]])

clusterOne = 0.4 * rng.randn(samplesCount, 2) @ covarianceMatrix + np.array([2, 2])  
clusterTwo = 0.3 * rng.randn(samplesCount, 2) + np.array([-2, -2])  
outliers = rng.uniform(low=-4, high=4, size=(outliersCount, 2))

X = np.concatenate([clusterOne, clusterTwo, outliers])
Y = np.concatenate(
    [np.ones((2 * samplesCount), dtype=int), -np.ones((outliersCount), dtype=int)]
)

X_train, X_test, Y_train, Y_test = train_test_split(X, Y, stratify=Y, random_state=42)

#Train model
clf = IsolationForest(max_samples=100, random_state=0)
clf.fit(X_train)

#Plot discrete decision boundaries
display = DecisionBoundaryDisplay.from_estimator(
    clf,
    X,
    response_method="predict",
    alpha=0.5,
)

scatter = plt.scatter(X[:, 0], X[:, 1], c=Y, s=20, edgecolor="green")
handles, labels = scatter.legend_elements()

display.ax_.scatter(X[:, 0], X[:, 1], c=Y, s=20, edgecolor="green")
display.ax_.set_title("Plot discrete decision boundary \nof IsolationForest")
plt.axis("square")
plt.legend(handles=handles, labels=["outliers", "inliers"], title="true class")

plt.show()

Code explanation

  • Lines 29–33: Create a DecisionBoundaryDisplay object and assign it to the display variable. Pass the following as parameters to it:

    • clf: The IsolationForest model instance that is used to train the model.

    • X: The feature data that is used to train the model.

    • response_method: To specify which method is used to get the model response. In this case, we use predict, which returns the predicted label for each data point: 1 for inliers and -1 for outliers.

    • alpha: To set the transparency of the boundary.

  • Lines 36–37: Create a scatter variable by defining the plot with plt.scatter(), passing the properties below as parameters, and extract the legend handles and labels from it.

    • X[:, 0]: Choose the first column of the X array that contains x-axis values.

    • X[:, 1]: Choose the second column of the X array that contains y-axis values.

    • c: Specify the color of each scatter point based on the Y label array; the colors are assigned according to these labels.

    • s: Set the size of each marker in the scatter plot, i.e., 20 in this case.

    • edgecolor: Set the color of the border around each scatter marker to improve visibility, i.e., green in this case.

  • Lines 39–42: Overlay the scatter plot on the decision boundary display and customize it according to our requirements. Use display.ax_ to access and customize the display's axis:

    • display.ax_.scatter(): To create the scatter plot for the data points.

    • display.ax_.set_title(): To give a suitable title to the scatter plot.

  • Line 44: Use show() to display the created plot.

Code output

A scatter plot is created that shows the clusters in yellow representing the inliers, the sparse data points in purple representing the outliers, and a colored background indicating that a prediction has been made over the entire area.

A plot with discrete decision boundary.

Path length decision boundaries

The path length decision boundary uses the decision_function response method, where the background color represents the measure of normality of an observation. The score is calculated by averaging the path lengths over a forest of random trees. When the trees of a forest collectively need only short paths to isolate a sample, that sample can be detected as an anomaly.

  • Samples that have a measure of normality close to 0 are considered outliers.

  • Samples that have a measure of normality close to 1 are considered inliers.

Example code

We use the previously generated data, add the model training code, and then create a scatter plot using the path length decision boundary.

import numpy as np
import matplotlib.pyplot as plt
from sklearn.model_selection import train_test_split
from sklearn.inspection import DecisionBoundaryDisplay
from sklearn.ensemble import IsolationForest

# Data generation explained in the previous step
samplesCount, outliersCount = 150, 50
rng = np.random.RandomState(0)

covarianceMatrix = np.array([[0.5, -0.1], [0.7, 0.4]])

clusterOne = 0.4 * rng.randn(samplesCount, 2) @ covarianceMatrix + np.array([2, 2])  
clusterTwo = 0.3 * rng.randn(samplesCount, 2) + np.array([-2, -2])  
outliers = rng.uniform(low=-4, high=4, size=(outliersCount, 2))

X = np.concatenate([clusterOne, clusterTwo, outliers])
Y = np.concatenate(
    [np.ones((2 * samplesCount), dtype=int), -np.ones((outliersCount), dtype=int)]
)

X_train, X_test, Y_train, Y_test = train_test_split(X, Y, stratify=Y, random_state=42)

#Train model
clf = IsolationForest(max_samples=100, random_state=0)
clf.fit(X_train)

#Plot path length decision boundaries
display = DecisionBoundaryDisplay.from_estimator(
    clf,
    X,
    response_method="decision_function",
    alpha=0.5,
)

scatter = plt.scatter(X[:, 0], X[:, 1], c=Y, s=20, edgecolor="green")
handles, labels = scatter.legend_elements()

display.ax_.scatter(X[:, 0], X[:, 1], c=Y, s=20, edgecolor="green")
display.ax_.set_title("Plot path length decision boundary \nof IsolationForest")
plt.axis("square")
plt.legend(handles=handles, labels=["outliers", "inliers"], title="true class")
plt.colorbar(display.ax_.collections[1])

plt.show()

Code explanation

  • Lines 29–33: Create a DecisionBoundaryDisplay object and assign it to the display variable. Pass the following as parameters to it:

    • clf: The IsolationForest model instance that is used to train the model.

    • X: the feature data that is used to train the model.

    • response_method: To specify which method is used to get the model response. In this case, we use decision_function, which provides each data point's anomaly score, used for plotting the decision boundaries.

    • alpha: To set the transparency of the boundary.

  • Lines 36–37: Create a scatter variable by defining the plot with plt.scatter(), passing the properties below as parameters, and extract the legend handles and labels from it.

    • X[:, 0]: Choose the first column of the X array that contains x-axis values.

    • X[:, 1]: Choose the second column of the X array that contains y-axis values.

    • c: Specify the color of each scatter point based on the Y label array; the colors are assigned according to these labels.

    • s: Set the size of each marker in the scatter plot, i.e., 20 in this case.

    • edgecolor: Set the color of the border around each scatter marker to improve visibility, i.e., green in this case.

  • Lines 39–43: Overlay the scatter plot on the decision boundary display, customize it according to our requirements, and attach a color bar. Use display.ax_ to access and customize the display's axis:

    • display.ax_.scatter(): To create the scatter plot for the data points.

    • display.ax_.set_title(): To give a suitable title to the scatter plot.

    • display.ax_.collections[1]: To access the filled contour of the display so that a color bar can be attached to it using plt.colorbar().

  • Line 45: Use show() to display the created plot.

Code output

A scatter plot is created that shows the clusters in yellow representing the inliers and the sparse data points in purple representing the outliers, over a background gradient that encodes the measure of normality.

A plot with path length decision boundary.

Real-life applications

The isolation forest technique is widely used to detect anomalies in real-life domains that are safety critical and have low fault tolerance.

  • Cybersecurity: Can be used to detect intrusions that threaten a system. It can identify unusual activities in a network that may indicate a potential security breach and raise an alert.

  • Health care: Can be used to detect abnormalities in a patient's condition or reports. It can flag unusual patterns in a person's vitals and behavior that indicate a problem, and it can also analyze lab reports to help identify a disease.

  • Climate sector: Can be used in environmental data monitoring to detect climate changes. It can identify anomalous changes in temperature, water, or air quality that can feed into climate predictions.

Summary

Isolation Forest is an effective unsupervised technique for detecting anomalies in a dataset and presenting them in a scatter plot. We can implement it using the following steps:

  • Generate data that can be split into test and training data.

  • Train the model to identify anomalies in the datasets.

  • Plot the decision boundaries that can be used to identify the inliers and outliers. We can use discrete and path-length decision boundaries, which differ in their response method.

Discrete vs. path length

                     Discrete            Path length
response_method      predict             decision_function
colorbar             Not required        Pass the plot elements (display.ax_.collections[1]) to plt.colorbar()
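As an optional sketch combining the two approaches (it reuses clf, X, and Y from the examples above and relies on the ax parameter of DecisionBoundaryDisplay.from_estimator to draw on subplots), we can place both boundary types side by side for comparison:

import matplotlib.pyplot as plt
from sklearn.inspection import DecisionBoundaryDisplay

# One figure with two axes: discrete boundary on the left, path length on the right
fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(12, 5))

for ax, method, title in [
    (ax1, "predict", "Discrete decision boundary"),
    (ax2, "decision_function", "Path length decision boundary"),
]:
    DecisionBoundaryDisplay.from_estimator(
        clf, X, response_method=method, alpha=0.5, ax=ax
    )
    ax.scatter(X[:, 0], X[:, 1], c=Y, s=20, edgecolor="green")
    ax.set_title(title)

plt.show()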

Test your understanding

Match each item with the correct description:

  Items:

  • Outliers in path length decision boundary

  • Inliers in path length decision boundary

  • Inliers in data generation

  • Outliers in data generation

  Descriptions:

  • ground truth label is assigned as 1

  • measure of normality close to 0

  • ground truth label is assigned as -1

  • measure of normality close to 1


