Interpreting H2O Isolation Forest Model

Understand how to build explainable anomaly detection models with H2O.

Interpreting anomalies

There are two levels of interpretation:

  • Dataset level: High-level understanding of what segments of data are considered anomalous.

  • Record level: Understanding of why an individual record is considered anomalous.

We’ll start with the dataset level, where the goal is to understand which segments of the data are considered anomalous.

Dataset level

Once we have found the anomalies in our dataset, the next step is to understand why they are considered anomalies. To do this, we’ll train a decision tree on the anomaly flags. This turns the unsupervised problem into a supervised one: the tree learns which feature values, and which combinations of them, separate the anomalies from the rest of the data.

The decision tree’s task is to predict the anomaly flag. To do so, it groups similar anomalies into segments and learns the rules that separate them from non-anomalous records.

The steps for interpreting anomalies at the dataset level are:

  1. Create a target column that indicates whether the record was considered an anomaly.

  2. Train a decision tree to predict the anomaly flag.

  3. Visualize the decision tree to determine which segments of the data are considered anomalous.
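The idea behind steps 2 and 3 can be sketched with the smallest possible tree: a single split chosen by Gini impurity. This is a minimal illustration in plain Python, not the H2O workflow itself; the records, the feature names amount and hour, and the helper functions gini and best_split are all hypothetical and exist only to show how a tree turns anomaly flags into a readable rule.

```python
# Toy records with hypothetical features and an anomaly flag (1 = anomaly).
records = [
    {"amount": 20, "hour": 10, "is_anomaly": 0},
    {"amount": 35, "hour": 14, "is_anomaly": 0},
    {"amount": 50, "hour": 9, "is_anomaly": 0},
    {"amount": 42, "hour": 16, "is_anomaly": 0},
    {"amount": 900, "hour": 11, "is_anomaly": 1},
    {"amount": 1200, "hour": 15, "is_anomaly": 1},
]


def gini(labels):
    """Gini impurity of a list of 0/1 labels (0.0 means a pure group)."""
    if not labels:
        return 0.0
    p = sum(labels) / len(labels)
    return 2 * p * (1 - p)


def best_split(rows, features, target="is_anomaly"):
    """Return (feature, threshold, weighted_gini) for the best single split."""
    best = None
    for feature in features:
        values = sorted({row[feature] for row in rows})
        # Candidate thresholds: midpoints between consecutive distinct values.
        for lo, hi in zip(values, values[1:]):
            threshold = (lo + hi) / 2
            left = [r[target] for r in rows if r[feature] <= threshold]
            right = [r[target] for r in rows if r[feature] > threshold]
            score = (len(left) * gini(left) + len(right) * gini(right)) / len(rows)
            if best is None or score < best[2]:
                best = (feature, threshold, score)
    return best


feature, threshold, impurity = best_split(records, ["amount", "hour"])
print(f"Anomalous segment: {feature} > {threshold} (weighted Gini = {impurity:.3f})")
```

On this toy data the split lands on amount, which reads directly as a segment description: records with a large amount are the ones flagged as anomalous. A real decision tree repeats this search recursively inside each branch, which is what produces the multi-level segments that get visualized in step 3.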

In our first step, we’ll add a column called is_anomaly. This is a flag that indicates whether the isolation forest considered the record an anomaly.
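As a rough sketch of what this flag could look like, the snippet below marks the highest-scoring records as anomalies. It uses plain Python rather than H2O, and several things are assumptions made for illustration: the scores are made up, higher scores are taken to mean "more anomalous", and the 20% cutoff (the contamination parameter of the hypothetical flag_anomalies helper) is arbitrary. How the threshold is actually chosen depends on the model and the dataset.

```python
# Hypothetical anomaly scores, one per record (higher = more anomalous).
scores = [0.12, 0.08, 0.15, 0.91, 0.10, 0.11, 0.87, 0.09, 0.13, 0.14]


def flag_anomalies(scores, contamination=0.2):
    """Flag the top `contamination` share of scores as anomalies.

    Returns (flags, threshold), where flags[i] is 1 if scores[i] is at or
    above the threshold and 0 otherwise.
    """
    n_anomalies = max(1, round(len(scores) * contamination))
    threshold = sorted(scores, reverse=True)[n_anomalies - 1]
    flags = [1 if s >= threshold else 0 for s in scores]
    return flags, threshold


flags, threshold = flag_anomalies(scores)

# Attach the flag to each record as the is_anomaly column.
records = [
    {"score": s, "is_anomaly": f} for s, f in zip(scores, flags)
]

for record in records:
    print(record)
```

The resulting is_anomaly column is the target that the decision tree is trained to predict.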
