Adversarial Attacks on Explanations

Learn about the problems with explainability.

Like everything in ML, explainability has its own pitfalls. In academic circles, the largest controversy revolves around whether explainability actually contributes to understanding and decision auditing at all. Because the goal of explainable AI is to foster trust and security around the algorithm, we must be able to rely on the output that explanation methods provide to us. Otherwise, we end up doubting both the model and its explanation.

Adversarial attacks on explainable models

We’ve discussed adversarial attacks on models already, but even the explanations of models can be manipulated. An adversary seeking to mislead or undermine human interpretation of an algorithm can attack explanations to make them useless or even incorrect. For example, exploits of LIME and SHAP take advantage of the fact that both methods probe the black box with slightly perturbed inputs.

LIME example

Recall that with LIME, a local explanation for a single decision is constructed by building a model over nearby data points. It generates these nearby data points synthetically (i.e., they’re not taken directly from the training set; they’re created separately as part of the LIME process).

Let’s call our original training set $X$, our trained model $M_t$, and our decision point $x_0$. LIME takes $x_0$ and generates points using the distributions contained in $X$, which it then runs through $M_t$ to generate a local decision boundary.
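The sampling-and-fitting procedure described above can be sketched in a few lines of plain Python. This is a simplified illustration, not the LIME library's actual implementation: the helper names (`feature_stds`, `solve`, `local_surrogate`), the Gaussian sampling around the decision point, and the exponential proximity kernel are all assumptions made for the example.

```python
import math
import random


def feature_stds(X):
    """Per-feature standard deviation of the training set (1.0 if a feature is constant)."""
    n, d = len(X), len(X[0])
    means = [sum(row[j] for row in X) / n for j in range(d)]
    return [math.sqrt(sum((row[j] - means[j]) ** 2 for row in X) / n) or 1.0
            for j in range(d)]


def solve(A, b):
    """Solve the linear system A w = b by Gauss-Jordan elimination with partial pivoting."""
    n = len(b)
    M = [A[i][:] + [b[i]] for i in range(n)]
    for c in range(n):
        p = max(range(c, n), key=lambda r: abs(M[r][c]))
        M[c], M[p] = M[p], M[c]
        for r in range(n):
            if r != c:
                f = M[r][c] / M[c][c]
                M[r] = [a - f * v for a, v in zip(M[r], M[c])]
    return [M[i][n] / M[i][i] for i in range(n)]


def local_surrogate(model, x0, X, n_samples=500, kernel_width=1.0, seed=0):
    """Fit a proximity-weighted linear model around x0 and return its slopes."""
    rng = random.Random(seed)
    d = len(x0)
    stds = feature_stds(X)
    # 1. Generate synthetic neighbors of x0, scaled by the spread of each feature in X.
    Z = [[x0[j] + rng.gauss(0.0, stds[j]) for j in range(d)] for _ in range(n_samples)]
    # 2. Query the black box on the synthetic points.
    y = [model(z) for z in Z]
    # 3. Weight each synthetic point by its closeness to x0 (assumed kernel).
    w = [math.exp(-sum(((z[j] - x0[j]) / stds[j]) ** 2 for j in range(d)) / kernel_width ** 2)
         for z in Z]
    # 4. Weighted least squares via the normal equations (intercept in column 0).
    F = [[1.0] + z for z in Z]
    k = d + 1
    A = [[sum(wi * f[a] * f[b] for wi, f in zip(w, F)) for b in range(k)] for a in range(k)]
    rhs = [sum(wi * f[a] * yi for wi, f, yi in zip(w, F, y)) for a in range(k)]
    return solve(A, rhs)[1:]  # drop the intercept; slopes act as local importances


# Hypothetical training data and black box, for illustration only.
data_rng = random.Random(1)
X = [[data_rng.gauss(0, 1), data_rng.gauss(0, 2)] for _ in range(100)]


def M_t(x):
    return 3.0 * x[0] - 2.0 * x[1] + 1.0


slopes = local_surrogate(M_t, [0.5, -1.0], X)
```

Because the hypothetical black box here is itself linear, the surrogate's slopes should land very close to 3 and -2. The point to notice is step 2: the explanation depends entirely on how the model responds to points that were never in the training set.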

Now, let’s consider another model, $M_a$, that has been substituted for $M_t$ by an attacker. This model behaves the same as $M_t$, with one crucial difference: on data not directly in $X$, it changes its behavior. This attack takes advantage of LIME’s synthetic approach. If a point is recognized as synthetic, $M_a$ can return different outputs compared to what $M_t$ would produce, and it would be unnoticeable because there’s no ground truth recorded in $X$ for synthetic points.
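A minimal sketch of such a substituted model follows. It assumes the crudest possible detector, an exact membership check against the training set, and two hypothetical models: `m_t`, which relies on feature 0, and `m_decoy`, which relies on feature 1. A realistic attacker would more likely need a learned detector to flag synthetic points; the names and thresholds here are illustrative only.

```python
import random


def make_adversarial_model(m_t, m_decoy, X):
    """Return a model that matches m_t on rows of X and switches to m_decoy elsewhere."""
    known = {tuple(row) for row in X}  # assumed detector: exact membership in X

    def m_a(x):
        return m_t(x) if tuple(x) in known else m_decoy(x)

    return m_a


# Hypothetical models, for illustration only.
def m_t(x):
    return 1.0 if x[0] > 0.5 else 0.0  # real behavior: driven by feature 0


def m_decoy(x):
    return 1.0 if x[1] > 0.5 else 0.0  # decoy behavior: driven by feature 1


rng = random.Random(0)
X = [[rng.random(), rng.random()] for _ in range(200)]
m_a = make_adversarial_model(m_t, m_decoy, X)

# On every recorded training point, the substituted model matches the original.
agree_on_X = all(m_a(x) == m_t(x) for x in X)

# On perturbed (synthetic) points, it answers with the decoy instead.
synthetic = [[x[0] + rng.gauss(0, 0.1), x[1] + rng.gauss(0, 0.1)] for x in X]
follows_decoy = all(m_a(z) == m_decoy(z) for z in synthetic)
agree_on_synth = sum(m_a(z) == m_t(z) for z in synthetic) / len(synthetic)
```

Any check against the training set passes, yet an explanation built from the synthetic points would describe the decoy and attribute the decision to feature 1 rather than feature 0.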
