Adversarial Attacks on Explanations
Explore how adversarial attacks target explanations in machine learning, focusing on techniques like LIME and SHAP. Understand how attackers can manipulate explanation outputs to mislead human interpretation and compromise trust in model decisions, and learn key mitigation concepts.
Like everything in ML, explainability has its own pitfalls. In academic circles, the largest controversy revolves around whether explainability actually contributes to understanding and auditing model decisions at all. Because the goal of explainable AI is to foster trust and security around the algorithm, we must be able to rely on the explanations these methods provide. Otherwise, we end up doubting both the model and its explanation.
Adversarial attacks on explainable models
We’ve discussed adversarial attacks on models already, but even the explanations of models can be manipulated. An adversary seeking to mislead or undermine human interpretation of an algorithm can attack explanations to make them useless or even incorrect. For example, exploits of LIME and SHAP take advantage of the slight input perturbations these methods feed to the black box.
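To make the idea concrete, here is a minimal sketch of how such an exploit can be structured: a "scaffolded" model detects whether an input looks like one of the synthetic perturbation points an explainer generates and, if so, routes it to a harmless-looking model. Everything here is illustrative; `biased_model`, `innocuous_model`, and `ood_detector` are assumed stand-ins (each with a `predict` method), not part of any real attack library.

```python
# Illustrative sketch of a "scaffolding" exploit against perturbation-based explainers.
# Assumption: an out-of-distribution (OOD) detector can separate real inputs from the
# synthetic points LIME/SHAP generate. All component names below are hypothetical.
import numpy as np


class ScaffoldedModel:
    """Wraps a biased model so that explainers see an innocuous one instead."""

    def __init__(self, biased_model, innocuous_model, ood_detector):
        self.biased_model = biased_model        # behavior on real-world inputs
        self.innocuous_model = innocuous_model  # behavior shown to the explainer
        self.ood_detector = ood_detector        # flags explainer perturbation samples

    def predict(self, X):
        X = np.asarray(X)
        # 1 = "looks like a synthetic perturbation point" (detector convention assumed here)
        is_synthetic = self.ood_detector.predict(X) == 1
        return np.where(
            is_synthetic,
            self.innocuous_model.predict(X),  # harmless answers on probe points
            self.biased_model.predict(X),     # real (biased) answers otherwise
        )
```

Because the explainer only ever queries the model on its own synthetic samples, the explanation it produces reflects the innocuous model, not the biased one actually used in deployment.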
LIME example
Recall that with LIME, a local explanation for a single decision is constructed by fitting a simple, interpretable model over nearby data points. It generates these nearby data points synthetically (i.e., they're not drawn from the training set; they're created separately as part of the LIME process), as sketched below.
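The sketch below shows this procedure in simplified form, assuming a tabular binary classifier exposing `predict_proba` and scikit-learn available. The sampling scheme and parameter choices (Gaussian noise scale, kernel width, number of samples) are illustrative rather than LIME's exact defaults.

```python
# Minimal sketch of LIME's local-surrogate idea for a single instance x.
import numpy as np
from sklearn.linear_model import Ridge


def lime_style_explanation(black_box, x, n_samples=1000, noise_scale=0.5, kernel_width=1.0):
    """Fit a proximity-weighted linear surrogate around a single instance x."""
    x = np.asarray(x, dtype=float)

    # 1. Generate synthetic neighbors by perturbing x with Gaussian noise.
    Z = x + noise_scale * np.random.randn(n_samples, x.shape[0])

    # 2. Query the black box on the synthetic points (probability of the positive class).
    y = black_box.predict_proba(Z)[:, 1]

    # 3. Weight each synthetic point by its proximity to x.
    distances = np.linalg.norm(Z - x, axis=1)
    weights = np.exp(-(distances ** 2) / (kernel_width ** 2))

    # 4. Fit an interpretable (linear) surrogate on the weighted sample.
    surrogate = Ridge(alpha=1.0)
    surrogate.fit(Z, y, sample_weight=weights)

    # The coefficients serve as the local explanation for the decision at x.
    return surrogate.coef_
```

The key point for the attack is in step 1: the black box is only ever evaluated on these synthetic neighbors, which an adversary may be able to distinguish from real data.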
Let’s call our original training set