What question did this study set out to answer?

This research aims to evaluate how adversarial perturbations affect the stability of SHAP explanations in health risk prediction models.

May 26, 2026 | Open Access

SHAP Explanation Instability Under Adversarial Perturbation in Health Risk Prediction Models

Key Points

Examined the stability of SHAP explanations under adversarial attack in diabetes, heart disease, and stroke risk models.
Applied the HopSkipJump attack to a Random Forest model and a greedy constrained search to Logistic Regression models.
Tested two defenses: adversarial training and inference-time winsorization.
Successful attacks shifted SHAP attributions enough to change the most important feature shown to clinicians, as illustrated for the diabetes model.
Adversarial training eliminated all tested attacks across all three models.
Winsorization protected only against bound-saturating attacks on the linear models and failed against the distributed, low-magnitude perturbations used on the Random Forest, revealing an asymmetry in defense coverage.
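
The winsorization asymmetry can be illustrated with a minimal sketch. It assumes that inference-time winsorization means clipping each incoming feature to percentile bounds learned from training data; the 1st/99th percentile cutoffs, the synthetic data, and the helper names `fit_winsor_bounds` and `winsorize` are illustrative assumptions, not details taken from the study.

```python
import numpy as np

def fit_winsor_bounds(X_train, lower_pct=1.0, upper_pct=99.0):
    """Learn per-feature clipping bounds (percentile cutoffs are assumed values)."""
    lo = np.percentile(X_train, lower_pct, axis=0)
    hi = np.percentile(X_train, upper_pct, axis=0)
    return lo, hi

def winsorize(X, lo, hi):
    """Clip every feature of the incoming samples to the learned bounds."""
    return np.clip(X, lo, hi)

rng = np.random.default_rng(0)
X_train = rng.normal(loc=0.0, scale=1.0, size=(1000, 3))  # synthetic stand-in data
lo, hi = fit_winsor_bounds(X_train)

x = np.zeros((1, 3))
# Perturbation A: drives a single feature to an extreme value.
x_saturating = x + np.array([[10.0, 0.0, 0.0]])
# Perturbation B: spreads small shifts across every feature.
x_distributed = x + np.array([[0.3, -0.3, 0.3]])

clipped_a = winsorize(x_saturating, lo, hi)
clipped_b = winsorize(x_distributed, lo, hi)

changed_a = not np.allclose(clipped_a, x_saturating)   # clipping altered the input
changed_b = not np.allclose(clipped_b, x_distributed)  # clipping left it untouched
print(changed_a, changed_b)
```

In this toy setting, clipping pulls the extreme value back to the learned upper bound, while the small distributed shifts stay inside the bounds and reach the model unchanged.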

Abstract

Clinical deployment of machine learning models introduces an important trust problem. Adversarial perturbations can silently manipulate predictions, so clinicians who rely on SHAP explanations to interpret those predictions may be misled not just by the outcome but also by the reasoning presented to them. While adversarial vulnerability at the prediction level is well documented, its effect on SHAP-based explanations and clinical interpretability has received limited systematic study. This research examines the stability of SHAP explanations when subjected to adversarial perturbations. We focus on three models: a Random Forest classifier for diabetes, and Logistic Regression models for heart disease and stroke. Under clinically plausible constraints, we apply two attack methods: HopSkipJump for the Random Forest, and a greedy constrained search in the raw feature space for the linear models. We also test two defensive strategies against these attacks: adversarial training and inference-time winsorization. Our results confirm that successful attacks exploit highly mutable inputs rather than globally important ones, and that they change SHAP attributions enough to alter the most important feature shown to clinicians, as illustrated for the diabetes model. Adversarial training eliminated all tested attacks for all three models. Winsorization protected only against the bound-saturating attacks on the linear models and failed completely against the distributed, low-magnitude perturbations used to attack the Random Forest. This reveals a substantial asymmetry in defense coverage that should be considered when deploying explanation-based clinical systems. Attempts at targeted explanation evasion under post-defense conditions also proved highly difficult, consistent with the notion that clinical plausibility constraints limit the capacity for active manipulation of explanations.
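
To make the attack mechanism concrete, the following is a hedged sketch of a greedy constrained search against a linear model, together with the resulting shift in attributions. It is not the study's implementation: the weights, feature order (age, glucose, BMI), bounds, step sizes, and the function names `linear_shap` and `greedy_attack` are all hypothetical. The attribution formula used, `w * (x - mu)`, is the closed-form SHAP value for a linear model on the log-odds scale when features are treated as independent.

```python
import numpy as np

def linear_shap(w, x, mu):
    """Closed-form SHAP values (log-odds scale) for a linear model, independent features."""
    return w * (x - mu)

def greedy_attack(w, b, x, lo, hi, step, mutable, max_iter=100):
    """Move one mutable feature per iteration to lower the logit, staying within [lo, hi]."""
    x_adv = x.copy()
    for _ in range(max_iter):
        if w @ x_adv + b < 0:              # prediction has flipped to low risk
            break
        best_gain, best_x = 0.0, None
        for i in np.flatnonzero(mutable):
            for direction in (-1.0, 1.0):
                cand = x_adv.copy()
                cand[i] = np.clip(cand[i] + direction * step[i], lo[i], hi[i])
                gain = w @ (x_adv - cand)  # reduction in the logit from this move
                if gain > best_gain:
                    best_gain, best_x = gain, cand
        if best_x is None:                 # no feasible move lowers the logit
            break
        x_adv = best_x
    return x_adv

# Hypothetical standardized features: [age, glucose, bmi]
w = np.array([0.8, 0.5, 0.3])
b = -0.2
mu = np.zeros(3)                           # assumed training means
x = np.array([1.0, 1.2, 0.5])
lo, hi = np.full(3, -2.0), np.full(3, 2.0)  # assumed plausibility bounds
step = np.array([0.0, 0.5, 0.5])
mutable = np.array([False, True, True])     # age cannot be perturbed

x_adv = greedy_attack(w, b, x, lo, hi, step, mutable)

phi_before = linear_shap(w, x, mu)
phi_after = linear_shap(w, x_adv, mu)
top_before = int(np.argmax(np.abs(phi_before)))
top_after = int(np.argmax(np.abs(phi_after)))
print(top_before, top_after)
```

In this toy example the feature with the largest weight (age) is immutable, so the search flips the prediction by moving glucose alone, and the top-ranked attribution changes from age to glucose. That mirrors the pattern described above, in which attacks exploit mutable inputs rather than globally important ones.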
