Clinical deployment of machine learning models introduces an important trust problem. Adversarial perturbations can silently manipulate predictions, so clinicians who rely on SHAP explanations to interpret those predictions may be misled not just by the outcome but by the reasoning presented to them. While adversarial vulnerability at the prediction level is well documented, its effect on SHAP-based explanations and clinical interpretability has received limited systematic study. This research examines the stability of SHAP explanations under adversarial perturbations. We focus on three models: a Random Forest classifier for diabetes and Logistic Regression models for heart disease and stroke. Under clinically plausible constraints, we apply two attack methods: HopSkipJump for the Random Forest and a greedy constrained search in the raw feature space for the linear models. We also test two defensive strategies against these attacks: adversarial training and inference-time winsorization. Our results show that successful attacks exploit high-mutability inputs rather than globally important ones, and that they produce changes in SHAP attributions that alter the most important feature shown to clinicians, as illustrated for the diabetes model. Adversarial training eliminated all tested attacks for all three models. Winsorization protected only against the bound-saturating attacks on the linear models and failed completely against the distributed, low-magnitude perturbations used to attack the Random Forest. This reflects a substantial asymmetry in defense coverage that should be considered when implementing explanation-based clinical systems. Targeted explanation evasion under post-defense conditions also proved highly difficult, consistent with the notion that clinical plausibility constraints reduce the capacity for active manipulation of explanations.
Samuel Zachary Karikari studied this question.
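The asymmetry in winsorization coverage reported in the abstract can be illustrated with a minimal sketch. This is not the study's implementation: the 1st/99th percentile bounds, the synthetic Gaussian data, and the helper names `fit_winsor_bounds` and `winsorize` are all assumptions made for illustration. The sketch shows only the mechanism: clipping to training-data bounds undoes a perturbation that pushes one feature to an extreme, but leaves a small perturbation spread across all features untouched.

```python
import numpy as np


def fit_winsor_bounds(X_train, lower_pct=1.0, upper_pct=99.0):
    """Per-feature clipping bounds from training data.

    The percentile choices are hypothetical; the source does not state them.
    """
    lo = np.percentile(X_train, lower_pct, axis=0)
    hi = np.percentile(X_train, upper_pct, axis=0)
    return lo, hi


def winsorize(X, lo, hi):
    """Clip each feature into [lo, hi] at inference time."""
    return np.clip(X, lo, hi)


# Synthetic stand-in for standardized clinical features (assumption).
rng = np.random.default_rng(0)
X_train = rng.normal(size=(1000, 4))
lo, hi = fit_winsor_bounds(X_train)

x_clean = np.zeros(4)

# Bound-saturating perturbation: one feature pushed far outside the data range.
x_saturated = x_clean.copy()
x_saturated[0] = 10.0

# Distributed, low-magnitude perturbation: every feature nudged slightly.
x_distributed = x_clean + 0.05

x_saturated_defended = winsorize(x_saturated, lo, hi)
x_distributed_defended = winsorize(x_distributed, lo, hi)

# The extreme value is pulled back to the upper bound ...
print("saturated feature after clipping:", x_saturated_defended[0])
# ... while the small distributed shift passes through unchanged.
print("distributed shift unchanged:",
      np.allclose(x_distributed_defended, x_distributed))
```

Because the distributed perturbation never leaves the clipping interval, a defense of this kind has nothing to correct, which is consistent with the failure against the Random Forest attacks described above.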