Faithful, Stable, Complete: Pick Two The Problem in Plain Language When a machine learning model makes a prediction — approving a loan, diagnosing a disease, flagging a transaction — practitioners use a tool called SHAP to answer "which input features mattered most?" SHAP is the most widely used explanation method in machine learning. Here is the problem: retrain the same model on the same data with a different random seed, and the explanation changes. The model's predictions barely move, but the "most important feature" can flip entirely. In 68% of 77 public datasets, the top feature is not stable across retrains. This is not a software bug. This is not fixable by tuning hyperparameters. We prove it is a mathematical impossibility. What We Prove No feature ranking can simultaneously be: Faithful — it reflects what the model actually learned Stable — it doesn't change when you retrain Complete — it ranks every pair of features …when features are correlated with similar importance. You must give up one. The proof is four lines long. It requires no assumptions about the model, the data, or the explanation method — only that correlated features admit models ranking them in opposite orders (the Rashomon property), which is true for every standard ML algorithm. How Bad Is It? We trained 50 XGBoost models on Breast Cancer Wisconsin — the dataset used in every SHAP tutorial — and counted how many different "top 3 most important features" appeared. Twenty-four. At 100 models: thirty-five. The "most common" answer appeared in only 12% of runs. Two randomly chosen models agree on the top-3 only 4.2% of the time. Every tutorial, textbook, and blog post showing SHAP on this dataset is showing one of two dozen equally valid answers. Three other datasets (California Housing, Heart Disease, Wine Quality) produce exactly one ranking every time — because their top features have clearly different importance. The theory correctly predicts which datasets are affected and which are safe. Dataset Distinct top-3 rankings (50 models) Two models agree? Breast Cancer 24 4.2% Diabetes 2 88.5% Wine Quality 1 100% (stable) Heart Disease 1 100% (stable) California Housing 1 100% (stable) It Gets Worse for Yes/No Questions For ranking questions (which feature is MORE important?), there is a fix: average across multiple models. But for binary questions — "does this feature contribute positively or negatively?", "is this feature selected?" — no fix exists. Even averaging doesn't help, because there's no middle ground between "positive" and "negative." We call this the bilemma. Real-World Consequences For loan applicants. We trained 30 models on German Credit data. Under standard settings, 45% of applicants receive a different "most important reason" for their decision depending on which model happens to be deployed. One applicant received six different top reasons across 30 models. For biomarker discovery. On a dataset of 10,935 genes distinguishing colon from kidney tissue, the "#1 most important gene" alternates between TSPAN8 (involved in tumor invasion) and CEACAM5/CEA (involved in immune evasion) depending on the random seed. A drug discovery pipeline targeting one gene makes a different bet than one targeting the other — and which bet gets made depends on a random number. For fairness audits. A SHAP-based audit checking whether a model relies on a protected attribute (like race or gender) reaches its conclusion with the reliability of a coin flip when the protected attribute is correlated with other features. The Fix DASH (Diversified Aggregation of SHAP): train 25 models with different seeds, average their SHAP values. This is provably the best possible approach — no method can do better. Features that genuinely differ in importance get stable rankings. Features that are interchangeable get reported as tied, which is the honest answer. We also provide a 7-line diagnostic that identifies which features are at risk, requiring no statistical expertise and no assumptions about the data distribution. It outperforms the standard formula by 2× on real data. The practical workflow: Screen your model (1 model, seconds) Run the minority fraction diagnostic (7 lines of code) For flagged features, train 5 models and run a Z-test If unstable, use DASH with 25+ models Machine Verification Every mathematical claim is checked by a computer. The proofs are written in Lean 4 (a programming language for mathematics) and verified by its type-checker: 357 theorems, all machine-verified 6 axioms (the minimal assumptions the theory needs) Zero unproved claims across 58 files During the formalization, the computer caught two logical errors and one type mismatch that human reviewers missed. To our knowledge, this is the first formally verified impossibility result in explainable AI. Technical Details Architecture-dependent bounds Gradient boosting (XGBoost, LightGBM): instability diverges as correlation increases. At ρ = 0.9, the dominant feature gets 5× its fair share. Lasso: the ratio is infinite — one correlated feature gets everything, the other gets zero. Neural networks: 87% of feature pairs are unstable. Model instability dominates SHAP estimation noise by 8:1. Random forests: instability converges with more trees — the contrast case showing that parallel (not sequential) training helps. Cross-implementation. XGBoost, LightGBM, and Random Forest all show the same instability pattern. It is not specific to any one software package. Subsample sensitivity. Even at subsample = 0.95 (minimal randomness), 17 distinct rankings remain. Only fully deterministic training (subsample = 1.0) produces one ranking — but this sacrifices the regularization that makes the model accurate. Mechanistic interpretability. Preliminary evidence suggests the impossibility extends beyond feature importance to neural network circuit analysis. 10 transformers trained on modular addition (all achieving 100% accuracy) agree on only 36% of the top-3 circuit components. Design Space The achievable set of explanation methods has exactly two families: Family A (single model): faithful and complete, but unstable. Rankings flip up to 50% of the time. This is what standard SHAP does. Family B (DASH ensemble): faithful and stable, but reports ties for indistinguishable features. This is what DASH does. No third option exists. DASH is provably the best method in Family B. Associated Papers Companion paper (TMLR, under review). First-Mover Bias in Gradient Boosting Explanations: Mechanism, Detection, and Resolution.arXiv: https://arxiv.org/abs/2603.22346DOI: https://doi.org/10.5281/zenodo.19446088 Companion implementation: https://github.com/DrakeCaraker/dash-shap
Caraker et al. (Thu,) studied this question.