We address the "black-box problem" in LLMs by tracing outputs to the behavior of their internal states in a way that is stable, causal, and trajectory-aware. Existing attribution methods (IG, SHAP, attention weights) analyze single forward passes, ignore trajectory multiplicity, lack stability under variation, and lack reverse probabilistic admissibility. We introduce Reverse Markov Chains (RMC), a post-hoc framework that integrates Integrated Gradients (local sensitivity), L3-Shapley values (coalitional causality), and reverse posterior weighting (trajectory plausibility). We show that reverse posterior weighting stabilizes attribution across multiple forward trajectories that yield identical outputs. Theoretical guarantees follow from axiomatic IG sensitivity and L3-Shapley admissibility under an SCM approximation.
Gabiro Arnauld (Sat,) studied this question.
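For readers unfamiliar with the Integrated Gradients component named in the abstract, the following is a minimal sketch of standard IG on a toy function. It is not the RMC framework or the authors' implementation: the toy `model`, its analytic `grad`, the zero baseline, and the step count are all assumptions chosen for illustration. It shows the completeness axiom, under which attributions sum to the difference between the model output at the input and at the baseline.

```python
import math

def model(x):
    """Toy differentiable scalar model (hypothetical): F(x) = x0*x1 + sin(x2)."""
    return x[0] * x[1] + math.sin(x[2])

def grad(x):
    """Analytic gradient of the toy model above."""
    return [x[1], x[0], math.cos(x[2])]

def integrated_gradients(x, baseline, steps=1000):
    """Approximate IG with a midpoint Riemann sum along the straight-line path
    from `baseline` to `x`."""
    n = len(x)
    avg_grad = [0.0] * n
    for k in range(steps):
        alpha = (k + 0.5) / steps  # midpoint of the k-th sub-interval of [0, 1]
        point = [b + alpha * (xi - b) for xi, b in zip(x, baseline)]
        g = grad(point)
        for i in range(n):
            avg_grad[i] += g[i] / steps
    # Scale the path-averaged gradient by the input displacement.
    return [(xi - b) * a for xi, b, a in zip(x, baseline, avg_grad)]

x = [1.0, 2.0, 0.5]
baseline = [0.0, 0.0, 0.0]  # assumed all-zero baseline
attributions = integrated_gradients(x, baseline)

# Completeness: the attributions should sum to F(x) - F(baseline).
completeness_gap = abs(sum(attributions) - (model(x) - model(baseline)))
print(attributions, completeness_gap)
```

In a real LLM setting the gradient would come from automatic differentiation over embeddings or hidden states rather than a hand-written `grad`, and the choice of baseline materially affects the attributions.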