Test-time adaptation is increasingly used to describe systems that change their predictions, prompts, normalization statistics, retrieval context, inference policy, or model state during deployment. The attraction is clear: a model can meet a shifted environment without full retraining. The evidential difficulty is equally clear: a reported test-time gain can be caused by true online adaptation, hidden validation, leakage from target data, benchmark-specific tuning, compute inflation, unstable repeated runs, or degradation of previously learned capabilities. This paper introduces the Test-Time Adaptation Evidence Gap, the distance between apparent adaptation gain and the validated gain that remains after leakage, hidden validation, negative-control, transfer, retention, online-constraint, rollback, compute-normalization, drift, and repeat-stability gates are applied.
Tony Newton (Sun,) studied this question.