This technical note presents a controlled three-family stress test on MMLU-Pro under admissible answer-interface perturbations (baseline, choice_shuffle, label_remap). Three model families are evaluated on a locked subset of 140 items spanning 14 categories. A methodological caveat affecting prediction comparability is explicitly identified and corrected through full prediction-space canonicalization with exact decoder recovery. The family-level perturbation signatures remain unchanged after canonicalization, while the raw prediction-level instability is partly reduced but not eliminated. The result is diagnostic and local in scope: under the tested setup, MMLU-Pro remains locally usable but exhibits interface-sensitive evaluative closure and limited global neutrality.
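To make the canonicalization step concrete, the following is a minimal sketch of how a raw predicted label could be decoded back to an original option index after choice_shuffle and label_remap. The note does not specify its procedure, so the function names, the seeding scheme, and the use of labels A through J are illustrative assumptions, not the author's implementation.

```python
import random
import string

# Assumption: options are labeled with the first ten capital letters (A-J).
LABELS = string.ascii_uppercase[:10]


def choice_shuffle(options, seed):
    """Shuffle the options; perm[new_pos] is the original index shown there."""
    rng = random.Random(seed)
    perm = list(range(len(options)))
    rng.shuffle(perm)
    return [options[i] for i in perm], perm


def label_remap(n_options, seed):
    """Return the label displayed at each position (a permutation of labels)."""
    rng = random.Random(seed)
    labels = list(LABELS[:n_options])
    rng.shuffle(labels)
    return labels


def canonicalize(predicted_label, shown_labels, perm):
    """Decode a raw predicted label to the original (canonical) option index."""
    pos = shown_labels.index(predicted_label)  # position the model pointed at
    return perm[pos]                           # original index at that position


if __name__ == "__main__":
    options = ["red", "green", "blue", "yellow"]  # hypothetical item
    shuffled, perm = choice_shuffle(options, seed=0)
    labels = label_remap(len(options), seed=1)
    for pos, label in enumerate(labels):
        original = canonicalize(label, labels, perm)
        # The decoded index always points at the same option text that was shown.
        print(label, shuffled[pos], "->", original, options[original])
```

Because both perturbations are stored as explicit permutations, the decoding is exact: predictions from all three conditions land in the same canonical index space before they are compared. The baseline condition corresponds to the identity permutation with labels in their usual order.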
Danilo Tavella
Danilo Tavella (Tue,) studied this question.
www.synapsesocial.com/papers/69cf5e115a333a821460c332 — DOI: https://doi.org/10.5281/zenodo.19353380