What question did this study set out to answer?

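This study examines how MMLU-Pro predictions and scores behave under admissible answer-interface perturbations (choice_shuffle and label_remap, compared against a baseline), and identifies and corrects a methodological caveat affecting prediction comparability.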

April 3, 2026 · Open Access

MMLU-Pro Under Admissible Interface Perturbations: A Three-Family Stress Test with Prediction-Space Canonicalization

Key Points

Ran a controlled stress test of three model families on MMLU-Pro
Evaluated three answer-interface conditions: baseline, choice_shuffle, and label_remap
Used a locked subset of 140 items spanning 14 categories
Applied full prediction-space canonicalization with exact decoder recovery to correct a methodological caveat affecting prediction comparability
Interface-sensitive evaluative closure was observed in MMLU-Pro
Raw prediction-level instability was partially reduced through canonicalization
Family-level perturbation signatures remained unchanged after canonicalization
MMLU-Pro showed limited global neutrality under the tested perturbations

Abstract

This technical note presents a controlled three-family stress test on MMLU-Pro under admissible answer-interface perturbations (baseline, choice_shuffle, label_remap). Three model families are evaluated on a locked subset of 140 items spanning 14 categories. A methodological caveat affecting prediction comparability is explicitly identified and corrected through full prediction-space canonicalization with exact decoder recovery. The family-level perturbation signatures remain unchanged after canonicalization, while part of the raw prediction-level instability is reduced but not eliminated. The result is diagnostic and local in scope: under the tested setup, MMLU-Pro remains locally usable but exhibits interface-sensitive evaluative closure and limited global neutrality under the tested perturbations.
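The abstract does not spell out how the perturbations or the canonicalization are implemented, so the following is only a minimal sketch of one plausible reading. It assumes that choice_shuffle permutes the order of the answer options under fixed labels, that label_remap keeps the order but swaps the label symbols, and that canonicalization means decoding each predicted label back to the index of the option in the original item. All names here (`PerturbedItem`, `canonicalize`, `instability`) are hypothetical and are not taken from the paper.

```python
import random
from dataclasses import dataclass
from typing import Dict, List

LABELS = "ABCDEFGHIJ"  # MMLU-Pro items carry up to ten options


@dataclass
class PerturbedItem:
    options: List[str]       # option texts in the order shown to the model
    labels: List[str]        # label shown at each presented position
    decoder: Dict[str, int]  # presented label -> index in the ORIGINAL item


def baseline(options: List[str]) -> PerturbedItem:
    """Unperturbed interface: original order, default labels."""
    labels = list(LABELS[: len(options)])
    return PerturbedItem(list(options), labels,
                         {lab: i for i, lab in enumerate(labels)})


def choice_shuffle(options: List[str], rng: random.Random) -> PerturbedItem:
    """Assumed semantics: permute option order, keep the label alphabet."""
    order = list(range(len(options)))
    rng.shuffle(order)
    labels = list(LABELS[: len(options)])
    decoder = {labels[pos]: orig for pos, orig in enumerate(order)}
    return PerturbedItem([options[i] for i in order], labels, decoder)


def label_remap(options: List[str], new_labels: List[str]) -> PerturbedItem:
    """Assumed semantics: keep option order, replace the label symbols."""
    if len(new_labels) != len(options) or len(set(new_labels)) != len(new_labels):
        raise ValueError("need one distinct label per option")
    return PerturbedItem(list(options), list(new_labels),
                         {lab: i for i, lab in enumerate(new_labels)})


def canonicalize(item: PerturbedItem, predicted_label: str) -> int:
    """Map a raw predicted label to the original option index."""
    return item.decoder[predicted_label]


def instability(base: List[int], perturbed: List[int]) -> float:
    """Hypothetical metric: share of items whose canonical prediction changes."""
    if len(base) != len(perturbed) or not base:
        raise ValueError("need two equal-length, non-empty prediction lists")
    return sum(b != p for b, p in zip(base, perturbed)) / len(base)
```

Under this reading, comparing raw labels across conditions would count a model as unstable whenever the same option text appears under a different label, which is one way a comparability caveat of the kind the abstract mentions could arise. Comparing canonical indices removes that artifact, so any remaining disagreement reflects an actual change in the chosen option.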

Authors

Danilo Tavella
