Machine learning models trained on simulated MRI data are increasingly used to extract quantitative measurements from real brain scans. But can these models tell us when they might be wrong? We provide evidence that they usually cannot. When tested on real patient data without any fine-tuning, the models still produce reasonable measurements, but their confidence estimates break down: they can no longer distinguish reliable predictions from unreliable ones. We call this the sim-to-real uncertainty gap. We demonstrate that this gap can be fixed with a quick calibration step using only 5% of real scan data. To help the community study and solve this problem, we release qMR-FailureBench, a standardized benchmark of 60,000 simulated MRI signals with five evaluation tasks. We also show that our system can not only detect unreliable measurements, but also identify why they failed and attempt to correct them, reducing errors by 39.6%.
Sreenath P. Kyathanahally (Wed,) studied this question.