Building footprint extraction from satellite imagery is essential for urban planning, population estimation, and disaster response. While Mask Transformers such as Mask2Former have achieved state-of-the-art results on standard benchmarks, their deployment for global-scale building detection reveals a previously undocumented failure mode: class head miscalibration. We find that when Mask2Former is trained with semantic-level rather than instance-level labels, the class prediction head becomes severely miscalibrated: its maximum building confidence score reaches only 8.5%, even though the predicted masks are spatially accurate (67% coverage). This miscalibration stems from the degeneration of Hungarian matching: semantic labels reduce all building instances in a tile to a single ground-truth object, so only one query is matched as positive, creating an extreme 199:1 negative-to-positive class ratio. As a result, standard post-processing produces zero detections (0% IoU). We propose CalibMask, a calibration-aware training framework that resolves this problem through three components: automatic instance label generation via connected component analysis, calibrated weight transfer from a properly trained regional model, and differential learning rate scheduling. Trained on 17,305 tiles spanning 9 countries, CalibMask achieves 50.1% IoU on the USA, 49.6% on the UK, and 53.8% on France (zero-shot), while restoring the standard prediction pipeline to full functionality. A comprehensive ablation study confirms that instance labels are the most critical component (44% relative IoU drop without them), followed by calibrated transfer (19% relative drop).
Elshater et al. (Thu,) studied this question.
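To make the instance label generation step concrete, the following is a minimal sketch of connected component analysis on a binary building mask, written in plain Python. It is not the CalibMask implementation: the function names `label_instances` and `negative_to_positive_ratio`, the default 4-connectivity, and the toy tile are all assumptions made for illustration. The query count of 200 used at the end is likewise an assumption, chosen only because it is consistent with the 199:1 ratio reported above; the abstract does not state the number of queries.

```python
from collections import deque


def label_instances(mask, connectivity=4):
    """Split a binary mask (list of rows of 0/1) into connected components.

    Returns (labels, count): labels has the same shape as mask, with 0 for
    background and 1..count for each component, numbered in raster order.
    The connectivity choice is an assumption, not taken from the paper.
    """
    if connectivity == 4:
        steps = [(-1, 0), (1, 0), (0, -1), (0, 1)]
    elif connectivity == 8:
        steps = [(dr, dc) for dr in (-1, 0, 1) for dc in (-1, 0, 1)
                 if (dr, dc) != (0, 0)]
    else:
        raise ValueError("connectivity must be 4 or 8")

    rows = len(mask)
    cols = len(mask[0]) if rows else 0
    labels = [[0] * cols for _ in range(rows)]
    count = 0

    for r in range(rows):
        for c in range(cols):
            if not mask[r][c] or labels[r][c]:
                continue  # background, or already assigned to a component
            count += 1
            labels[r][c] = count
            queue = deque([(r, c)])
            while queue:  # breadth-first flood fill of one component
                cr, cc = queue.popleft()
                for dr, dc in steps:
                    nr, nc = cr + dr, cc + dc
                    if (0 <= nr < rows and 0 <= nc < cols
                            and mask[nr][nc] and not labels[nr][nc]):
                        labels[nr][nc] = count
                        queue.append((nr, nc))
    return labels, count


def negative_to_positive_ratio(num_queries, num_targets):
    """Unmatched queries per matched query under one-to-one matching."""
    matched = min(num_queries, num_targets)
    if matched == 0:
        raise ValueError("ratio is undefined when no query is matched")
    return (num_queries - matched) / matched


# Hypothetical 4x5 tile containing three separate buildings.
tile = [
    [1, 1, 0, 0, 0],
    [1, 1, 0, 1, 1],
    [0, 0, 0, 1, 1],
    [0, 1, 0, 0, 0],
]
labels, count = label_instances(tile)

# Semantic labels give one target per tile; instance labels give `count`.
semantic_ratio = negative_to_positive_ratio(200, 1)      # 200 is assumed
instance_ratio = negative_to_positive_ratio(200, count)  # 200 is assumed
```

On this toy tile the semantic-label case leaves 199 unmatched queries for every matched one, while splitting the mask into three instances lowers the ratio to 197:3. In practice a library routine such as `scipy.ndimage.label` would likely be used in place of the hand-written flood fill.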