Cross-view fine-grained localization estimates a ground camera's pixel-level coordinates in an aerial image by analyzing visual correspondences between the two views. Recent studies have made significant progress on this task, but when a model trained in a source area is applied directly to a new target area, its localization performance often degrades sharply because of the domain gap between the two areas. Moreover, obtaining accurate ground truth (GT) for the target area to retrain the model is prohibitively expensive. To adapt the localization model to the target area, this paper proposes a weakly supervised learning approach based on multi-teacher knowledge distillation. The approach uses multiple pre-trained teacher models to make predictions for the target area and employs a learning-free cross-view instance matching and view alignment (CVMA) module to evaluate the quality of the predicted coordinates from geometric, semantic, and visual perspectives. Based on the evaluation results, the best prediction is selected as pseudo-GT, and potentially anomalous training samples are filtered out. The CVMA module also functions as a learning-free fine-grained localization method, achieving performance comparable to some learning-based methods. Our approach is validated on the VIGOR benchmark using three state-of-the-art models, and experimental results show that it significantly improves the localization performance of the models in the target area.
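The selection step described above can be illustrated with a minimal sketch. It assumes that each teacher's prediction has already been reduced to a single quality score; how CVMA combines its geometric, semantic, and visual evaluations is not stated here, so the scores, the function name `select_pseudo_gt`, and the `min_score` threshold are all hypothetical stand-ins rather than the authors' method.

```python
from typing import List, Optional, Sequence, Tuple

Coord = Tuple[float, float]  # (x, y) pixel coordinates in the aerial image


def select_pseudo_gt(
    teacher_preds: Sequence[Sequence[Coord]],
    quality_scores: Sequence[Sequence[float]],
    min_score: float,
) -> List[Optional[Coord]]:
    """Pick, per sample, the teacher prediction with the highest quality score.

    teacher_preds[t][i] is teacher t's predicted coordinate for sample i.
    quality_scores[t][i] is a hypothetical scalar quality score for that
    prediction (a stand-in for the CVMA evaluation, higher is better).
    Samples whose best score is below min_score are treated as anomalous
    and returned as None so they can be dropped from training.
    """
    num_teachers = len(teacher_preds)
    num_samples = len(teacher_preds[0])
    pseudo_gt: List[Optional[Coord]] = []
    for i in range(num_samples):
        # Index of the teacher whose prediction scored best on sample i.
        best_t = max(range(num_teachers), key=lambda t: quality_scores[t][i])
        if quality_scores[best_t][i] < min_score:
            pseudo_gt.append(None)  # filtered out as a likely anomaly
        else:
            pseudo_gt.append(teacher_preds[best_t][i])
    return pseudo_gt


# Two teachers, two target-area samples (made-up numbers for illustration).
preds = [
    [(10.0, 10.0), (50.0, 50.0)],  # teacher 0
    [(12.0, 11.0), (80.0, 80.0)],  # teacher 1
]
scores = [
    [0.9, 0.2],  # teacher 0
    [0.6, 0.3],  # teacher 1
]
labels = select_pseudo_gt(preds, scores, min_score=0.5)
```

In this toy case the first sample takes teacher 0's prediction, while the second is discarded because no teacher reaches the threshold. The retained pseudo-GT labels would then supervise the student model in the target area.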
Chen et al. (Thu,) studied this question.