Vision-Language Models (VLMs) have demonstrated great potential in interpreting remote sensing (RS) images through language-guided semantic. However, the effectiveness of these VLMs critically depends on high-quality image-text training data that captures rich semantic relationships between visual content and language descriptions. Unlike natural images, RS lacks large-scale interleaved image-text pairs from web data, making data collection challenging. While current approaches rely primarily on rule-based methods or flagship VLMs for data synthesis, a systematic framework for automated quality assessment of such synthetically generated RS vision-language data is notably absent. To fill this gap, we propose a novel score model trained on large-scale RS vision-language preference data for automated quality assessment. Our empirical results demonstrate that fine-tuning CLIP or advanced VLMs (e.g., Qwen2-VL) with the top 30% of data ranked by our score model achieves superior accuracy compared to both full-data fine-tuning and CLIP-score-based ranking approaches. Furthermore, we demonstrate applications of our scoring model for reinforcement learning (RL) training and best-of-N (BoN) test-time scaling, enabling significant improvements in VLM performance for RS tasks. Our code, model, and dataset are publicly available
Building similarity graph...
Analyzing shared references across papers
Loading...
Dilxat Muhtar
Erlong Zhang
Zhenshi Li
Building similarity graph...
Analyzing shared references across papers
Loading...
Muhtar et al. (Sun,) studied this question.
www.synapsesocial.com/papers/68ecc715d1cc7436f7d18ba6 — DOI: https://doi.org/10.48550/arxiv.2503.00743
Synapse has enriched 5 closely related papers on similar clinical questions. Consider them for comparative context: