Large multi-modal models, which accept both textual and visual inputs, have shown strong capabilities across a wide range of vision and language tasks. Vision-text models of this kind have been studied extensively in natural-image settings but have received less attention in the remote sensing (RS) field. RS research often relies on models built for natural scenes, overlooking the potential improvements offered by systems tailored to remote sensing. In this paper, we work toward bridging this gap. First, we design a procedure to generate a large-scale RS image-text dataset with synthetic captions. Second, we use this dataset to fine-tune a targeted CLIP model and analyze how training on synthetic captions alone affects the model's capabilities. Lastly, we build a benchmark for remote sensing image-text models and evaluate our model alongside others recently proposed in the literature. We release the code of our benchmark system and our dataset.
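For readers unfamiliar with how CLIP-style models are trained on image-text pairs, the following is a minimal sketch of the symmetric contrastive objective commonly associated with CLIP. It is not the authors' training code, which is not shown here. The function names, the toy embeddings, and the temperature value of 0.07 are illustrative assumptions.

```python
import math


def l2_normalize(vec):
    """Scale a vector to unit length (assumes the vector is non-zero)."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec]


def log_softmax(row):
    """Numerically stable log-softmax over a list of scores."""
    m = max(row)
    log_sum = m + math.log(sum(math.exp(x - m) for x in row))
    return [x - log_sum for x in row]


def clip_contrastive_loss(image_embs, text_embs, temperature=0.07):
    """Symmetric contrastive loss over a batch of paired embeddings.

    image_embs[i] and text_embs[i] are assumed to describe the same sample.
    The temperature is a hypothetical fixed value chosen for illustration.
    """
    imgs = [l2_normalize(v) for v in image_embs]
    txts = [l2_normalize(v) for v in text_embs]
    n = len(imgs)

    # Cosine similarity between every image and every caption, scaled.
    logits = [
        [sum(a * b for a, b in zip(img, txt)) / temperature for txt in txts]
        for img in imgs
    ]

    # Image-to-text direction: row i should rank caption i highest.
    loss_i2t = -sum(log_softmax(logits[i])[i] for i in range(n)) / n

    # Text-to-image direction: same computation on the transposed matrix.
    logits_t = [list(col) for col in zip(*logits)]
    loss_t2i = -sum(log_softmax(logits_t[i])[i] for i in range(n)) / n

    return 0.5 * (loss_i2t + loss_t2i)


# Toy batch: two images and two captions in a 2-D embedding space.
images = [[1.0, 0.0], [0.0, 1.0]]
matched_captions = [[1.0, 0.0], [0.0, 1.0]]
swapped_captions = [[0.0, 1.0], [1.0, 0.0]]

loss_matched = clip_contrastive_loss(images, matched_captions)
loss_swapped = clip_contrastive_loss(images, swapped_captions)
```

In this toy batch, correctly paired captions yield a loss close to zero, while swapping the captions yields a much larger loss. This is the signal that pulls matching image and caption embeddings together during fine-tuning.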
Ricci et al. (Mon,) studied this question.