Object counting in remote sensing images is valuable for applications such as urban planning and environmental monitoring. However, it remains challenging due to heterogeneous annotations, semantic ambiguity in open-vocabulary queries, and performance degradation of small targets. To address these limitations, we propose DR-CLIP (Deformable Remote CLIP), a vision–language model for remote sensing image counting that incorporates deformable visual feature extraction with text-guided prediction. DR-CLIP includes a (1) Region-to-Instruction (R2I) mechanism to convert points, bounding boxes, and polygons into a unified image–text training representation, a (2) Multi-scale Deformable Attention (MSDA) to enhance discriminative feature extraction across extreme scale variations and cluttered backgrounds, and a (3) Text-Guided Counting Head that establishes robust cross-modal alignment through contrastive learning, achieving open-vocabulary counting capability without category-specific retraining. On DOTA-v2.0, DR-CLIP achieves a Mean Absolute Error (MAE) of 2.34 and a Root Mean Squared Error (RMSE) of 3.89, outperforming baselines by 19.0% in MAE. The MSDA module significantly increases Small-Object Recall (SOR) to 0.824, which is especially effective in situations involving dense and small object counting. In cross-modal retrieval, DR-CLIP attains R@1 scores of 68.3% (image-to-text) and 72.1% (text-to-image) on the Remote Sensing Image Captioning Dataset (RSICD). The framework generalizes robustly, with only 8.7% performance degradation in cross-domain tests, which is significantly lower than the 23.4% drop observed in baseline methods.
Nie et al. (Mon,) studied this question.