March 18, 2026 · Open Access

DR-CLIP: A Deformable Vision–Language Model for Scale-Invariant Object Counting in Remote Sensing Images

Key Points

The goal is to improve object counting in remote sensing images despite hard-to-recognize small targets and semantically ambiguous open-vocabulary queries.
The authors developed DR-CLIP, a vision–language model that combines deformable visual feature extraction with text-guided prediction.
A Region-to-Instruction (R2I) mechanism converts points, bounding boxes, and polygons into a unified image–text training representation.
Multi-scale Deformable Attention (MSDA) improves feature extraction across extreme scale variations and cluttered backgrounds.
A Text-Guided Counting Head aligns image and text features through contrastive learning, enabling open-vocabulary counting without category-specific retraining.
On DOTA-v2.0, DR-CLIP achieved a Mean Absolute Error (MAE) of 2.34 and a Root Mean Squared Error (RMSE) of 3.89, outperforming baselines by 19.0% in MAE.
MSDA raised Small-Object Recall (SOR) to 0.824, improving counts of dense, small objects.
In cross-modal retrieval on RSICD, DR-CLIP attained R@1 scores of 68.3% (image-to-text) and 72.1% (text-to-image).
Performance degraded by only 8.7% in cross-domain tests, compared with a 23.4% drop for baseline methods.
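
For readers unfamiliar with the counting metrics quoted above, the following is a minimal sketch of how MAE and RMSE are conventionally computed over per-image counts. It uses the standard textbook definitions, not code from the study; the function names and the toy counts are illustrative assumptions, and `relative_reduction` assumes that a figure such as "19.0% in MAE" denotes a relative reduction against a baseline error.

```python
import math
from typing import Sequence


def mae(predicted: Sequence[float], actual: Sequence[float]) -> float:
    """Mean Absolute Error between predicted and ground-truth counts."""
    if len(predicted) != len(actual) or not predicted:
        raise ValueError("inputs must be non-empty and of equal length")
    return sum(abs(p - a) for p, a in zip(predicted, actual)) / len(predicted)


def rmse(predicted: Sequence[float], actual: Sequence[float]) -> float:
    """Root Mean Squared Error; penalizes large per-image errors more than MAE."""
    if len(predicted) != len(actual) or not predicted:
        raise ValueError("inputs must be non-empty and of equal length")
    mean_sq = sum((p - a) ** 2 for p, a in zip(predicted, actual)) / len(predicted)
    return math.sqrt(mean_sq)


def relative_reduction(baseline_error: float, model_error: float) -> float:
    """Fraction by which the model's error is lower than the baseline's."""
    return (baseline_error - model_error) / baseline_error


# Toy per-image counts (made up for illustration, not data from the study).
predicted_counts = [12, 7, 30, 4]
true_counts = [10, 7, 33, 5]

print(mae(predicted_counts, true_counts))   # absolute errors 2, 0, 3, 1
print(rmse(predicted_counts, true_counts))  # squared errors 4, 0, 9, 1
```

Because RMSE squares each error before averaging, it is never smaller than MAE on the same data, which is consistent with the reported pair of 2.34 and 3.89.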

Abstract

Object counting in remote sensing images is valuable for applications such as urban planning and environmental monitoring. However, it remains challenging due to heterogeneous annotations, semantic ambiguity in open-vocabulary queries, and degraded performance on small targets. To address these limitations, we propose DR-CLIP (Deformable Remote CLIP), a vision–language model for remote sensing image counting that combines deformable visual feature extraction with text-guided prediction. DR-CLIP includes (1) a Region-to-Instruction (R2I) mechanism that converts points, bounding boxes, and polygons into a unified image–text training representation, (2) a Multi-scale Deformable Attention (MSDA) module that enhances discriminative feature extraction across extreme scale variations and cluttered backgrounds, and (3) a Text-Guided Counting Head that establishes robust cross-modal alignment through contrastive learning, enabling open-vocabulary counting without category-specific retraining. On DOTA-v2.0, DR-CLIP achieves a Mean Absolute Error (MAE) of 2.34 and a Root Mean Squared Error (RMSE) of 3.89, outperforming baselines by 19.0% in MAE. The MSDA module significantly increases Small-Object Recall (SOR) to 0.824, which is especially beneficial when counting dense, small objects. In cross-modal retrieval, DR-CLIP attains R@1 scores of 68.3% (image-to-text) and 72.1% (text-to-image) on the Remote Sensing Image Captioning Dataset (RSICD). The framework generalizes robustly, with only 8.7% performance degradation in cross-domain tests, significantly lower than the 23.4% drop observed in baseline methods.
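
The abstract does not describe how R2I works internally, so the following is only a rough sketch of the general idea of folding heterogeneous annotations into one image–text training sample. Everything in it is a hypothetical illustration rather than the authors' method: the `Annotation` and `Sample` types, the choice of an axis-aligned box as the common geometry, and the prompt template are all assumptions.

```python
from dataclasses import dataclass
from typing import List, Sequence, Tuple

Box = Tuple[float, float, float, float]  # (x_min, y_min, x_max, y_max)


@dataclass
class Annotation:
    """Hypothetical annotation record; coords is a flat [x, y, x, y, ...] list."""
    kind: str  # "point", "box", or "polygon"
    coords: Sequence[float]
    category: str


@dataclass
class Sample:
    """Hypothetical unified training sample: text instruction plus target count."""
    instruction: str
    count: int
    regions: List[Box]


def to_box(ann: Annotation) -> Box:
    """Reduce any supported annotation to its axis-aligned bounding box."""
    xs, ys = list(ann.coords[0::2]), list(ann.coords[1::2])
    if not xs or len(xs) != len(ys):
        raise ValueError("coords must hold a non-empty list of x, y pairs")
    if ann.kind == "point" and len(xs) != 1:
        raise ValueError("a point needs exactly one x, y pair")
    if ann.kind == "box" and len(xs) != 2:
        raise ValueError("a box needs two corners")
    if ann.kind == "polygon" and len(xs) < 3:
        raise ValueError("a polygon needs at least three vertices")
    if ann.kind not in ("point", "box", "polygon"):
        raise ValueError(f"unsupported annotation kind: {ann.kind}")
    # A point collapses to a zero-area box; a polygon to its envelope.
    return (min(xs), min(ys), max(xs), max(ys))


def build_sample(annotations: Sequence[Annotation], category: str) -> Sample:
    """Pair an assumed counting prompt with the instances of one category."""
    regions = [to_box(a) for a in annotations if a.category == category]
    instruction = f"Count the {category} instances in this image."  # assumed template
    return Sample(instruction=instruction, count=len(regions), regions=regions)


# Toy annotations mixing all three formats (made up for illustration).
annotations = [
    Annotation("point", [5, 5], "ship"),
    Annotation("box", [0, 0, 10, 4], "ship"),
    Annotation("polygon", [1, 1, 4, 1, 4, 3, 1, 3], "plane"),
]
sample = build_sample(annotations, "ship")
print(sample.instruction, sample.count)
```

The point of the sketch is only that, once every annotation type maps to the same structure, datasets labeled with points, boxes, or polygons can supply supervision for the same text-conditioned counting objective.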
