The accurate interpretation of land cover changes in multi-temporal satellite imagery is critical for Earth observation. However, existing methods typically yield static outputs, such as binary masks or fixed captions, that lack interactivity and user guidance. To address this limitation, we introduce remote sensing image change analysis (RSICA), a novel paradigm that enables instruction-guided, multi-turn exploration of temporal differences in bi-temporal images through visual question answering. To realize RSICA, we propose DeltaVLM, a vision-language model specifically designed for interactive change understanding. DeltaVLM comprises three key components: (1) a fine-tuned bi-temporal vision encoder that independently extracts semantic features from each image in the input pair; (2) a visual difference perception module with a cross-semantic relation measuring (CSRM) mechanism to interpret changes; and (3) an instruction-guided Q-former that selects query-relevant change features and aligns them with a frozen large language model to generate context-aware responses. We also present ChangeChat-105k, a large-scale instruction-following dataset containing over 105k diverse samples. Extensive experiments show that DeltaVLM achieves state-of-the-art performance in both single-turn captioning and multi-turn interactive change analysis, surpassing both general multimodal models and specialized remote sensing vision-language models.
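The three-stage data flow described above can be illustrated with a minimal, dependency-free sketch. It is not the DeltaVLM implementation: the function names (`encode`, `difference_features`, `select_with_query`) are hypothetical, the encoder is a stand-in that only normalizes patch vectors, the difference step is a simple similarity-gated subtraction used in place of CSRM (whose internals are not specified here), and the query step is plain softmax attention rather than a full Q-former. The frozen language model is omitted.

```python
import math


def encode(image):
    """Stand-in encoder (assumption): L2-normalise each patch vector.

    Each image is encoded on its own, mirroring the independent
    per-image feature extraction described in the text.
    """
    feats = []
    for patch in image:
        norm = math.sqrt(sum(x * x for x in patch)) or 1.0
        feats.append([x / norm for x in patch])
    return feats


def difference_features(feats_a, feats_b):
    """Placeholder difference step (NOT the paper's CSRM).

    The per-patch difference is scaled by (1 - cosine similarity), so
    unchanged patches contribute nothing.
    """
    out = []
    for fa, fb in zip(feats_a, feats_b):
        cos = sum(a * b for a, b in zip(fa, fb))  # inputs are unit vectors
        gate = 1.0 - cos
        out.append([gate * (b - a) for a, b in zip(fa, fb)])
    return out


def select_with_query(query, diff_feats):
    """Softmax attention of one instruction query over difference features.

    Returns the pooled feature and the attention weights; a real
    Q-former would use several learned queries and multiple layers.
    """
    scores = [sum(q * d for q, d in zip(query, f)) for f in diff_feats]
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    weights = [e / total for e in exps]
    dim = len(diff_feats[0])
    pooled = [sum(w * f[i] for w, f in zip(weights, diff_feats)) for i in range(dim)]
    return pooled, weights


# Toy bi-temporal pair: two patches of dimension 2; only the second changes.
before = [[1.0, 0.0], [0.0, 1.0]]
after = [[1.0, 0.0], [1.0, 0.0]]
diff = difference_features(encode(before), encode(after))
pooled, weights = select_with_query([1.0, -1.0], diff)
```

In this toy pair the unchanged first patch yields a zero difference vector, so a query aligned with the change direction places most of its attention weight on the second patch. In the full model, the pooled features would be projected into the language model's input space, a step this sketch leaves out.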
Deng et al. (Sun,) studied this question.