Abstract Remote Sensing Image Change Captioning (RSICC) is an emerging task that aims to describe changes between bi-temporal remote sensing images in natural language. Existing methods capture feature differences between bi-temporal remote sensing images and use language decoders to interpret them accurately. Notably, not all regions change between bi-temporal images, and the presence or absence of change inherently imposes distinct difficulty levels on the RSICC task. Although several existing approaches have addressed this issue, they frequently suffer from unstable classification outcomes and feature loss during joint spatiotemporal modeling. This paper optimizes the classifier and introduces a Siamese network together with a bi-temporal image feature fusion module that comprehensively correlates spatial structures across the two acquisition times. The proposed framework enables efficient and reliable detection of changed bi-temporal image pairs and generates precise textual descriptions of the identified changes. The proposed method outperforms state-of-the-art methods on public datasets.
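To make the described pipeline concrete, the following is a minimal illustrative sketch of its three stages: a weight-sharing (Siamese) encoder, a bi-temporal feature fusion step, and a change classifier that gates the caption decoder. It is not the architecture proposed in this paper. The linear encoder, the concatenation-plus-difference fusion, the logistic classifier, and every name and constant (`SHARED_WEIGHTS`, `encode`, `fuse`, `change_probability`, `describe`, `scale`, `bias`, `threshold`) are assumptions chosen only for illustration.

```python
import math
from typing import Callable, List, Sequence

Vector = List[float]

# Hypothetical shared projection weights: 2 output features, 3 input values.
SHARED_WEIGHTS: List[Vector] = [[0.5, -0.2, 0.1], [0.3, 0.8, -0.5]]


def encode(image: Sequence[float], weights: Sequence[Sequence[float]]) -> Vector:
    """Linear projection followed by ReLU; a stand-in for a real backbone."""
    return [max(0.0, sum(w * x for w, x in zip(row, image))) for row in weights]


def fuse(feat_before: Vector, feat_after: Vector) -> Vector:
    """Concatenate both feature vectors with their absolute difference."""
    diff = [abs(a - b) for a, b in zip(feat_before, feat_after)]
    return feat_before + feat_after + diff


def change_probability(fused: Vector, feature_dim: int,
                       scale: float = 10.0, bias: float = -1.0) -> float:
    """Logistic score on the mean absolute difference (assumed classifier)."""
    diff = fused[-feature_dim:]  # the difference block is stored last
    mean_diff = sum(diff) / len(diff)
    return 1.0 / (1.0 + math.exp(-(scale * mean_diff + bias)))


def describe(before: Sequence[float], after: Sequence[float],
             decoder: Callable[[Vector], str], threshold: float = 0.5) -> str:
    """Encode both images with the same weights, fuse, then gate the decoder."""
    feat_before = encode(before, SHARED_WEIGHTS)  # Siamese: weights are shared
    feat_after = encode(after, SHARED_WEIGHTS)
    fused = fuse(feat_before, feat_after)
    if change_probability(fused, len(feat_before)) < threshold:
        return "no change"  # unchanged pairs skip caption generation
    return decoder(fused)
```

In this sketch, `decoder` is any callable that maps the fused feature vector to a sentence. Routing unchanged pairs away from the decoder reflects the observation that changed and unchanged pairs pose different difficulty levels.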
Hu et al. (Fri,) studied this question.