With the rapid development of Earth observation satellite technology, remote sensing data are growing exponentially in volume, diversity, and resolution, and traditional interpretation methods struggle to meet the demands of new applications for timeliness, accuracy, and scalability. Breakthroughs in Multimodal Foundation Models (MFMs) provide a technical paradigm for building the next generation of remote sensing systems. As an important direction in artificial intelligence, remote sensing agents can perform cognitive functions such as perception, inference, planning, and interaction based on remote sensing inputs, and they show significant technical advantages through mechanisms such as dynamic tool selection, contextual knowledge retrieval, reasoning-chain generation, and adaptation to task goals. In this paper, we systematically review the technical architecture, system composition, and application potential of these agents, focusing on key technical modules such as retrieval-augmented generation, chain-of-thought reasoning, and expert-in-the-loop optimization, and we discuss the challenges and future directions in their technical evolution. Our review indicates that multimodal foundation models, by deeply fusing remote sensing modalities such as synthetic aperture radar (SAR), optical imagery, and hyperspectral imagery, have demonstrated transformative potential in disaster emergency response, urban dynamics monitoring, and intelligent environmental analysis. These models not only automate complex analysis workflows and improve the precision of decision support, but also offer new ways to generate context-aware information efficiently.
Liu et al. (Mon,) studied this question.