Multimodal remote sensing image classification improves a model's capacity to recognize complex land-cover patterns by integrating data from heterogeneous sensors, such as hyperspectral imagery (HSI) and light detection and ranging (LiDAR). However, many existing classification models ignore the aleatoric and epistemic uncertainties introduced during data acquisition and labeling. As a result, they become less robust to noise and more vulnerable to spurious correlations, which ultimately weakens their ability to generalize to unseen data. To mitigate these issues, multimodal joint distribution modeling is reformulated as a flow matching optimization problem that learns a distribution evolution process under an unknown target distribution, and a reinforcement learning-driven flow matching (RL-FM) framework is proposed for multimodal remote sensing image classification. Specifically, the feature distributions of remote sensing images from different modalities are first modeled using variational autoencoders, and a multimodal mixture distribution is then constructed via a Gaussian mixture strategy to serve as the initial distribution for flow matching. To perform flow matching optimization when the target distribution is unknown, label information is further exploited to guide the transformation of the initial distribution toward the target distribution. At the same time, the distribution evolution process of flow matching is formulated as a Markov decision process (MDP), enabling the model to learn an evolution path from the initial distribution to the target distribution by maximizing the expected cumulative reward. RL-FM jointly accounts for immediate classification loss and long-term generalization performance, thereby alleviating the suboptimal convergence caused by myopic gradient updates. Furthermore, by incorporating counterfactual causal inference into policy optimization, a counterfactual proximal policy optimization (CPPO) algorithm is designed.
CPPO strengthens the model's capacity to capture the causal relationship between actions and rewards, thus improving generalization in scenarios with limited labeled samples. Experimental results on multiple benchmark datasets demonstrate that the proposed RL-FM achieves state-of-the-art performance on multimodal remote sensing image classification tasks. The code is available at: https://github.com/zwdmw/RL-FM.
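To make the flow matching setup concrete, the following is a minimal sketch of its two basic ingredients as described above: sampling an initial point from a Gaussian mixture over modalities, and transporting it along a straight path toward a label-dependent target. It is not the RL-FM implementation. The mixture parameters, the `prototypes` dictionary used as a stand-in for the unknown target distribution, and all function names are assumptions made for illustration. The velocity used here is the known straight-path target rather than a learned network, and the MDP formulation and CPPO are not covered.

```python
import random


def sample_mixture(components, weights, rng):
    """Draw one sample from a diagonal Gaussian mixture.

    components: list of (mean, std) pairs, one per modality. These are
    hypothetical stand-ins for per-modality VAE posteriors.
    """
    mean, std = rng.choices(components, weights=weights, k=1)[0]
    return [rng.gauss(m, s) for m, s in zip(mean, std)]


def interpolate(x0, x1, t):
    """Point on the straight path from x0 to x1 at time t in [0, 1]."""
    return [(1.0 - t) * a + t * b for a, b in zip(x0, x1)]


def target_velocity(x0, x1):
    """Velocity of the straight path; the regression target a velocity
    model would be trained to match in conditional flow matching."""
    return [b - a for a, b in zip(x0, x1)]


def euler_transport(x0, velocity_fn, steps=10):
    """Integrate dx/dt = velocity_fn(x, t) from t = 0 to t = 1."""
    x, dt = list(x0), 1.0 / steps
    for k in range(steps):
        v = velocity_fn(x, k * dt)
        x = [xi + dt * vi for xi, vi in zip(x, v)]
    return x


rng = random.Random(0)
hsi = ([0.0, 0.0], [1.0, 1.0])      # assumed (mean, std) of the HSI latent
lidar = ([2.0, -1.0], [0.5, 0.5])   # assumed (mean, std) of the LiDAR latent
prototypes = {0: [3.0, 3.0], 1: [-3.0, 3.0]}  # assumed label-conditioned targets

x0 = sample_mixture([hsi, lidar], [0.5, 0.5], rng)  # initial distribution sample
x1 = prototypes[1]                                  # target chosen by the label
v = target_velocity(x0, x1)
x_end = euler_transport(x0, lambda x, t: v)         # oracle velocity, not learned
```

With the oracle velocity, Euler integration lands on the chosen prototype up to floating-point error. In a learned setting, `velocity_fn` would be a model trained to regress `target_velocity` at points returned by `interpolate`.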
Zhang et al. (Thu,) studied this question.