Data-fusion networks have shown significant promise for RGB-thermal scene parsing. However, most existing studies rely on symmetric duplex encoders for heterogeneous feature extraction and fusion, paying inadequate attention to the inherent differences between the RGB and thermal modalities. Recent vision foundation models (VFMs), trained via self-supervised learning on large-scale unlabeled datasets, have proven more capable of extracting informative, general-purpose features than supervised encoders. However, their potential has yet to be fully leveraged in this domain. In this study, we take one step toward this new research area by exploring a feasible strategy to fully exploit VFM features for RGB-thermal scene parsing. Specifically, we delve deeper into the unique characteristics of the RGB and thermal modalities and design a hybrid, asymmetric encoder that incorporates both a VFM and a cross-modal spatial prior descriptor (CSPD), enabling enhanced extraction of complementary heterogeneous features. The extracted features then undergo dual-path feature fusion through our proposed progressive heterogeneous feature integrators. Moreover, we introduce an auxiliary task to further enrich the local semantics of the fused features, thereby improving the overall performance of RGB-thermal scene parsing. Our proposed HAPNet, incorporating all these components, delivers superior performance under challenging illumination conditions. Extensive experiments demonstrate that HAPNet outperforms all other state-of-the-art methods, with mIoU improvements of 0.1%, 1.0%, and 2.4% on three public RGB-thermal scene parsing datasets: MFNet, PST900, and KP Day-Night, respectively. Additionally, our method exhibits exceptional generalizability for RGB-HHA scene parsing. We believe this new paradigm opens up new opportunities for future developments in data-fusion scene parsing approaches.
Li et al. (Sun,) studied this question.