What question did this study set out to answer?

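The study asks how features from vision foundation models (VFMs) can be fully exploited for RGB-thermal scene parsing, given that existing data-fusion networks mostly rely on symmetric duplex encoders that pay little attention to the differences between the RGB and thermal modalities.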

March 23, 2026 | Open Access

HAPNet: Toward superior RGB-thermal scene parsing via hybrid, asymmetric, and progressive heterogeneous feature fusion

Key Points

The study aims to improve RGB-thermal scene parsing by using vision foundation model (VFM) features and new feature fusion strategies.
Developed a hybrid, asymmetric encoder integrating vision foundation models and a cross-modal spatial prior descriptor.
Implemented dual-path feature fusion using progressive heterogeneous feature integrators.
Introduced an auxiliary task to enhance local semantics of fused features.
HAPNet improved mean Intersection over Union (mIoU) by 0.1%, 1.0%, and 2.4% on the MFNet, PST900, and KP Day-Night datasets, respectively.
Outperformed the other state-of-the-art methods it was compared against, including under challenging illumination conditions.
Exhibited strong generalizability for RGB-HHA scene parsing tasks.
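
For readers unfamiliar with the metric quoted above, the following is a minimal sketch of how mean Intersection over Union (mIoU) is commonly computed from per-pixel labels. It is not taken from the HAPNet code; the function names `confusion_matrix` and `mean_iou` are hypothetical, and skipping classes that never appear is an assumption, since evaluation protocols differ between datasets.

```python
def confusion_matrix(y_true, y_pred, num_classes):
    """Count pixels per (ground-truth class, predicted class) pair."""
    cm = [[0] * num_classes for _ in range(num_classes)]
    for t, p in zip(y_true, y_pred):
        cm[t][p] += 1
    return cm


def mean_iou(cm):
    """Average the per-class IoU = TP / (TP + FP + FN).

    Assumption: classes absent from both ground truth and prediction
    are skipped rather than counted as zero.
    """
    n = len(cm)
    ious = []
    for c in range(n):
        tp = cm[c][c]
        fn = sum(cm[c]) - tp                        # true class c, predicted as another class
        fp = sum(cm[r][c] for r in range(n)) - tp   # predicted c, truly another class
        denom = tp + fp + fn
        if denom > 0:
            ious.append(tp / denom)
    return sum(ious) / len(ious) if ious else 0.0


# Toy example: six flattened pixels, three classes.
y_true = [0, 0, 1, 1, 2, 2]
y_pred = [0, 1, 1, 1, 2, 0]
score = mean_iou(confusion_matrix(y_true, y_pred, 3))
print(f"mIoU = {score:.3f}")
```

In the toy example the per-class IoU values are 1/3, 2/3, and 1/2, so the mean is 0.5. The improvements reported above are differences between scores of this kind, expressed in percentage points.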

Abstract

Data-fusion networks have shown significant promise for RGB-thermal scene parsing. However, the majority of existing studies have relied on symmetric duplex encoders for heterogeneous feature extraction and fusion, paying inadequate attention to the inherent differences between RGB and thermal modalities. Recent progress in vision foundation models (VFMs), which leverage self-supervised learning on large-scale unlabeled datasets, has exhibited superior capabilities in extracting informative, general-purpose features compared to supervised encoders. However, their potential has yet to be fully leveraged in the domain. In this study, we take one step toward this new research area by exploring a feasible strategy to fully exploit VFM features for RGB-thermal scene parsing. Specifically, we delve deeper into the unique characteristics of RGB and thermal modalities, thereby designing a hybrid, asymmetric encoder that incorporates both a VFM and a cross-modal spatial prior descriptor (CSPD), enabling enhanced extraction of complementary heterogeneous features. The extracted features undergo dual-path feature fusion through our proposed progressive heterogeneous feature integrators. Moreover, we introduce an auxiliary task to further enrich the local semantics of fused features, thereby improving the overall performance of RGB-thermal scene parsing. Our proposed HAPNet, incorporating all these components, delivers superior performance under challenging illumination conditions. Extensive experiments demonstrate that HAPNet outperforms all other state-of-the-art methods, with improvements of 0.1%, 1.0%, and 2.4% in mIoU on three public RGB-thermal scene parsing datasets: MFNet, PST900, and KP Day-Night, respectively. Additionally, our method exhibits exceptional generalizability for RGB-HHA scene parsing. We believe this new paradigm has opened up new opportunities for future developments in data-fusion scene parsing approaches.
