What does this research mean for the field?

The Multi-Modal Structural and Semantic-Adaptive Network (MSSA-Net) improves structural fidelity, brightness recovery, and semantic naturalness in low-light image enhancement by integrating pseudo-infrared structural priors and Large Multi-modal Model (LMM)-driven scene-adaptive attention. Novelty: methodological. Consensus alignment: neutral.

What question did this study set out to answer?

The aim is to enhance low-light images by addressing structural inconsistency and semantic drift in existing methods.

March 27, 2026 | Open Access

MSSA-Net: Multi-Modal Structural and Semantic-Adaptive Network for Low-Light Image Enhancement

Key Points

Targets structural inconsistency and semantic drift in existing low-light image enhancement methods.
Designs a Multi-Scale Self-Refinement Block (MSRB) for multi-scale feature extraction and progressive refinement.
Introduces a pseudo-infrared structural prior, derived from the input image, that provides noise-insensitive geometric cues.
Uses a Structure-Guided Cross-Attention (SGCA) module to extract structure-dominant features.
Integrates the refined visible features and the structural features through an adaptive residual fusion (ARF) module.
Develops an LMM-driven scene-adaptive attention mechanism that generates instance-aware scene tags from a coarse preview and injects semantic embeddings into the visual features.
Reports improved structural fidelity across multiple benchmarks.
Reports improved brightness recovery in low-light images.
Reports improved semantic naturalness across multiple benchmarks.
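
As an illustration of the adaptive residual fusion idea in the key points, the following Python sketch blends refined visible features with structural features through a per-element gate. The summary does not specify how the fusion is computed, so the gate form, the function name `adaptive_residual_fusion`, and the parameters `w_v`, `w_s`, and `bias` are all assumptions made for illustration, not the authors' module.

```python
import math


def sigmoid(x: float) -> float:
    """Logistic function, mapping any real number into (0, 1)."""
    return 1.0 / (1.0 + math.exp(-x))


def adaptive_residual_fusion(visible, structural, w_v=1.0, w_s=1.0, bias=0.0):
    """Hypothetical gated residual fusion of two equal-length feature vectors.

    Assumed form (not taken from the paper):
        gate_i = sigmoid(w_v * visible_i + w_s * structural_i + bias)
        out_i  = visible_i + gate_i * structural_i
    """
    if len(visible) != len(structural):
        raise ValueError("feature vectors must have the same length")
    fused = []
    for v, s in zip(visible, structural):
        gate = sigmoid(w_v * v + w_s * s + bias)  # per-element weight in (0, 1)
        fused.append(v + gate * s)                # visible path kept as the residual
    return fused


# With zero gate weights every gate is 0.5, so half of each structural value is added.
print(adaptive_residual_fusion([1.0, 2.0], [0.5, -1.0], w_v=0.0, w_s=0.0))
```

In a trained network the gate parameters would be learned, typically per channel or per pixel; scalar weights are used here only to keep the sketch short.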

Abstract

Low-light image enhancement (LLIE) remains challenging due to severe degradation of high-frequency structures and semantic ambiguity under extreme darkness. Although existing methods achieve satisfactory brightness recovery, they often suffer from structural inconsistency and semantic drift, as diverse scenes are typically processed with uniform enhancement strategies or static text prompts. To address these issues, we propose a Multi-Modal Structural and Semantic-Adaptive Network (MSSA-Net) under a structure-anchored paradigm. First, we design a Multi-Scale Self-Refinement Block (MSRB) to enhance degraded visible representations through multi-scale feature extraction and progressive refinement. Meanwhile, a pseudo-infrared structural prior derived from the input image is introduced to provide noise-insensitive geometric cues. These cues are extracted via a Structure-Guided Cross-Attention (SGCA) module to produce structure-dominant features. The refined visible features and structural features are then adaptively integrated through an adaptive residual fusion (ARF) module to achieve balanced restoration. Furthermore, we develop a Large Multi-modal Model (LMM)-Driven Scene-Adaptive Attention mechanism that generates instance-aware scene tags from a coarse preview and injects semantic embeddings into visual features. Extensive experiments demonstrate that MSSA-Net improves structural fidelity, brightness recovery, and semantic naturalness across multiple benchmarks.
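
To make the cross-attention step in the abstract more concrete, here is a minimal Python sketch of scaled dot-product cross-attention between two token sets. The abstract does not state which branch supplies queries, keys, or values, so this sketch assumes queries come from the visible features and keys and values come from the pseudo-infrared structural features. The function name `structure_guided_cross_attention` is hypothetical, and learned projections are omitted; it should not be read as the SGCA module itself.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def dot(a, b):
    """Dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def structure_guided_cross_attention(visible_tokens, structure_tokens):
    """Scaled dot-product cross-attention without learned projections.

    Assumption: each visible token acts as a query, and the structural
    tokens act as both keys and values. Every output row is therefore a
    convex combination of the structural tokens.
    """
    dim = len(visible_tokens[0])
    scale = 1.0 / math.sqrt(dim)
    attended = []
    for query in visible_tokens:
        # Similarity of this visible token to every structural token.
        weights = softmax([dot(query, key) * scale for key in structure_tokens])
        # Weighted average of the structural tokens, one feature dimension at a time.
        row = [
            sum(w * value[d] for w, value in zip(weights, structure_tokens))
            for d in range(dim)
        ]
        attended.append(row)
    return attended


visible = [[5.0, 0.0]]
structure = [[1.0, 0.0], [0.0, 1.0]]
# The query aligns with the first structural token, so that token dominates the output.
print(structure_guided_cross_attention(visible, structure))
```

Because the output is built only from structural tokens, it is "structure-dominant" in the loose sense that the visible branch decides where to look while the structural branch decides what is returned.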
