Low-light image enhancement (LLIE) remains challenging due to severe degradation of high-frequency structures and semantic ambiguity under extreme darkness. Although existing methods achieve satisfactory brightness recovery, they often suffer from structural inconsistency and semantic drift, because diverse scenes are typically processed with uniform enhancement strategies or static text prompts. To address these issues, we propose a Multi-Modal Structural and Semantic-Adaptive Network (MSSA-Net) under a structure-anchored paradigm. First, we design a Multi-Scale Self-Refinement Block (MSRB) to enhance degraded visible representations through multi-scale feature extraction and progressive refinement. Meanwhile, a pseudo-infrared structural prior derived from the input image is introduced to provide noise-insensitive geometric cues. These cues are extracted via a Structure-Guided Cross-Attention (SGCA) module to produce structure-dominant features. The refined visible features and the structural features are then integrated through an Adaptive Residual Fusion (ARF) module to achieve balanced restoration. Furthermore, we develop a Large Multi-modal Model (LMM)-Driven Scene-Adaptive Attention mechanism that generates instance-aware scene tags from a coarse preview and injects the corresponding semantic embeddings into the visual features. Extensive experiments on multiple benchmarks demonstrate that MSSA-Net improves structural fidelity, brightness recovery, and semantic naturalness.
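To make the structure-anchored data flow concrete, the following is a minimal sketch of one possible reading of the SGCA and ARF steps, not the authors' implementation. It assumes that visible features act as attention queries while the pseudo-infrared structural features act as keys and values, and that fusion is a sigmoid-gated residual addition. The function names (`cross_attention`, `gated_residual_fusion`), the gate parameters, and the toy token values are all hypothetical; the abstract does not specify any of them.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def dot(a, b):
    """Inner product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def cross_attention(queries, keys, values):
    """Scaled dot-product cross-attention over lists of feature vectors.

    Assumption: queries come from the visible branch, keys/values from
    the structural (pseudo-infrared) branch.
    """
    scale = 1.0 / math.sqrt(len(keys[0]))
    width = len(values[0])
    attended = []
    for q in queries:
        weights = softmax([dot(q, k) * scale for k in keys])
        attended.append(
            [sum(w * v[d] for w, v in zip(weights, values)) for d in range(width)]
        )
    return attended


def gated_residual_fusion(visible, structural, gate_weight=1.0, gate_bias=0.0):
    """Hypothetical fusion rule: out = visible + g * structural.

    The per-token gate g in (0, 1) is a sigmoid of the visible/structural
    agreement; gate_weight and gate_bias stand in for learned parameters.
    """
    fused = []
    for v, s in zip(visible, structural):
        g = 1.0 / (1.0 + math.exp(-(gate_weight * dot(v, s) + gate_bias)))
        fused.append([vi + g * si for vi, si in zip(v, s)])
    return fused


# Toy example: three visible tokens attend over two structural tokens.
visible_tokens = [[0.2, 0.1], [0.0, 0.3], [0.5, 0.4]]
structure_tokens = [[1.0, 0.0], [0.0, 1.0]]

structure_features = cross_attention(visible_tokens, structure_tokens, structure_tokens)
fused_tokens = gated_residual_fusion(visible_tokens, structure_features)
```

In this sketch the residual form keeps the visible features intact when the structural branch contributes nothing, and the gate limits how strongly structural cues are added per token. A real model would operate on learned projections of 2D feature maps rather than raw toy vectors.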