What question did this study set out to answer?

This research aims to improve underwater image captioning through enhanced feature extraction and fusion.

January 20, 2026

Large Foundation Model Empowered Region-aware Underwater Image Captioning

Key Points

Developed a region-discriminative feature extraction strategy using the Segment Anything Model (SAM), a large foundation model
Implemented a region-guided feature fusion strategy for encoding-decoding processes
Conducted experimental evaluations on three datasets to assess performance
Achieved state-of-the-art performance for underwater image captioning
Successfully delineated object and background regions for precise feature extraction
Generated accurate and comprehensive captions for underwater images

Abstract

Underwater image captioning facilitates the transformation from visual perception to semantic understanding in underwater computer vision. Despite advancements in this field, challenges remain in generating high-quality captions for underwater images. These challenges typically stem from (a) ambiguity between object and background regions for feature extraction, and (b) insufficient feature fusion across all regions. To address these challenges, we develop a large foundation model empowered region-aware underwater image captioning framework. Our novel contributions are two-fold: (a) A region-discriminative feature extraction strategy powered by the large foundation segment anything model (SAM) is developed. This strategy accurately delineates object and background regions through segmentation maps, enabling precise extraction of region-discriminative features. (b) A region-guided feature fusion strategy comprehensively fusing regional information throughout an encoding-decoding process is presented. This strategy utilizes a region-guided encoder for the progressive layer-wise fusion of region-discriminative features and grid features, followed by a meshed memory decoder that fuses multi-level encoded features, thereby enhancing the decoded features. Together, these contributions result in the generation of accurate and comprehensive underwater image captions. Experimental evaluations on three datasets demonstrate that our proposed framework achieves state-of-the-art performance for underwater image captioning.
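The idea of extracting separate features for object and background regions from a segmentation map can be illustrated with a minimal sketch. This is not the authors' implementation: it assumes a binary mask is already available (for example, one derived from SAM output), uses masked average pooling as a stand-in for the paper's extraction step, and the names `masked_average` and `region_features` are hypothetical.

```python
from typing import List, Tuple

Vector = List[float]


def masked_average(features: List[List[Vector]],
                   mask: List[List[int]],
                   target: int) -> Vector:
    """Average the feature vectors of all grid cells whose mask label equals target."""
    dim = len(features[0][0])
    total = [0.0] * dim
    count = 0
    for feat_row, mask_row in zip(features, mask):
        for vec, label in zip(feat_row, mask_row):
            if label == target:
                count += 1
                for i, value in enumerate(vec):
                    total[i] += value
    if count == 0:
        return total  # empty region: fall back to a zero vector
    return [t / count for t in total]


def region_features(features: List[List[Vector]],
                    mask: List[List[int]]) -> Tuple[Vector, Vector]:
    """Return (object_feature, background_feature) for a binary mask (1 = object)."""
    return masked_average(features, mask, 1), masked_average(features, mask, 0)


# Toy 2x2 grid of 2-dimensional features; the assumed mask marks the top row as object.
grid = [[[1.0, 0.0], [3.0, 0.0]],
        [[0.0, 2.0], [0.0, 4.0]]]
mask = [[1, 1],
        [0, 0]]
obj, bg = region_features(grid, mask)
```

In this toy case the object and background vectors come out clearly different, which is the property a region-discriminative feature is meant to have. How the paper fuses such features with grid features in its encoder is not shown here.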

Cite This Study

Li et al. studied this question.

synapsesocial.com/papers/696ed06d6d8d470fca57ab8b https://doi.org/10.1007/s11263-025-02650-w
