Underwater image captioning facilitates the transformation from visual perception to semantic understanding in underwater computer vision. Despite advancements in this field, challenges remain in generating high-quality captions for underwater images. These challenges typically stem from (a) ambiguity between object and background regions for feature extraction, and (b) insufficient feature fusion across all regions. To address these challenges, we develop a large foundation model empowered region-aware underwater image captioning framework. Our novel contributions are two-fold: (a) A region-discriminative feature extraction strategy powered by the large foundation segment anything model (SAM) is developed. This strategy accurately delineates object and background regions through segmentation maps, enabling precise extraction of region-discriminative features. (b) A region-guided feature fusion strategy comprehensively fusing regional information throughout an encoding-decoding process is presented. This strategy utilizes a region-guided encoder for the progressive layer-wise fusion of region-discriminative features and grid features, followed by a meshed memory decoder that fuses multi-level encoded features, thereby enhancing the decoded features. Together, these contributions result in the generation of accurate and comprehensive underwater image captions. Experimental evaluations on three datasets demonstrate that our proposed framework achieves state-of-the-art performance for underwater image captioning.
Li et al. (Sat,) studied this question.