Remote sensing image captioning plays a crucial role in understanding and interpreting the contents of images captured by aircraft, drones, satellites, and other devices. Compared with natural images, however, remote sensing images pose distinct challenges for semantic feature extraction, including scale variation, diverse object distributions, and varying object counts. This research proposes a Convolutional Neural Network (CNN) model for image feature extraction, together with an efficient captioning framework built on the Transformer architecture to generate more accurate descriptions. The proposed CNN model produces multi-level image features, which are fed into the Transformer encoder. The encoded image features and the text features are then passed to the Transformer decoder, which generates a descriptive sentence for the input image. To validate the approach, experiments are conducted on the RSICD dataset. The results show that the proposed model outperforms previous methods on RSICD, achieving state-of-the-art captioning performance on this dataset.
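The abstract does not say how the multi-level CNN features are arranged before they enter the Transformer encoder. One common option, assumed here purely for illustration, is to flatten each feature map into a sequence of spatial tokens and concatenate the sequences. The sketch below only does the shape bookkeeping for that option. The level names, strides, channel counts, and the 224-pixel input size are hypothetical placeholders, not values taken from the paper.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class FeatureLevel:
    """One CNN feature level (all values are illustrative assumptions)."""
    name: str
    stride: int    # downsampling factor relative to the input image
    channels: int  # channel count, projected to the model width before encoding


def encoder_token_counts(image_size: int, levels: list[FeatureLevel]) -> dict[str, int]:
    """Count the spatial tokens each level contributes once its H x W grid is flattened."""
    counts = {}
    for level in levels:
        side = image_size // level.stride  # spatial side length of this feature map
        counts[level.name] = side * side   # one token per spatial position
    return counts


# Hypothetical three-level pyramid over a square 224-pixel input.
LEVELS = [
    FeatureLevel("c3", stride=8, channels=512),
    FeatureLevel("c4", stride=16, channels=1024),
    FeatureLevel("c5", stride=32, channels=2048),
]

counts = encoder_token_counts(224, LEVELS)
total_tokens = sum(counts.values())  # encoder sequence length if levels are concatenated
print(counts, total_tokens)
```

Under these assumptions the finest level dominates the encoder sequence length, so the choice of which levels to include directly affects the cost of self-attention in the encoder.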
Oo et al. studied this question.