Every day, millions of images are seen through all these mediums-social media, news headlines, advertisements. People have an implicit sense of what those images represent, but for machines, it is possible to generate meaningful insights only through complex algorithms. Captioning images forms one of the most basic applications of AI and pertains to textual descriptions that can help in enabling functionalities like automatic indexing, CBIR, and accessibility. Deep learning models demonstrated potential capabilities to automatically learn features to generate semantically rich captions that are coherent; however, template-based, and retrieval-based approaches find it challenging to implement flexibly to produce ultra-high-detail, context-specific captions. The techniques here, such as CNNs, extract visual features while RNNs and LSTMs generate descriptive text. The higher-level architectures included are the encoder-decoder frameworks and compositional models that provide further enhancement by aligning visual data and textual data. The paper briefly discusses deep learning techniques categorized into structure and application-based categories and tests the performance of benchmark datasets such as Flickr8k, Flickr30k, and MSCOCO. However, much remains to be done in terms of building models robust to complex and diverse visual content; thus, it is observed that there are challenges that carry forward to the work on multimodal integration and attention-based mechanisms to be improved in terms of better quality and accuracy by the captions.
Chawla et al. (Wed,) studied this question.
Synapse has enriched 5 closely related papers on similar clinical questions. Consider them for comparative context: