Image captioning is the task of generating concise natural-language descriptions of given images. It integrates computer vision and natural language processing, two cutting-edge artificial intelligence disciplines. Image captioning is now widely used in vision assistance, healthcare, remote sensing, and security. While English image captioning has advanced across domains, Hindi image captioning remains underdeveloped, lacking features such as autonomy, ensemble feature extraction, hybrid attention, and vision-boosted decoding. Additionally, integrating Hindi image captioning into vision aid tools is infeasible because existing approaches lack real-time and multi-captioning ability. This research introduces ChitraVivran, a novel real-time, end-to-end Hindi image captioning framework. Our framework employs an ensemble visual feature extraction module to generate boosted contextual descriptors, enriching the fusion of visual and semantic embeddings. A dataset named PASCAL 1K-Hindi has also been manually created by translating the PASCAL 1K-English image captioning dataset into Hindi. Various pipeline configurations, comprising ensemble feature extractors, attention mechanisms, and decoders, have also been developed and tested for Hindi image captioning. To enhance the applicability of Hindi image captioning in vision aid tools, our framework also incorporates real-time captioning and customized multi-captioning support. Experimental analysis on the Flickr 8K-Hindi dataset and our newly developed PASCAL 1K-Hindi dataset indicates that ChitraVivran produces improved quantitative results (Bilingual Evaluation Understudy (BLEU)-3: 26.73%, BLEU-4: 16.82%) and qualitative results against baselines. Our framework also demonstrates high performance in real-time captioning.
Sharma et al. studied this question.