Introduction: Sentiment analysis is an essential task in text mining and information retrieval, with applications spanning e-commerce, social media analytics, and political discourse. Despite extensive progress in monolingual settings, sentiment analysis of code-mixed text remains challenging because of linguistic blending, the absence of standardized resources, and the highly unstructured nature of social media data. This study addresses this gap by focusing on sentiment analysis in English–Marathi–Konkani (En–Mr–Kn), a relatively underexplored language triad in multilingual natural language processing (NLP).

Methods: A gold-standard annotated corpus was developed by collecting 31,884 code-mixed sentences from diverse social media platforms, including YouTube, WhatsApp, and Facebook. The sentences were manually annotated for sentiment polarity, and annotation reliability was validated with Krippendorff’s Alpha, which yielded a high inter-annotator agreement score of 0.911. To benchmark the dataset, multiple sentiment classification approaches were implemented, ranging from traditional machine learning to deep learning and transformer-based models.

Results: Among the evaluated models, multilingual BERT (mBERT), fine-tuned on both standard and code-mixed data, achieved the highest classification accuracy, 85.3%. The other models, including classical and deep learning techniques, were competitive but scored lower.

Discussion: The findings highlight the effectiveness of transformer-based models in handling code-mixed Indo-Aryan languages, particularly when combined with curated datasets that capture authentic linguistic variation. The performance gap between mBERT and conventional approaches underscores the need for advanced architectures to tackle the complexity of code-mixed sentiment analysis.

Conclusion: This work presents the first annotated corpus for En–Mr–Kn sentiment analysis, thereby filling a critical gap in multilingual NLP. The publicly released dataset and baseline implementations provide a foundation for future research on sentiment classification in low-resource, code-mixed language settings.
Phadte et al. (Thu,) studied this question.
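The abstract reports annotation reliability as Krippendorff's Alpha. As a rough illustration of how such a score can be computed for nominal labels (such as sentiment polarity), the sketch below builds the coincidence matrix and compares observed with expected disagreement. It is not the authors' script: the function name `krippendorff_alpha_nominal`, the input layout (one list of labels per sentence, with `None` for a missing annotation), and the toy labels are all assumptions made for this example.

```python
from collections import Counter
from itertools import permutations


def krippendorff_alpha_nominal(units):
    """Krippendorff's alpha for nominal labels (illustrative sketch).

    `units` is a list with one inner list per annotated item; each inner
    list holds the labels given by the annotators, with None for missing.
    """
    # Coincidence matrix: every ordered pair of labels within an item
    # contributes 1 / (m - 1), where m is the number of labels on that item.
    coincidences = Counter()
    for ratings in units:
        values = [r for r in ratings if r is not None]
        m = len(values)
        if m < 2:
            continue  # an item with fewer than two labels cannot be paired
        for a, b in permutations(values, 2):
            coincidences[(a, b)] += 1.0 / (m - 1)

    # Marginal totals n_c per label and the grand total n.
    totals = Counter()
    for (a, _b), weight in coincidences.items():
        totals[a] += weight
    n = sum(totals.values())
    if n <= 1:
        raise ValueError("need at least one item with two or more labels")

    # Nominal metric: any pair of different labels counts as disagreement.
    observed = sum(w for (a, b), w in coincidences.items() if a != b)
    expected = sum(
        totals[a] * totals[b] for a in totals for b in totals if a != b
    ) / (n - 1)
    if expected == 0:
        raise ValueError("alpha is undefined when only one label is used")
    return 1.0 - observed / expected


if __name__ == "__main__":
    # Hypothetical toy annotations: two annotators, four sentences.
    toy = [["pos", "pos"], ["pos", "neg"], ["neg", "neg"], ["neg", "neg"]]
    print(round(krippendorff_alpha_nominal(toy), 4))
```

A value of 1.0 corresponds to perfect agreement and values near 0 to chance-level agreement, so the reported 0.911 sits close to the top of the scale.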