What question did this study set out to answer?

The study aims to address the challenges in sentiment analysis of English-Marathi-Konkani code-mixed texts.

February 26, 2026

Sentiment Analysis of English-Marathi-Konkani Code-Mixed Texts: A Corpus and Model Evaluation Study

Key Points

The study aims to address the challenges in sentiment analysis of English-Marathi-Konkani code-mixed texts.
Developed a gold-standard annotated corpus with 31,884 code-mixed sentences from social media.
Manually annotated sentiment polarity and validated reliability with Krippendorff’s Alpha.
Implemented various sentiment classification approaches, including machine learning and deep learning methods.
The multilingual BERT model achieved the highest classification accuracy of 85.3%.
Other models performed competitively but ranked lower than mBERT.
A high inter-annotator agreement score of 0.911 indicates reliable annotations.

Abstract

Introduction: Sentiment analysis is an essential task in text mining and information retrieval, with applications spanning e-commerce, social media analytics, and political discourse. Despite extensive progress in monolingual settings, sentiment analysis of code-mixed text remains a challenge due to linguistic blending, the absence of standardized resources, and the highly unstructured nature of social media data. This study addresses this gap by focusing on sentiment analysis in English–Marathi–Konkani (En–Mr–Kn), a relatively underexplored language triad in multilingual natural language processing (NLP).

Methods: A gold-standard annotated corpus was developed by collecting 31,884 code-mixed sentences from diverse social media platforms, including YouTube, WhatsApp, and Facebook. The sentences were manually annotated for sentiment polarity, with reliability validated through Krippendorff’s Alpha, yielding a high inter-annotator agreement score of 0.911. To benchmark the dataset, multiple sentiment classification approaches were implemented, ranging from traditional machine learning to deep learning and transformer-based models.

Results: Among the evaluated models, the multilingual BERT (mBERT), fine-tuned on both standard and code-mixed data, achieved the highest classification accuracy of 85.3%. Other models, including classical and deep learning techniques, demonstrated competitive but comparatively lower performance.

Discussion: The findings highlight the effectiveness of transformer-based models in handling code-mixed Indo-Aryan languages, particularly when combined with curated datasets that capture authentic linguistic variation. The performance gap between mBERT and conventional approaches underscores the need for advanced architectures to tackle the complexity of code-mixed sentiment analysis.

Conclusion: This work presents the first annotated corpus for En–Mr–Kn sentiment analysis, thereby filling a critical gap in multilingual NLP. The publicly released dataset and baseline implementations provide a valuable foundation for future research, supporting the advancement of sentiment classification in low-resource, code-mixed language settings.
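
For readers unfamiliar with the reliability measure cited in the Methods, the following is a minimal sketch of Krippendorff's Alpha for nominal labels, which is the variant that would typically apply to sentiment polarity classes. The abstract does not state which variant or tool the authors used, so the function name `krippendorff_alpha_nominal`, the input layout, and the toy labels are illustrative assumptions, not the study's implementation or data.

```python
from collections import Counter
from itertools import permutations


def krippendorff_alpha_nominal(units):
    """Krippendorff's Alpha for nominal labels.

    `units` holds one list of labels per sentence, one entry per annotator;
    None marks a missing rating. This layout is an assumption for the sketch.
    """
    # Coincidence counts: every ordered pair of ratings from different
    # annotators on the same sentence contributes 1 / (m - 1).
    coincidences = Counter()
    for ratings in units:
        values = [r for r in ratings if r is not None]
        m = len(values)
        if m < 2:
            continue  # a single rating cannot be paired with anything
        for a, b in permutations(values, 2):
            coincidences[(a, b)] += 1.0 / (m - 1)

    # Marginal totals per label and overall number of pairable ratings.
    totals = Counter()
    for (a, _b), weight in coincidences.items():
        totals[a] += weight
    n = sum(totals.values())
    if n <= 1:
        raise ValueError("need at least two pairable ratings")

    # Observed and expected disagreement (nominal metric: 1 if labels differ).
    observed = sum(w for (a, b), w in coincidences.items() if a != b)
    expected = sum(
        totals[a] * totals[b] for a in totals for b in totals if a != b
    ) / (n - 1)
    if expected == 0:
        raise ValueError("alpha is undefined when only one label is used")
    return 1.0 - observed / expected


# Toy example: two hypothetical annotators labelling four sentences.
toy_units = [
    ["pos", "pos"],
    ["pos", "neg"],
    ["neg", "neg"],
    ["neg", "neg"],
]
print(round(krippendorff_alpha_nominal(toy_units), 3))
```

A value of 1 means perfect agreement and 0 means agreement no better than chance, so the reported 0.911 sits close to the top of the scale.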
