What question did this study set out to answer?

The research aims to evaluate the effectiveness of fine-tuned Hindi-BERT for sentiment analysis and topic classification in Hindi text.

May 22, 2026 | Open Access

Fine-Tuning Multilingual BERT for Hindi Text Classification: Sentiment Analysis and Topic Categorisation Using the HindiSentiment-6 Corpus

Key Points

Developed and utilized the HindiSentiment-6 corpus consisting of 12,000 Hindi documents.
Conducted comprehensive evaluations of fine-tuned Hindi-BERT against various baselines.
Analyzed attention weights to assess model understanding of sentiment-bearing words.
Hindi-BERT achieved 95.6% accuracy and 95.3% macro-F1 score for topic classification.
Hindi-BERT showed 91.2% accuracy in sentiment analysis.
Fine-tuned Hindi-BERT outperformed several models including TF-IDF+SVM (78.4%) and fastText (83.1%).

Abstract

Hindi is the most widely spoken language in India, with approximately 528 million native speakers and 600 million total speakers. Yet natural language processing (NLP) resources and pre-trained language models for Hindi remain substantially less developed than those for English, which limits the deployment of AI-driven text analysis applications in governance, healthcare, education, and digital commerce in Hindi-speaking markets. Transformer-based pre-trained language models, particularly multilingual BERT (mBERT) and its Hindi-specific variant Hindi-BERT, offer a transfer learning pathway for Hindi NLP tasks, but systematic comparisons of fine-tuning strategies, domain generalisation, and performance across classification tasks remain limited in the published literature on Indian language NLP. This paper presents a comprehensive evaluation of fine-tuned Hindi-BERT on two text classification tasks: six-class topic categorisation (politics, sports, entertainment, technology, health, business) and three-class sentiment analysis (positive, negative, neutral). Both tasks use the newly constructed HindiSentiment-6 corpus, a 12,000-document Hindi text dataset scraped from news portals (Dainik Bhaskar, Amar Ujala, Navbharat Times), social media (Twitter/X Hindi accounts), and e-commerce review platforms (Flipkart, Amazon India). Fine-tuned Hindi-BERT achieves 95.6% accuracy and 95.3% macro-F1 on topic classification and 91.2% accuracy on sentiment analysis across five domains, outperforming TF-IDF+SVM (78.4%), fastText (83.1%), character-level CNN (86.2%), BiLSTM (88.7%), and frozen mBERT (91.3%) baselines. Attention weight visualisation shows that the model assigns high attention to sentiment-bearing words (rohchak: interesting; prernadayak: inspiring; bekar: useless), consistent with human linguistic intuition. The HindiSentiment-6 corpus and fine-tuned model weights are released publicly to support the Hindi NLP research community.
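The abstract reports two metrics, accuracy and macro-F1. As a reading aid, the sketch below shows how the two are conventionally computed for a multi-class task such as the three-class sentiment setting. It is not the authors' evaluation code: the function names `accuracy` and `macro_f1` and the toy labels are assumptions made for illustration only, and the numbers it prints have no relation to the results reported above.

```python
from collections import Counter


def accuracy(y_true, y_pred):
    """Fraction of predictions that match the reference labels."""
    if len(y_true) != len(y_pred) or not y_true:
        raise ValueError("y_true and y_pred must be non-empty and equal in length")
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def macro_f1(y_true, y_pred):
    """Unweighted mean of per-class F1 over every label seen in either list."""
    if len(y_true) != len(y_pred) or not y_true:
        raise ValueError("y_true and y_pred must be non-empty and equal in length")
    tp, fp, fn = Counter(), Counter(), Counter()
    for t, p in zip(y_true, y_pred):
        if t == p:
            tp[t] += 1   # correct prediction for class t
        else:
            fp[p] += 1   # class p was predicted but is wrong
            fn[t] += 1   # class t was the truth but was missed
    labels = set(y_true) | set(y_pred)
    per_class = []
    for label in labels:
        # F1 = 2*TP / (2*TP + FP + FN), equivalent to the harmonic mean
        # of precision and recall for this class.
        denom = 2 * tp[label] + fp[label] + fn[label]
        per_class.append(2 * tp[label] / denom if denom else 0.0)
    return sum(per_class) / len(per_class)


# Toy labels (hypothetical, not drawn from HindiSentiment-6).
y_true = ["positive", "positive", "negative", "negative", "neutral", "neutral"]
y_pred = ["positive", "negative", "negative", "negative", "neutral", "positive"]

print(round(accuracy(y_true, y_pred), 3))  # 0.667
print(round(macro_f1(y_true, y_pred), 3))  # 0.656
```

Because macro-F1 weights every class equally regardless of its size, it can diverge from accuracy when classes are imbalanced, which is one reason to report both.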
