What question did this study set out to answer?

The aim is to enhance text classification for Bangla regional dialects through a hybrid approach involving lexical oversampling and BERT models.

April 21, 2026 | Open Access

A hybrid approach for Bangla regional text classification using region-specific lexical oversampling and BERT ensemble learning

Key Points

Analyzed a dataset of 4218 text samples from five regional dialects in Bangladesh.
Organized the expert-validated dataset into three tiers by level of expert agreement, and balanced it by oversampling with expert-validated region-specific words.
Employed a heterogeneous ensemble technique combining multiple BERT models for improved classification performance.
Achieved 67.45% accuracy and a 67.62% weighted F1-score with BanglaBERT, the best individual model, when training on Tiers 2 and 3 and testing on Tier 1.
The ensemble of three BERT models reached an accuracy of 85.17% and a weighted F1-score of 84.84% on the high-quality Tier 1 dataset.
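
The summary does not say how the three models' outputs are combined. As a minimal sketch, assuming soft voting (averaging each model's class probabilities and taking the most probable class), the ensemble step might look like the following. The function name `soft_vote` and the probability values are hypothetical and are not taken from the study.

```python
from typing import Sequence

# The five dialect regions named in the study.
REGIONS = ["Chittagong", "Sylhet", "Noakhali", "Barishal", "Rangpur"]


def soft_vote(prob_rows: Sequence[Sequence[float]]) -> int:
    """Average per-class probabilities across models; return the top class index.

    Assumption: soft voting. The study's actual combination rule is not stated.
    """
    n_models = len(prob_rows)
    n_classes = len(prob_rows[0])
    avg = [sum(row[c] for row in prob_rows) / n_models for c in range(n_classes)]
    return max(range(n_classes), key=avg.__getitem__)


# Hypothetical softmax outputs for one sentence from three fine-tuned models
# (illustrative numbers only, not results from the paper).
model_probs = [
    [0.10, 0.60, 0.10, 0.10, 0.10],
    [0.05, 0.40, 0.45, 0.05, 0.05],
    [0.10, 0.50, 0.20, 0.10, 0.10],
]

predicted = REGIONS[soft_vote(model_probs)]
print(predicted)
```

Averaging probabilities lets a confident model outweigh an uncertain one, which is one reason soft voting is often preferred over majority voting when the member models differ in architecture.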

Abstract

Regional text analysis reflects the lived realities of diverse communities by capturing the linguistic richness and diversity present in various dialects. It bridges the gap between everyday regional usage and standardized language forms, thereby enhancing the inclusivity of language technologies. In this paper, we focus on five regional dialects in Bangladesh, namely Chittagong, Sylhet, Noakhali, Barishal, and Rangpur, using a dataset of 4218 text samples. The dataset is validated by five regional experts and categorized into three tiers based on an assigned agreement criterion. Tier 1 represents a strictly filtered, high-confidence subset and is used primarily for evaluation. A set of region-specific special words, which belong exclusively to their respective regions and are validated by domain experts, is introduced. These words are used in a linguistically informed oversampling technique to balance the dataset in both experiments. In the first experiment, we demonstrate the effectiveness of the tiered dataset structure, where Tier 2 and Tier 3 (medium- and low-confidence subsets) are used for training, and Tier 1 (high-quality subset) is used for testing. In this setting, BanglaBERT achieves the best individual performance with 67.45% accuracy and a weighted F1-score of 67.62%. In the second experiment, we focus exclusively on the Tier 1 dataset, applying a wide range of machine learning and deep learning models to assess their effectiveness. The key contribution is a heterogeneous deep ensemble technique that combines three BERT models, BanglaBERT, BUETBERT, and DistilBERT, achieving an accuracy of 85.17% and a weighted F1-score of 84.84% on the Tier 1 dataset.
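
The abstract gives neither the agreement thresholds behind the three tiers nor the exact oversampling procedure. The sketch below is one possible reading, under two labeled assumptions: tiers are set by how many of the five experts approve a sample, and minority regions are topped up by resampling texts that contain a region-specific word. The names `assign_tier` and `lexical_oversample`, the thresholds, and the placeholder tokens are all hypothetical.

```python
import random
from collections import Counter, defaultdict


def assign_tier(n_agree: int, n_experts: int = 5) -> int:
    """Map expert agreement to a tier. Thresholds are assumed, not from the study."""
    if n_agree == n_experts:
        return 1  # unanimous: high-confidence subset
    if n_agree == n_experts - 1:
        return 2  # one dissent: medium confidence
    return 3      # everything else: low confidence


def lexical_oversample(samples, special_words, seed=0):
    """Balance classes by duplicating minority-region texts.

    Assumption: texts containing a region-specific word are preferred as
    the resampling pool; if none exist, any text from the region is used.
    """
    rng = random.Random(seed)
    by_region = defaultdict(list)
    for text, region in samples:
        by_region[region].append(text)
    target = max(len(texts) for texts in by_region.values())
    balanced = list(samples)
    for region, texts in by_region.items():
        words = special_words.get(region, set())
        marked = [t for t in texts if words & set(t.split())]
        pool = marked or texts
        for _ in range(target - len(texts)):
            balanced.append((rng.choice(pool), region))
    return balanced


# Placeholder tokens stand in for real dialect words.
samples = [
    ("a ctg_word b", "Chittagong"),
    ("c d", "Chittagong"),
    ("e f", "Chittagong"),
    ("g syl_word", "Sylhet"),
    ("h i", "Sylhet"),
]
special_words = {"Chittagong": {"ctg_word"}, "Sylhet": {"syl_word"}}

balanced = lexical_oversample(samples, special_words)
counts = Counter(region for _, region in balanced)
print(counts)
```

Drawing from texts that carry a region-exclusive word keeps the added samples anchored to vocabulary that distinguishes the dialect, rather than to sentences that could belong to any region.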
