March 23, 2026

Open Access

Bangla-MedER: An Annotated Bangla Dataset for Multi-Type Medical Entity Recognition from Medical Text

Key Points

The aim is to create a dataset for multi-type medical entity recognition in Bangla text.
Manually curated dataset containing 2980 records
Records collected from medicine-related websites and pharmaceutical articles
Annotations across six entity types verified by medical experts
Privacy measures implemented by removing proprietary names
Dataset supports medical named entity recognition in Bangla medical text
Facilitates biomedical information retrieval and decision-support systems
Provides a benchmark for research in healthcare NLP and biomedical informatics

Abstract

Medical Entity Recognition (MedER) systems are needed to enhance the use and accessibility of Natural Language Processing (NLP) methods in the medical field. Since medical entity recognition in Bangla is a relatively new field, no such datasets are currently available in any repository. Unlike AI-generated data, which may contain biases or errors from automated algorithms, the Bangla-MedER dataset is a manually curated resource for multi-type medical entity recognition in Bangla-language drug-indication text. A total of 2980 records were collected from publicly available medicine-related websites, pharmaceutical articles, and other sources of drug information. Each record contains the original Bangla medical text, along with expert-verified annotations across six entity types: medicine/chemical name, organ, disease, hormone, pharmacological class, and common medical terms. The raw transcribed text is provided to support reproducible research. All annotations were performed manually under the guidance of a certified medical expert, and proprietary brand names and personally identifiable information were removed to ensure privacy. The Bangla-MedER dataset enables a variety of applications, including medical named entity recognition in Bangla medical text, biomedical information retrieval, and clinical decision-support systems in a low-resource language environment. The complete raw dataset, along with documentation, is publicly available, offering a benchmark for medical entity recognition, healthcare NLP, biomedical informatics, and other related research.
