Medical Entity Recognition (MedER) systems are needed to make Natural Language Processing (NLP) methods more useful and accessible in the medical field. Because medical entity recognition in Bangla is a relatively new field, no such datasets are currently available in any repository. Unlike AI-generated data, which may contain biases or errors introduced by automated algorithms, the Bangla-MedER dataset is a manually curated resource for multi-type medical entity recognition in Bangla-language drug-indication text. A total of 2980 records were collected from publicly available medicine-related websites, pharmaceutical articles, and other sources of drug information. Each record contains the original Bangla medical text along with expert-verified annotations across six entity types: medicine/chemical name, organ, disease, hormone, pharmacological class, and common medical terms. The raw transcribed text is provided to support reproducible research. All annotations were performed manually under the guidance of a certified medical expert, and proprietary brand names and personally identifiable information were removed to protect privacy. The Bangla-MedER dataset supports a variety of applications, including medical named entity recognition in Bangla medical text, biomedical information retrieval, and clinical decision-support systems in a low-resource language setting. The complete raw dataset, along with documentation, is publicly available and offers a benchmark for medical entity recognition, healthcare NLP, biomedical informatics, and other related research.
Sheikh et al. (Sun,) studied this question.