What type of study is this?

This is a validation study.

November 25, 2025

Open Access

Domain and Language adaptive pre-training of BERT models for Korean-English bilingual clinical text analysis

Abstract

Objective: To develop bilingual Korean-English medical language models through domain- and language-adaptive pre-training and to evaluate their performance in clinical text analysis tasks, specifically semantic similarity and multi-label classification.

Methods: A bilingual corpus of clinical texts was constructed from Korean sources (medical textbooks and online health articles) and English sources (medical textbooks, health-related articles, and MIMIC-IV EHRs). Three BERT-based foundation models (the Korean medical KM-BERT, the English biomedical BioBERT, and the multilingual general-domain M-BERT) underwent additional pre-training using a newly created bilingual WordPiece vocabulary (45,000 tokens). Model performance was assessed intrinsically on the medical semantic textual similarity (MedSTS) benchmark and extrinsically through multi-label classification of chest computed tomography (CT) reports from tertiary hospitals. Macro F1 scores and Pearson's correlation coefficients were used as primary evaluation metrics.

Results: After bilingual pre-training, the Korean semantic similarity performance of bi-BioBERT improved significantly, with the Pearson correlation coefficient rising from 0.190 to 0.871. In the multi-label classification of chest CT reports, all bilingual models outperformed their respective foundation models; bi-KM-BERT achieved the highest Macro F1 score in both internal validation (0.9460 vs. 0.8902 for KM-BERT) and external validation (0.9288 vs. 0.8495 for KM-BERT). However, bi-KM-BERT and bi-M-BERT experienced semantic performance declines in Korean tasks, indicating catastrophic forgetting. Gradient-based token-importance heatmaps confirmed that the bilingual models captured critical cross-lingual medical contexts more effectively.

Conclusion: The findings underscore that careful bilingual vocabulary curation and targeted domain-adaptive pre-training enhance natural language processing (NLP) performance in multilingual clinical environments, even with modest training resources. Continual-learning strategies should be explored to mitigate minor forgetting effects. Domain- and language-adaptive pre-training on bilingual medical corpora improves NLP model performance in multilingual clinical settings, thereby providing a scalable strategy for enhancing clinical text analysis capabilities in resource-limited bilingual contexts.
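
The two evaluation metrics named in the abstract can be illustrated with a minimal sketch. It assumes the standard definitions: Pearson's correlation between predicted and gold similarity scores, and macro F1 computed as the unweighted mean of per-label F1 over a multi-label indicator matrix. The function names, the toy data, and the choice to score an empty label as 0 are illustrative assumptions, not details taken from the paper.

```python
from math import sqrt
from typing import Sequence


def pearson_r(x: Sequence[float], y: Sequence[float]) -> float:
    """Pearson correlation between two equal-length score sequences."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("need two equal-length sequences with at least 2 items")
    mean_x = sum(x) / len(x)
    mean_y = sum(y) / len(y)
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for constant input")
    return cov / sqrt(var_x * var_y)


def macro_f1(y_true: Sequence[Sequence[int]], y_pred: Sequence[Sequence[int]]) -> float:
    """Macro F1 for multi-label data: per-label F1, then an unweighted mean."""
    n_labels = len(y_true[0])
    f1_scores = []
    for j in range(n_labels):
        pairs = [(t[j], p[j]) for t, p in zip(y_true, y_pred)]
        tp = sum(1 for t, p in pairs if t == 1 and p == 1)
        fp = sum(1 for t, p in pairs if t == 0 and p == 1)
        fn = sum(1 for t, p in pairs if t == 1 and p == 0)
        denom = 2 * tp + fp + fn
        # Assumption: a label with no positives and no predictions scores 0.
        f1_scores.append(2 * tp / denom if denom else 0.0)
    return sum(f1_scores) / n_labels


# Toy data (hypothetical): 4 sentence pairs and 4 reports with 3 finding labels.
gold_sim = [0.0, 1.5, 3.0, 4.5]
pred_sim = [0.2, 1.4, 3.1, 4.4]
gold_labels = [[1, 0, 1], [0, 1, 0], [1, 1, 0], [0, 0, 1]]
pred_labels = [[1, 0, 1], [0, 1, 0], [1, 0, 0], [0, 0, 1]]

print(pearson_r(gold_sim, pred_sim))
print(round(macro_f1(gold_labels, pred_labels), 4))  # 0.8889
```

Because macro averaging weights every label equally, a rare finding counts as much as a common one, which is a frequent reason for preferring it in multi-label report classification.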
