What question did this study set out to answer?

This study set out to assess how well Korean-native and multilingual large language models perform on medical question answering about arthritis.

May 9, 2026 | Open Access

Korean Large Language Models for Medical Question Answering on Arthritis: Fine-tuning and Comparative Evaluation

Key Points

Assessed Korean-native and multilingual large language models (LLMs) on medical question answering for arthritis.
Built a dataset of 5,451 Korean QA pairs from a public medical corpus.
Fine-tuned five LLMs (Mi:dm, EXAONE, Kanana, HyperCLOVAX, LLaMA) using Low-Rank Adaptation with 4-bit quantization.
Evaluated model performance using BERTScore-F1, SBERT similarity, and qualitative metrics.
EXAONE and HyperCLOVAX displayed comparable performance in semantic accuracy and contextual consistency.
Mi:dm achieved the highest quality in clinical correctness despite lower similarity scores.
LLaMA underperformed in Korean QA, indicating challenges in domain adaptation.
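
To make the fine-tuning method concrete, the following is a minimal sketch of the idea behind Low-Rank Adaptation: the pretrained weight stays frozen and only a small low-rank update is trained. It is an illustration in plain Python, not the study's training code. The class name `LoRALinear`, the rank, the `alpha` scaling value, and the toy weights are all hypothetical, and the 4-bit quantization of the frozen weight is omitted for brevity.

```python
import random


def matvec(matrix, vector):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(w * x for w, x in zip(row, vector)) for row in matrix]


class LoRALinear:
    """Frozen weight W plus a trainable low-rank update (alpha / rank) * B @ A.

    Hypothetical illustration: names and hyperparameters are assumptions,
    not values reported by the study.
    """

    def __init__(self, weight, rank, alpha, seed=0):
        rng = random.Random(seed)
        d_out, d_in = len(weight), len(weight[0])
        self.weight = weight  # frozen base weight, shape (d_out, d_in)
        self.scale = alpha / rank
        # A starts with small random values and B with zeros, so the adapter
        # contributes nothing until training changes B.
        self.A = [[rng.gauss(0.0, 0.01) for _ in range(d_in)] for _ in range(rank)]
        self.B = [[0.0] * rank for _ in range(d_out)]

    def forward(self, x):
        base = matvec(self.weight, x)                # W x (frozen path)
        update = matvec(self.B, matvec(self.A, x))   # B (A x) (trainable path)
        return [b + self.scale * u for b, u in zip(base, update)]

    def trainable_parameters(self):
        """Count only the adapter parameters in A and B."""
        return len(self.A) * len(self.A[0]) + len(self.B) * len(self.B[0])


def full_parameters(weight):
    """Count the parameters a full fine-tune of this layer would update."""
    return len(weight) * len(weight[0])
```

For a layer with input size d_in and output size d_out, the adapter trains rank × (d_in + d_out) parameters instead of d_in × d_out, which is why the approach is attractive for fine-tuning several models under the same budget.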

Abstract

Objectives: Large language models (LLMs) are increasingly being used for medical question-answering (QA) tasks. However, most models are trained primarily on English-language data, which limits their effectiveness in non-English clinical contexts. In healthcare settings such as Korea, LLMs adapted to both the local language and the medical domain are needed. This study aimed to evaluate the performance of Korean-native and multilingual LLMs fine-tuned on Korean arthritis-related medical QA data and to examine the impact of language and domain adaptation.

Methods: A dataset of 5,451 Korean QA pairs related to arthritis was constructed from a public medical corpus (AI Hub). Five LLMs (Mi:dm, EXAONE, Kanana, HyperCLOVAX, and LLaMA) were fine-tuned under identical conditions using Low-Rank Adaptation with 4-bit quantization. Model performance was evaluated on 597 validation samples using BERTScore-F1 and SBERT similarity, along with a qualitative evaluation of clinical correctness, safety, and response completeness.

Results: EXAONE and HyperCLOVAX showed comparable quantitative performance in semantic accuracy and contextual consistency. Mi:dm achieved lower similarity-based scores than EXAONE and HyperCLOVAX but showed the highest performance in the qualitative evaluation, particularly for clinical correctness. Kanana exhibited moderate performance with limited domain adaptability. LLaMA showed the lowest performance in Korean medical QA despite achieving the largest relative improvement, indicating challenges in adapting to Korean clinical contexts.

Conclusions: Domain-specific fine-tuning and Korean-oriented model design improved performance in Korean arthritis medical QA. EXAONE and HyperCLOVAX achieved the highest semantic similarity, whereas Mi:dm demonstrated superior clinical correctness and safety in the qualitative evaluation. The general-purpose multilingual LLaMA remained limited despite substantial gains, which supports the development of disease-specific Korean medical LLMs.
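
The two similarity metrics named in the Methods can be sketched as follows. This is a simplified illustration, not the evaluation code used in the study: it assumes token and sentence embeddings have already been computed by some encoder, uses toy vectors in their place, and leaves out the optional IDF weighting and baseline rescaling of BERTScore. The function names `cosine` and `bertscore_f1` are hypothetical.

```python
import math


def cosine(u, v):
    """Cosine similarity between two vectors; 0.0 if either has zero norm.

    Applied to two sentence embeddings, this is the kind of score an
    SBERT-style similarity reports.
    """
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def bertscore_f1(reference_tokens, candidate_tokens):
    """Greedy-matching F1 over token embeddings (unweighted BERTScore).

    Recall: each reference token is matched to its most similar candidate token.
    Precision: each candidate token is matched to its most similar reference token.
    """
    recall = sum(
        max(cosine(ref, cand) for cand in candidate_tokens)
        for ref in reference_tokens
    ) / len(reference_tokens)
    precision = sum(
        max(cosine(cand, ref) for ref in reference_tokens)
        for cand in candidate_tokens
    ) / len(candidate_tokens)
    if precision + recall == 0.0:
        return 0.0
    return 2 * precision * recall / (precision + recall)
```

Because both scores measure closeness to a reference answer rather than medical validity, a model can score lower on them and still be rated higher on clinical correctness, which is consistent with the pattern reported for Mi:dm.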
