Objectives: Large language models (LLMs) are increasingly being used for medical question-answering (QA) tasks. However, most models are trained primarily on English-language data, which limits their effectiveness in non-English clinical contexts. In healthcare settings such as Korea, LLMs adapted to both the local language and the medical domain are needed. This study aimed to evaluate the performance of Korean-native and multilingual LLMs fine-tuned on Korean arthritis-related medical QA data and to examine the impact of language and domain adaptation.

Methods: A dataset of 5,451 Korean QA pairs related to arthritis was constructed from a public medical corpus (AI Hub). Five LLMs (Mi:dm, EXAONE, Kanana, HyperCLOVAX, and LLaMA) were fine-tuned under identical conditions using Low-Rank Adaptation with 4-bit quantization. Model performance was evaluated on 597 validation samples using BERTScore-F1 and SBERT similarity, along with a qualitative evaluation of clinical correctness, safety, and response completeness.

Results: EXAONE and HyperCLOVAX showed comparable quantitative performance in semantic accuracy and contextual consistency. Mi:dm achieved lower similarity-based scores than EXAONE and HyperCLOVAX but showed the highest performance in the qualitative evaluation, particularly for clinical correctness. Kanana exhibited moderate performance with limited domain adaptability. LLaMA showed the lowest performance in Korean medical QA, although it achieved the largest relative improvement, indicating challenges in adaptation to Korean clinical contexts.

Conclusions: Domain-specific fine-tuning and Korean-oriented model design improved performance in Korean arthritis medical QA. EXAONE and HyperCLOVAX achieved the highest semantic similarity, whereas Mi:dm demonstrated superior clinical correctness and safety in the qualitative evaluation. General multilingual LLaMA remained limited despite substantial gains, supporting the development of disease-specific Korean medical LLMs.
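
For readers unfamiliar with the similarity metrics named in the Methods, the following is a minimal sketch of how they are commonly computed, not the study's own evaluation code. It assumes token and sentence embeddings are already available; the toy vectors and the function names (`cosine`, `bertscore_f1`, `relative_improvement`) are hypothetical, and the sketch omits the optional IDF weighting and baseline rescaling that BERTScore supports. In practice the token embeddings would come from a BERT-style encoder and the sentence embeddings from an SBERT model. The abstract does not define how relative improvement was measured; the last function assumes the usual (after - before) / before form against a pre-fine-tuning score.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def bertscore_f1(cand_tokens, ref_tokens):
    """Greedy-matching BERTScore F1 over two lists of token embeddings.

    Precision: each candidate token is matched to its most similar
    reference token. Recall: each reference token is matched to its
    most similar candidate token. F1 is their harmonic mean.
    """
    precision = sum(
        max(cosine(c, r) for r in ref_tokens) for c in cand_tokens
    ) / len(cand_tokens)
    recall = sum(
        max(cosine(r, c) for c in cand_tokens) for r in ref_tokens
    ) / len(ref_tokens)
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


def relative_improvement(before, after):
    """Assumed definition: score gain as a fraction of the starting score."""
    return (after - before) / before


# Toy embeddings (hypothetical stand-ins for real encoder outputs).
candidate_tokens = [[1.0, 0.0], [0.0, 1.0]]
reference_tokens = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
token_f1 = bertscore_f1(candidate_tokens, reference_tokens)

# SBERT-style similarity: one embedding per whole answer, compared by cosine.
sentence_similarity = cosine([1.0, 2.0, 3.0], [2.0, 4.0, 6.0])

print(f"BERTScore-F1 (toy): {token_f1:.4f}")
print(f"Sentence similarity (toy): {sentence_similarity:.4f}")
```

The two metrics answer different questions: the token-level F1 rewards overlap in individual word meanings, while the sentence-level cosine compares the answers as wholes. Neither checks clinical correctness directly, which is why similarity scores and a qualitative evaluation can rank models differently.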
Jun‐hee Kim (Thu,) studied this question.