What question did this study set out to answer?

This research aims to evaluate the diagnostic performance of large language models in differentiating between jawbone malignancy and osteomyelitis.

April 29, 2026 · Open Access

Bi-linguistic performance of large language models in multimodal analysis for differentiating jawbone-destroying malignancy from osteomyelitis

Key Points

Retrospective diagnostic accuracy study with 50 patients.
Three multimodal language models assessed using Korean and English prompts.
Inputs were added stepwise: panoramic radiograph only, then panoramic radiograph plus computed tomography, then both plus histopathology slides.
Overall diagnostic accuracy improved from 0.683 to 0.978 with additional imaging modalities.
Adding histopathology slides significantly increased the odds of a correct diagnosis (p < 0.0001).
ChatGPT showed higher accuracy in Korean compared to English.

Abstract

Differentiating jawbone-destroying malignancy from osteomyelitis remains a major diagnostic challenge in oral and maxillofacial surgery because these entities share overlapping radiologic features but require fundamentally different management strategies. This study evaluated the bi-linguistic diagnostic performance of advanced multimodal large language models (LLMs) in distinguishing these conditions using stepwise multimodal inputs. In this retrospective diagnostic accuracy study, 50 patients with histopathologically confirmed malignancy or osteomyelitis of the maxilla or mandible were included. Three multimodal LLMs (ChatGPT, Claude, and Gemini) were assessed using standardized prompts in Korean and English under three imaging conditions: panoramic radiograph only (P), panoramic radiograph plus computed tomography (CT) (P + C), and panoramic radiograph plus CT plus histopathology slide (P + C + B). Diagnostic accuracy, sensitivity, and specificity were evaluated against histopathology as the reference standard using generalized linear mixed models. Overall diagnostic accuracy increased significantly with additional modalities, from 0.683 (95% CI, 0.542–0.797) under the P condition to 0.776 (95% CI, 0.652–0.865) under P + C, and to 0.978 (95% CI, 0.953–0.990) under P + C + B. Incorporation of histopathology slides markedly increased the odds of a correct diagnosis compared with the P and P + C conditions (both p < 0.0001), while the addition of CT alone showed a nonsignificant trend toward improvement. Under limited imaging conditions, models tended to overdiagnose malignancy, reflecting high sensitivity but low specificity. With full multimodal input, diagnostic performance was balanced across all models and both languages. Notably, ChatGPT demonstrated higher diagnostic accuracy in the Korean-language condition than in English.
Overall, these findings suggest that multimodal LLMs may support diagnostic interpretation by integrating heterogeneous imaging information, highlighting their potential role as adjunctive decision-support tools within existing maxillofacial diagnostic frameworks.
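
To make concrete what "high sensitivity but low specificity" means when a model overdiagnoses malignancy, the following is a minimal sketch of how accuracy, sensitivity, and specificity are computed from a 2×2 confusion table, treating malignancy as the positive class. The counts are hypothetical and chosen only for illustration; they are not taken from the study, and the names `ConfusionCounts` and `diagnostic_metrics` are assumptions introduced here. The study itself estimated these quantities with generalized linear mixed models, which this sketch does not reproduce.

```python
from dataclasses import dataclass


@dataclass
class ConfusionCounts:
    """2x2 confusion table with malignancy as the positive class."""
    tp: int  # malignancy cases labeled malignancy
    fn: int  # malignancy cases labeled osteomyelitis
    tn: int  # osteomyelitis cases labeled osteomyelitis
    fp: int  # osteomyelitis cases labeled malignancy


def diagnostic_metrics(c: ConfusionCounts) -> dict:
    """Return plain (unmodeled) accuracy, sensitivity, and specificity."""
    total = c.tp + c.fn + c.tn + c.fp
    return {
        "accuracy": (c.tp + c.tn) / total,
        "sensitivity": c.tp / (c.tp + c.fn),
        "specificity": c.tn / (c.tn + c.fp),
    }


# Hypothetical counts, NOT from the study: a model that labels most
# cases as malignancy catches nearly every malignancy (few false
# negatives) but mislabels many osteomyelitis cases (many false positives).
example = ConfusionCounts(tp=9, fn=1, tn=4, fp=6)
metrics = diagnostic_metrics(example)

for name, value in metrics.items():
    print(f"{name}: {value:.2f}")
```

In this illustrative table the model misses only one malignancy but mislabels more than half of the osteomyelitis cases, which is the pattern the abstract describes under limited imaging conditions.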
