Although Large Language Models (LLMs) possess extensive medical knowledge, they often struggle to emulate the complex, iterative process of real-world clinical diagnosis. To address this limitation, we present ClinDiag-GPT, a specialized LLM fine-tuned to execute full diagnostic procedures, supported by the ClinDiag-Framework evaluation system and ClinDiag-Benchmark, a dataset comprising 4,421 real-world cases. Our evaluation shows that existing LLMs, including GPT-4o-mini, GPT-4o, Claude-3-Haiku, Qwen2.5-72b, Qwen2.5-32b, and Qwen2.5-14b, are proficient in static tasks but fall short in dynamic diagnostic workflows and frequently commit clinical errors. In contrast, ClinDiag-GPT, trained on clinical cases, outperforms all baseline models in both diagnostic accuracy and procedural performance. Furthermore, a comparative analysis shows that physicians collaborating with ClinDiag-GPT achieve higher diagnostic accuracy and efficiency than either achieves alone, demonstrating the utility of ClinDiag-GPT as a clinical assistant.
Chen et al. studied this question.