BACKGROUND Case-based learning using standardized patients is a key method for teaching communication skills in medicine, but it faces logistical and financial hurdles. While Large Language Models (LLMs) show promise for creating scalable patient simulations, current research often overlooks user-centered design and direct comparison of different LLMs.

OBJECTIVE To describe the user-centered design process and system architecture of a digital tool that leverages LLMs to simulate patient conversations for medical education, focusing specifically on taking a medical history. A further objective is to compare how well different LLMs simulate patient encounters.

METHODS We followed a user-centered design process, gathering initial requirements from two medical students. We then developed a fully functional web prototype using a Python Flask backend and a PostgreSQL database, integrating five LLMs from OpenAI, Anthropic, and xAI. The system consists of an AI-assisted case vignette generator and a dynamic patient simulator. To evaluate the system, we first conducted a task-based usability test with five medical students, measuring their experience with the standardized System Usability Scale (SUS) and qualitative questions. Second, we conducted a comparative analysis in which four practicing physicians rated the simulation quality of three models (Grok 3, GPT-4, and Claude 3 Opus) on seven criteria, each on a 5-point scale.

RESULTS Usability testing yielded a mean SUS score of 91.5 (SD = 8.40), indicating "excellent" usability. The students unanimously praised the system's simplicity and intuitive design. However, they consistently identified the lack of a formal conclusion and of feedback on their performance as a key weakness, expressing a desire for a "didactic loop" to maximize the learning effect.
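For readers unfamiliar with how a SUS value such as the one reported above is obtained, the following is a minimal sketch of the standard SUS scoring rule for a single participant. The function name `sus_score` is hypothetical and not taken from the study's code base. It assumes the usual ten-item questionnaire answered on a 1 to 5 scale, with odd-numbered items worded positively and even-numbered items worded negatively.

```python
def sus_score(responses):
    """Return the SUS score (0-100) for one participant.

    `responses` holds the ten answers in questionnaire order,
    each an integer from 1 (strongly disagree) to 5 (strongly agree).
    """
    if len(responses) != 10:
        raise ValueError("SUS requires exactly ten responses")
    if any(r not in (1, 2, 3, 4, 5) for r in responses):
        raise ValueError("each response must be an integer from 1 to 5")

    total = 0
    for item_number, answer in enumerate(responses, start=1):
        if item_number % 2 == 1:
            # Odd (positively worded) items contribute answer - 1.
            total += answer - 1
        else:
            # Even (negatively worded) items contribute 5 - answer.
            total += 5 - answer

    # The summed contributions (0-40) are scaled to the 0-100 range.
    return total * 2.5
```

A participant who answers 5 on every odd item and 1 on every even item would score 100.0 under this rule, and uniformly neutral answers of 3 would score 50.0. The study-level mean and standard deviation would then be taken over the per-participant scores.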
In our LLM comparison, Grok 3 achieved the highest overall rating (M = 4.25, SD = 0.75), excelling at depicting realistic timelines and responding to follow-up questions. GPT-4 followed with a mean score of M = 4.14 (SD = 0.8), showing strength in symptom coherence but weakness in portraying realistic uncertainty. Claude 3 Opus was rated lowest (M = 3.86, SD = 0.97) and exhibited the most performance variability.

CONCLUSIONS We successfully developed a highly usable patient simulation tool that serves as a foundation for further development. Our results show that while the tool is effective for communication training, its full potential will only be realized by integrating an automated feedback mechanism to create a complete didactic loop, as requested by users. Based on our evaluation, we recommend Grok 3 as the primary model for medical patient simulations, with GPT-4 as a reliable alternative.
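As an illustration of how per-model figures of this kind could be derived, the sketch below pools every rating a model received and reports the mean and sample standard deviation. This is an assumption: the abstract does not state whether ratings were pooled or averaged per criterion first, nor which SD estimator was used. The function name `summarize_ratings` and the model labels are hypothetical, and the numbers are placeholders, not the study's data.

```python
from statistics import mean, stdev


def summarize_ratings(ratings_by_model):
    """Map each model to (mean, sample SD) of its pooled 1-5 ratings.

    `ratings_by_model` maps a model label to a flat list of ratings,
    pooled across all raters and criteria (an assumed aggregation).
    """
    return {
        model: (round(mean(ratings), 2), round(stdev(ratings), 2))
        for model, ratings in ratings_by_model.items()
    }


# Placeholder ratings for illustration only; not taken from the study.
example = {
    "model_a": [4, 5, 3, 4],
    "model_b": [3, 3, 4, 2],
}
summary = summarize_ratings(example)
```

With four physicians and seven criteria, each model's list would hold 28 ratings under this pooling assumption.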
Elhilali et al. (Fri,) studied this question.