What question did this study set out to answer?

This study investigates how input structure affects the stability and reproducibility of clinical reasoning in large language models (LLMs), using pediatric residency-level multiple-choice questions (MCQs).

April 20, 2026 | Open Access

Input structure–driven instability and convergence in large language model clinical reasoning: a formative study using pediatric residency–level MCQs

Key Points

A standardized prompt was used to generate pediatric residency-level MCQs emphasizing diagnostic reasoning and management.
Three pediatric physicians reviewed 100 draft questions for accuracy and relevance; 23 were excluded, yielding a validated set of 77 MCQs.
Six LLMs were evaluated under two delivery conditions: all questions presented simultaneously, and sequential delivery in batches of ten.
Accuracy and inter-model variability were compared using paired t-tests and one-way ANOVA.
With simultaneous presentation, accuracy ranged from 38% to 90% across models, with significant inter-model differences, indicating poor reproducibility.
With batch delivery, accuracy converged to 83%-88%, with no statistically significant inter-model differences.
Structured batch delivery reduced performance dispersion and improved stability across all evaluated LLMs.
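
The batch-of-ten delivery and the reported accuracy ranges can be illustrated with a minimal Python sketch. It is not the authors' procedure: the summary does not say how the seven questions left over after seven full batches were handled, so the sketch assumes a shorter final batch. The names `make_batches` and `spread` and the placeholder question IDs are hypothetical.

```python
from typing import List, Sequence, TypeVar

T = TypeVar("T")


def make_batches(items: Sequence[T], batch_size: int = 10) -> List[List[T]]:
    """Split items into consecutive batches; the last batch may be shorter."""
    if batch_size < 1:
        raise ValueError("batch_size must be positive")
    return [list(items[i:i + batch_size]) for i in range(0, len(items), batch_size)]


def spread(low_pct: float, high_pct: float) -> float:
    """Width of an accuracy range, in percentage points."""
    return high_pct - low_pct


# Placeholder IDs standing in for the 77 validated MCQs (assumption: the
# real questions would be delivered to each model in this batched order).
question_ids = list(range(1, 78))
batches = make_batches(question_ids, batch_size=10)
print(len(batches), [len(b) for b in batches])

# Accuracy ranges as reported in the summary (percent).
print(spread(38, 90))  # simultaneous delivery
print(spread(83, 88))  # batch delivery
```

Under this assumption the 77 questions form eight batches, and the reported inter-model accuracy range narrows from 52 percentage points with simultaneous delivery to 5 with batch delivery.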

Abstract

Large language models (LLMs) are increasingly incorporated into medical education and clinical learning environments. While prior studies have focused on model accuracy on licensing-style examinations, less attention has been paid to the stability and reproducibility of LLM clinical reasoning under varying input structures, an issue central to safe educational and clinical deployment.

This study examined how question delivery structure influences performance stability, inter-model variability, and reproducibility of clinical reasoning across multiple contemporary LLMs, using pediatric residency-level multiple-choice questions (MCQs).

A standardized, evidence-based prompt was used to generate advanced-level pediatric USMLE Step 2/3-style MCQs emphasizing diagnostic reasoning, management decisions, and ethical judgment. One hundred draft MCQs generated with the standardized prompt were randomly selected and independently reviewed by three pediatric physicians for medical accuracy, clinical realism, subspecialty relevance, and adherence to USMLE formatting. Twenty-three questions were excluded by unanimous consensus, yielding a validated set of 77 MCQs. Six publicly available LLMs (ChatGPT, DeepSeek AI, Gemini, Microsoft Copilot, Perplexity AI, and OpenEvidence; October-December 2025 versions) were evaluated under two delivery conditions: (1) simultaneous presentation of all questions and (2) sequential delivery in batches of ten. Accuracy and inter-model variability were compared using paired t-tests and one-way ANOVA.

When all questions were presented simultaneously, model accuracy varied widely (38%-90%), with significant inter-model differences, indicating poor reproducibility. In contrast, batch delivery resulted in marked convergence of performance across models (83%-88%), with no statistically significant inter-model differences. Sequential delivery in batches of ten substantially reduced performance dispersion and instability across all evaluated systems.
LLM clinical reasoning performance is highly sensitive to input structure. Reducing contextual load through structured batch delivery improves reproducibility and minimizes inter-model variability, independent of model architecture. These findings suggest that prompt structure, rather than model selection alone, is a critical determinant of reliable LLM behavior and should be explicitly considered in the design of AI-supported medical education and assessment systems, particularly when LLMs are used as formative learning tools or clinical reasoning aids.

Clinical Trial Number. The protocol was reviewed by the Office of the IRB at Good Samaritan University Hospital and determined to be exempt.
