ABSTRACT
Importance Rigorous evaluation of large language models (LLMs) in pediatric diagnosis using authentic clinical presentations remains limited, particularly regarding response consistency and rare disease recognition.
Objective To evaluate the diagnostic accuracy, consistency, and clinical usability of LLMs as diagnostic support tools in pediatric medicine compared with human clinicians using real‐world cases.
Methods This cross‐sectional study at Sant Joan de Déu Barcelona Children's Hospital evaluated four LLMs (DxGPT/GPT‐4 [0613], Claude‐3.5 Sonnet, GPT‐4o [0513], and o1‐preview) against 78 pediatric clinicians using 50 real clinical cases (25 rare diseases, 25 common conditions) from a single tertiary pediatric center. All cases were presented as Spanish intake‐style clinical summaries. Each case was queried three times per LLM and evaluated by clinicians with different experience levels. Performance was assessed using Top‐1 and Top‐5 diagnostic accuracy, response consistency (intraclass correlation coefficient), and qualitative evaluation. Extended clinical information was provided for 20 cases to assess diagnostic efficiency.
Results Advanced LLMs significantly outperformed clinicians in diagnostic accuracy. o1‐preview and Claude‐3.5 Sonnet achieved mean Top‐1 accuracies of 60.0% and 59.0%, respectively, compared with 48.2% for clinicians (odds ratios [ORs]: 2.99 and 2.75, respectively; both P < 0.001). Performance advantages were most pronounced for rare diseases, where o1‐preview had 6‐fold higher odds of a correct Top‐5 diagnosis than clinicians (OR: 6.00, P < 0.001). Extended clinical information improved the accuracy of both groups, particularly for rare diseases. Human‐artificial intelligence complementarity analysis showed 94.3% union accuracy with o1‐preview, a 10‐percentage‐point uplift over clinicians alone. Clinicians rated DxGPT favorably (mean, 3.9/5), particularly for rare case support (4.1/5).
Interpretation In this proof‐of‐concept study at a reference care center, newer LLMs outperformed earlier models and human clinicians in complex pediatric diagnostics, particularly for rare diseases. These findings support further evaluation of LLMs as augmentative diagnostic tools in similar settings, with appropriate legal, ethical, and clinical oversight frameworks.
Launes et al. (Wed,) studied this question.
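The accuracy metrics named in the abstract (Top‐1 and Top‐5 diagnostic accuracy, and human‐AI union accuracy) can be illustrated with a minimal sketch. The function names, the exact‐string matching rule, and the toy cases below are assumptions made for illustration only; the abstract does not say how a proposed diagnosis was judged correct, and this is not the authors' scoring procedure.

```python
from typing import Sequence


def top_k_hit(ranked: Sequence[str], truth: str, k: int) -> bool:
    """True if the reference diagnosis is among the first k ranked candidates.

    Assumption: a case-insensitive exact string match counts as correct.
    A real study would more plausibly rely on clinician adjudication.
    """
    candidates = [d.strip().lower() for d in ranked[:k]]
    return truth.strip().lower() in candidates


def top_k_accuracy(differentials: Sequence[Sequence[str]],
                   truths: Sequence[str], k: int) -> float:
    """Fraction of cases whose reference diagnosis appears in the top k."""
    hits = [top_k_hit(r, t, k) for r, t in zip(differentials, truths)]
    return sum(hits) / len(hits)


def union_accuracy(model_hits: Sequence[bool],
                   clinician_hits: Sequence[bool]) -> float:
    """Fraction of cases solved by the model, the clinician, or both."""
    either = [m or c for m, c in zip(model_hits, clinician_hits)]
    return sum(either) / len(either)


# Toy data, invented for illustration (not taken from the study).
differentials = [
    ["kawasaki disease", "scarlet fever", "measles"],
    ["viral gastroenteritis", "appendicitis"],
    ["asthma", "bronchiolitis", "pneumonia"],
    ["migraine", "tension headache"],
]
truths = ["Kawasaki disease", "appendicitis", "cystic fibrosis", "migraine"]

top1 = top_k_accuracy(differentials, truths, k=1)
top5 = top_k_accuracy(differentials, truths, k=5)

# Per-case Top-1 hits for the model, plus hypothetical clinician results.
model_hits = [top_k_hit(r, t, 1) for r, t in zip(differentials, truths)]
clinician_hits = [False, False, True, True]
union = union_accuracy(model_hits, clinician_hits)

print(f"Top-1: {top1:.2f}, Top-5: {top5:.2f}, union: {union:.2f}")
```

In this toy setup, union accuracy exceeds the model's Top‐1 accuracy because the hypothetical clinician solves a case the model misses, which is the kind of complementarity the Results describe.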