What question did this study set out to answer?

The aim is to evaluate the diagnostic accuracy, consistency, and usability of large language models in pediatric diagnostics compared to human clinicians.

March 27, 2026 | Open Access

Large‐language‐models for pediatric diagnosis: Performance evaluation using real‐world clinical notes from common and rare cases

Key Points

Cross-sectional study conducted at a pediatric hospital
Evaluated four large language models against 78 pediatric clinicians
Used 50 real clinical cases including 25 rare diseases and 25 common conditions
Assessment based on Top-1 and Top-5 diagnostic accuracy and response consistency
Advanced large language models significantly outperformed clinicians in diagnostic accuracy
o1-preview achieved mean Top-1 accuracy of 60%, compared to clinicians’ 48.2%
o1-preview showed 6-fold higher Top-5 diagnostic odds for rare diseases compared to clinicians
Clinicians rated DxGPT favorably, especially for support in rare cases
Extended clinical information improved accuracy for both groups, particularly for rare diseases

Abstract

Importance: Rigorous evaluation of large language models (LLMs) in pediatric diagnosis using authentic clinical presentations remains limited, particularly regarding response consistency and rare disease recognition.

Objective: To evaluate the diagnostic accuracy, consistency, and clinical usability of LLMs as diagnostic support tools in pediatric medicine compared with human clinicians using real‐world cases.

Methods: This cross‐sectional study at Sant Joan de Déu Barcelona Children's Hospital evaluated four LLMs (DxGPT/GPT‐4 (0613), Claude‐3.5 Sonnet, GPT‐4o (0513), and o1‐preview) against 78 pediatric clinicians using 50 real clinical cases (25 rare diseases, 25 common conditions) from a single tertiary pediatric center. All cases were presented using Spanish intake‐style clinical summaries. Each case was queried three times per LLM and evaluated by clinicians with different experience levels. Performance was assessed using Top‐1 and Top‐5 diagnostic accuracy, response consistency (intraclass correlation coefficient), and qualitative evaluation. Extended clinical information was provided for 20 cases to assess diagnostic efficiency.

Results: Advanced LLMs significantly outperformed the clinicians in diagnostic accuracy. o1‐preview and Claude‐3.5 Sonnet achieved mean Top‐1 accuracies of 60.0% and 59.0%, respectively, compared with 48.2% for clinicians (odds ratios [ORs]: 2.99 and 2.75, both P < 0.001). Performance advantages were most pronounced for rare diseases, where o1‐preview demonstrated 6‐fold higher Top‐5 diagnostic odds than clinicians (OR: 6.00, P < 0.001). Extended clinical information improved the accuracy of both groups, particularly for rare diseases. Human‐artificial intelligence complementarity analysis revealed 94.3% union accuracy with o1‐preview, representing a 10‐percentage‐point uplift over clinicians alone. Clinicians rated DxGPT favorably (mean, 3.9/5), particularly for rare case support (4.1/5).

Interpretation: In this proof‐of‐concept study at a reference care center, newer LLMs outperformed previous models and human clinicians in complex pediatric diagnostics, particularly for rare diseases. These findings support further evaluation of LLMs as augmentative diagnostic tools in similar settings, with appropriate legal, ethical, and clinical oversight frameworks.
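
For readers unfamiliar with the metrics named in the abstract, the following is a minimal sketch of how Top‐k accuracy and union accuracy are commonly computed. It is not the study's code. The function names, the toy cases, and the exact‐string matching rule are all assumptions made for illustration; the abstract does not say how a diagnosis was judged correct, and a real evaluation would more likely rely on clinical adjudication than on string equality.

```python
from typing import List, Sequence, Tuple

Case = Tuple[Sequence[str], str]  # (ranked differential, reference diagnosis)


def top_k_hit(ranked: Sequence[str], truth: str, k: int) -> bool:
    """True if the reference diagnosis appears among the first k candidates.

    Matching is case-insensitive exact string equality (an illustrative
    assumption, not the study's adjudication method).
    """
    target = truth.strip().lower()
    return any(d.strip().lower() == target for d in ranked[:k])


def top_k_accuracy(cases: Sequence[Case], k: int) -> float:
    """Fraction of cases whose reference diagnosis is within the top k."""
    hits = [top_k_hit(ranked, truth, k) for ranked, truth in cases]
    return sum(hits) / len(hits)


def union_accuracy(model_hits: Sequence[bool], clinician_hits: Sequence[bool]) -> float:
    """Fraction of cases solved by the model, the clinician, or both."""
    solved = [m or c for m, c in zip(model_hits, clinician_hits)]
    return sum(solved) / len(solved)


# Hypothetical toy data, invented for this example only.
toy_cases: List[Case] = [
    (["asthma", "bronchiolitis"], "Bronchiolitis"),
    (["kawasaki disease"], "Kawasaki disease"),
    (["viral gastroenteritis", "appendicitis"], "intussusception"),
]

top1 = top_k_accuracy(toy_cases, k=1)
top5 = top_k_accuracy(toy_cases, k=5)
union = union_accuracy([True, False, False], [False, False, True])
```

Under this definition, union accuracy can never be lower than either party's individual accuracy, which is why it serves as an upper bound on what human and model could achieve together.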
