April 1, 2026 | Open Access

Evaluating large language models for evidence-based clinical question answering

Key Points

Assessed how well large language models answer clinical questions drawn from varying source materials.
Curated a benchmark of more than 20,000 question-answer pairs from systematic reviews and clinical guidelines.
Evaluated GPT-5, GPT-4o-mini, Claude 4, and DeepSeek-v3.
Measured accuracy across evidence source types: structured guidelines, narrative sources, and systematic reviews.
Examined the effect of citation counts and geographic variation on model performance.
Accuracy was highest for structured guidelines (90%), lower for narrative sources (70%), and lowest for systematic reviews (50%–60%).
Higher citation counts for source material correlated with higher accuracy.
Retrieval-augmented generation with the top three PubMed-retrieved articles yielded a 23% accuracy gain.
All models struggled with ambiguous evidence, and these patterns were consistent across models.

Abstract

Large language models show potential in clinical applications, yet their reliability for evidence-based medicine requires rigorous evaluation. We curated a multi-source benchmark of more than 20,000 question-answer pairs from systematic reviews and clinical guidelines to assess the performance of GPT-5, GPT-4o-mini, Claude 4, and DeepSeek-v3. Accuracy was highest with structured guidelines (90%), lower with narrative sources (70%), and lowest with systematic reviews (50%–60%). All models struggled with ambiguous evidence. Higher citation counts for source material correlated with higher accuracy, and performance showed moderate geographic variation. However, accuracy did not vary significantly by publication year or domain prevalence. Retrieval-augmented generation bolstered performance: providing the top three PubMed-retrieved articles yielded a 23% accuracy gain. These patterns were consistent across models, demonstrating that source clarity and targeted retrieval drive performance. We conclude that stratified evaluation and retrieval strategies are essential for ensuring factual alignment and reliability in high-stakes clinical decision-making.
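
The stratified evaluation described in the abstract can be illustrated with a minimal sketch that reports accuracy separately for each evidence source type. The function name `stratified_accuracy`, the record layout, the exact-match scoring rule, and the toy records are assumptions made for illustration; they are not the study's code, scoring method, or data.

```python
from collections import defaultdict


def stratified_accuracy(records):
    """Return accuracy per source type.

    Each record is a (source_type, predicted_answer, gold_answer) tuple.
    Scoring is a case-insensitive exact match, which is an assumption here;
    the study's actual scoring rule is not given in the abstract.
    """
    correct = defaultdict(int)
    total = defaultdict(int)
    for source_type, predicted, gold in records:
        total[source_type] += 1
        if predicted.strip().lower() == gold.strip().lower():
            correct[source_type] += 1
    return {source: correct[source] / total[source] for source in total}


# Toy records invented for illustration only; not data from the study.
toy_records = [
    ("guideline", "yes", "yes"),
    ("guideline", "no", "no"),
    ("systematic_review", "yes", "no"),
    ("systematic_review", "no", "no"),
    ("narrative", "Yes", "yes"),
]

per_source = stratified_accuracy(toy_records)
for source, accuracy in sorted(per_source.items()):
    print(f"{source}: {accuracy:.2f}")
```

Reporting one figure per stratum, rather than a single pooled accuracy, is what makes gaps such as the one between guidelines and systematic reviews visible.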
