What question did this study set out to answer?

This research aims to assess the performance of large language models in extracting insights from biobank data.

April 22, 2026 | Open Access

Benchmarking large language models for extracting biobank-derived insights into health and disease

Key Points

Assessed how well large language models (LLMs) extract insights from biobank data.
Evaluated six LLMs (Gemini 3 Pro, Claude Opus 4.5, Claude Sonnet 4, GPT-5.2, Mistral Large, and DeepSeek V3) on four benchmark tasks.
Used a multi-metric evaluation framework covering six dimensions: coverage, semantic accuracy, factual correctness, domain knowledge, reasoning quality, and biobank specificity.
Compared model outputs against curated UK Biobank references and a random statistical baseline.
All models outperformed the baseline by 16× to 25×, with strong statistical separation (p < 0.001).
Gemini 3 Pro achieved the highest overall accuracy on tasks such as keyword synthesis, institution recognition, and topic inference.
Claude Sonnet 4 showed the most uniform performance across the evaluation dimensions.

Abstract

Biobank-scale datasets such as the UK Biobank have become foundational resources for biomedical discovery. Yet the complexity and heterogeneity of these resources, which span genomics, imaging, clinical records, and metadata, pose substantial barriers to access and interpretation. Large language models (LLMs) offer a promising way to make such datasets more navigable through natural language interfaces. However, the extent to which current general-purpose LLMs can retrieve and synthesize biobank-specific insights has not yet been systematically evaluated. In this study, we present a reproducible, multi-metric evaluation framework for benchmarking leading LLMs. We evaluated six models (Gemini 3 Pro, Claude Opus 4.5, Claude Sonnet 4, GPT-5.2, Mistral Large, and DeepSeek V3) on four benchmark tasks designed to assess biobank-related knowledge retrieval. We scored model performance across six dimensions (coverage, semantic accuracy, factual correctness, domain knowledge, reasoning quality, and biobank specificity) and assessed output consistency against curated UK Biobank references and a robust random baseline. All models outperformed the baseline by 16× to 25×, with strong statistical separation (p < 0.001), confirming meaningful biobank-specific knowledge retrieval. Gemini 3 Pro achieved the highest overall accuracy across tasks such as keyword synthesis, institution recognition, and topic inference, while Claude Sonnet 4 demonstrated the most uniform performance across evaluation dimensions. Our benchmark provides a rigorous framework for evaluating LLMs in biomedical settings. Using the UK Biobank as a real-world testbed, we highlight both the capabilities and the limitations of current models by measuring their capacity to recall structured biomedical knowledge consistent with authoritative biobank metadata.
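
To make the baseline comparison concrete, the following minimal sketch shows one way a fold improvement over a random baseline, and an empirical p-value, could be computed for a single keyword-coverage metric. It is an illustration under stated assumptions, not the study's actual pipeline: the function names, the toy vocabulary, and every number in it are hypothetical, and the study's real scoring procedure may differ.

```python
import random

def coverage(predicted, reference):
    """Fraction of reference keywords that also appear in the predicted set."""
    reference = set(reference)
    if not reference:
        return 0.0
    return len(set(predicted) & reference) / len(reference)

def random_baseline(vocabulary, reference, k, trials=2000, seed=0):
    """Coverage scores from drawing k keywords uniformly at random (assumed baseline design)."""
    rng = random.Random(seed)
    vocab = sorted(set(vocabulary))
    return [coverage(rng.sample(vocab, k), reference) for _ in range(trials)]

def fold_improvement(model_score, baseline_scores):
    """Ratio of the model score to the mean baseline score."""
    mean_baseline = sum(baseline_scores) / len(baseline_scores)
    return float("inf") if mean_baseline == 0 else model_score / mean_baseline

def empirical_p_value(model_score, baseline_scores):
    """One-sided empirical p-value with the usual +1 correction."""
    hits = sum(s >= model_score for s in baseline_scores)
    return (hits + 1) / (len(baseline_scores) + 1)

# Toy data only: 100 placeholder terms, 10 of which form the curated reference.
vocabulary = [f"term_{i}" for i in range(100)]
reference = vocabulary[:10]
# Hypothetical model output: 8 reference terms plus 2 unrelated terms.
model_output = vocabulary[:8] + vocabulary[50:52]

score = coverage(model_output, reference)
baseline = random_baseline(vocabulary, reference, k=len(model_output))
fold = fold_improvement(score, baseline)
p_value = empirical_p_value(score, baseline)

print(f"coverage={score:.2f}  fold over baseline={fold:.1f}x  p={p_value:.4f}")
```

In this toy setup the random baseline recovers about one reference keyword in ten, so the fold improvement reflects how far the model's coverage sits above chance. The smallest p-value the estimate can report is bounded by the number of baseline trials.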
