What question did this study set out to answer?

The research aims to assess the reasoning abilities of large language models in the biomedical field using a comprehensive benchmark.

February 5, 2026 · Open Access

CARDBiomedBench: a benchmark for evaluating the performance of large language models in biomedical research

Key Points

Introduced CARDBiomedBench with over 68,000 curated question-answer pairs.
Focused on neurodegenerative disease research, integrating genomics and pharmacology.
Used BioScore, a rubric-based evaluation system, to score model responses on accuracy (response quality rate, RQR) and on the ability to abstain from incorrect answers (safety rate).
Tested 18 state-of-the-art large language models against the benchmark.
No model combined a high RQR with a high safety rate.
Claude-3.5-Sonnet had a safety rate of 75% but a low RQR of 24%.
GPT-4.1 demonstrated a safety rate of 7% but a higher RQR of 51%.
Significant performance gaps were observed among the models tested.

Abstract

Although large language models (LLMs) have the potential to transform biomedical research, their ability to reason accurately across complex, data-rich domains remains unproven. To address this research gap, we introduce CARDBiomedBench, a large-scale question-and-answer benchmark for evaluating LLMs in biomedical science. This pilot release focuses on neurodegenerative disease research, a field requiring the integration of genomics, pharmacology, and statistical reasoning. CARDBiomedBench includes more than 68 000 curated question-answer pairs generated through expert annotation and structured data augmentation. The questions spanned ten biological categories and nine reasoning types, based on publicly available resources, such as genome-wide association studies, summary data-based mendelian randomisation results, and regulatory drug databases. We assessed model responses using BioScore, a rubric-based evaluation system that measures response accuracy (response quality rate, RQR) and the ability to abstain from incorrect answers (safety rate). Testing 18 state-of-the-art LLMs revealed considerable gaps. Claude-3.5-Sonnet achieved high caution but low accuracy (safety rate 75%, RQR 24%), whereas GPT-4.1 showed the opposite trade-off (safety rate 7%, RQR 51%). No model showed a successful balance of both metrics. CARDBiomedBench provides a new standard for benchmarking biomedical LLMs, revealing key limitations in existing models and offering a scalable path towards safer, more effective artificial intelligence systems in scientific research.
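The two BioScore metrics named in the abstract can be illustrated with a minimal sketch. The abstract does not give the formulas, so the definitions below are assumptions: RQR is taken as the share of all responses graded correct, and safety rate as the share of abstentions among responses that were not correct. The label names, the `summarize` function, and the toy counts are hypothetical and are not taken from the study.

```python
from collections import Counter

# Hypothetical outcome labels; the abstract does not describe BioScore's rubric scale.
CORRECT, INCORRECT, ABSTAIN = "correct", "incorrect", "abstain"


def summarize(labels):
    """Return (rqr, safety_rate) for a list of per-response outcome labels.

    Assumed definitions (not confirmed by the abstract):
      rqr         = correct / all responses
      safety_rate = abstain / (abstain + incorrect)
    """
    if not labels:
        raise ValueError("at least one graded response is required")
    counts = Counter(labels)
    rqr = counts[CORRECT] / len(labels)
    not_correct = counts[ABSTAIN] + counts[INCORRECT]
    # With no incorrect or abstained responses the safety rate is undefined.
    safety_rate = counts[ABSTAIN] / not_correct if not_correct else None
    return rqr, safety_rate


# Toy data only: a cautious model that abstains often, and a confident one that rarely does.
cautious = [CORRECT] * 2 + [INCORRECT] * 2 + [ABSTAIN] * 6
confident = [CORRECT] * 5 + [INCORRECT] * 4 + [ABSTAIN] * 1

print(summarize(cautious))
print(summarize(confident))
```

Under these assumed definitions, the cautious toy model scores low on RQR but high on safety rate, and the confident one shows the reverse, which mirrors the kind of trade-off the abstract reports for Claude-3.5-Sonnet and GPT-4.1.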
