Abstract

Large Language Models (LLMs) are increasingly deployed in medicine. However, their utility for non-generative clinical prediction is under-evaluated, and they are often assumed to be inferior to specialized models, which creates potential for misuse and misunderstanding. To address this, our ClinicRealm benchmark systematically evaluates 15 GPT-style LLMs, 5 BERT-style models, and 11 traditional methods on unstructured clinical notes and structured Electronic Health Records (EHRs) along dimensions including predictive performance, reasoning, and fairness. Our findings reveal a significant shift: on clinical notes, leading zero-shot LLMs (e.g., DeepSeek-V3.1-Think, GPT-5) now decisively outperform fine-tuned BERT models. On structured EHRs, specialized models excel when data are ample, but advanced LLMs demonstrate potent zero-shot capabilities and often surpass conventional models in data-scarce settings. Notably, leading open-source LLMs match or exceed their proprietary counterparts. This provides compelling evidence that modern LLMs are competitive tools for clinical prediction, necessitating a re-evaluation of model selection strategies by health data scientists and developers.
Yinghao Zhu
Junyi Gao
Zixiang Wang
npj Digital Medicine
Zhu et al. studied this question.
www.synapsesocial.com/papers/69d895d86c1944d70ce06f67 — DOI: https://doi.org/10.1038/s41746-026-02539-z