What question did this study set out to answer?

This research aims to assess the performance of large language models compared to traditional methods in clinical prediction tasks.

April 10, 2026 · Open Access

ClinicRealm: Re-evaluating large language models with conventional machine learning for non-generative clinical prediction tasks

Key Points

Benchmark analysis of 15 GPT-style and 5 BERT-style models
Evaluation of 11 conventional machine learning methods
Use of unstructured clinical notes and structured Electronic Health Records for testing
Focus on predictive performance, reasoning, and fairness metrics
Leading zero-shot LLMs outperform fine-tuned BERT models on clinical notes
Advanced LLMs show strong zero-shot performance in data-scarce environments
Open-source LLMs perform comparably to proprietary models in clinical tasks

Abstract

Large Language Models (LLMs) are increasingly deployed in medicine. However, their utility for non-generative clinical prediction is under-evaluated, and they are often assumed to be inferior to specialized models, creating potential for misuse and misunderstanding. To address this, our ClinicRealm benchmark systematically evaluates 15 GPT-style LLMs, 5 BERT-style models, and 11 traditional methods on unstructured clinical notes and structured Electronic Health Records (EHR), along dimensions including predictive performance, reasoning, and fairness. Our findings reveal a significant shift: on clinical notes, leading zero-shot LLMs (e.g., DeepSeek-V3.1-Think, GPT-5) now decisively outperform fine-tuned BERT models. On structured EHRs, while specialized models excel with ample data, advanced LLMs demonstrate potent zero-shot capabilities, often surpassing conventional models in data-scarce settings. Notably, leading open-source LLMs match or exceed their proprietary counterparts. This provides compelling evidence that modern LLMs are competitive tools for clinical prediction, necessitating a re-evaluation of model selection strategies by health data scientists and developers.