What question did this study set out to answer?

The study aims to assess the effectiveness of large language models for nutrition risk screening using EHR data.

January 20, 2026 | Open Access

Evaluation of large language models in nutrition risk screening: a comparative analysis across 8 LLMs based on real-world EHR datasets

Key Points

The study aims to assess the effectiveness of large language models for nutrition risk screening using EHR data.
Compared eight large language models based on real-world EHR datasets.
Developed structured prompts and scoring criteria through expert collaboration.
Evaluated models on accuracy, consistency, and processing efficiency using 592 real-world inpatient EHRs.
Five of the eight models achieved a correct assessment rate above 90% in binary nutritional risk classification.
The top model (DeepSeek-R1-671B) reached 99.16% accuracy in binary risk classification.
Total score accuracy varied widely (54.73% to 95.60%); among domain-specific assessments, serum albumin and age scored near-perfect across models.

Abstract

Nutrition risk screening (NRS) is a critical step in the early identification of malnutrition among hospitalized patients. Traditional methods, which rely on manual assessment of electronic health records (EHRs) with tools such as Nutrition Risk Screening 2002 (NRS-2002), are time-consuming and often yield inaccurate results. Large language models (LLMs) could automate this process, but their capabilities in this scenario remain underexplored. A multidisciplinary expert group developed standardized scoring criteria and structured prompts using prompt engineering techniques, optimizing them on an 80-case prompt development cohort. Eight advanced LLMs with different architectures, parameter scales, and openness levels were evaluated on 592 real-world inpatient EHRs. Each model independently assessed every case twice with a uniform structured prompt, determined nutritional risk, and generated reasoning outputs decomposed into total and domain-level scores for nutritional status, disease severity, and age. Model performance was assessed along three dimensions: accuracy (risk-specific, total-specific, and domain-specific correct assessment rate, CrAR), consistency (consistent assessment rate, CsAR), and efficiency (processing time). Using a structured prompt, five of the eight LLMs achieved over 90% CrAR in binary nutritional risk classification, with the top model (DeepSeek-R1-671B) reaching 99.16%. Total score CrAR varied widely (54.73% to 95.60%), while domain-specific CrAR was highest in nutritional status, with serum albumin and age scoring near-perfect across models. Disease severity was more challenging, showing greater inter-model variability in CrAR. LLMs with larger parameter scales demonstrated higher accuracy and repeatability, with Cohen’s κ up to 0.99, whereas smaller LLMs such as Qwen3-8B showed marked declines (κ = 0.69). Domain-level consistency was particularly strong in structured subdomains (albumin and age), while subdomains requiring complex clinical inference (disease burden) yielded lower consistency. Qwen3-235B-A22B-Thinking-2507 was the slowest (60.7 s/case); smaller LLMs responded faster but with lower accuracy. LLMs guided by structured prompts can effectively perform automated NRS, with larger-parameter models achieving near-expert reliability. These findings support the integration of LLMs into clinical workflows, especially in settings with limited human resources. Future work should explore fine-tuning smaller LLMs for greater deployment efficiency while maintaining diagnostic robustness, as well as expanding applications of LLMs to broader clinical decision-making tasks in support of health equity.
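
To make the evaluation metrics concrete, the following is a minimal Python sketch of how CrAR, CsAR, and Cohen’s κ could be computed for the binary risk label. It is an illustration under stated assumptions, not the study's actual pipeline: the function names and the toy data are hypothetical, CrAR is assumed to be the share of model assessments matching the expert reference, and CsAR is assumed to be the share of cases on which a model's two runs agree. The scoring helper follows the standard NRS-2002 rule, where the total is the sum of the nutritional status score (0 to 3), the disease severity score (0 to 3), and one point for age 70 or older, with a total of 3 or more indicating nutritional risk.

```python
from collections import Counter

RISK_THRESHOLD = 3  # NRS-2002: a total score >= 3 indicates nutritional risk


def total_score(nutritional_status, disease_severity, age_years):
    """NRS-2002 total: two 0-3 domain scores plus 1 point for age >= 70."""
    return nutritional_status + disease_severity + (1 if age_years >= 70 else 0)


def at_risk(score):
    """Binary risk label derived from the total score."""
    return score >= RISK_THRESHOLD


def correct_assessment_rate(predicted, reference):
    """Assumed CrAR: fraction of predictions equal to the expert reference."""
    return sum(p == r for p, r in zip(predicted, reference)) / len(reference)


def consistent_assessment_rate(run1, run2):
    """Assumed CsAR: fraction of cases where two runs give the same answer."""
    return sum(a == b for a, b in zip(run1, run2)) / len(run1)


def cohen_kappa(run1, run2):
    """Cohen's kappa: agreement between two runs, corrected for chance."""
    n = len(run1)
    observed = sum(a == b for a, b in zip(run1, run2)) / n
    counts1, counts2 = Counter(run1), Counter(run2)
    labels = set(counts1) | set(counts2)
    expected = sum(counts1[label] * counts2[label] for label in labels) / (n * n)
    if expected == 1:  # both runs used a single identical label throughout
        return 1.0
    return (observed - expected) / (1 - expected)


# Hypothetical toy data: expert labels and two runs of one model on six cases.
reference = [True, True, False, False, True, False]
run1 = [True, True, False, True, True, False]
run2 = [True, True, False, False, True, False]

print(f"CrAR (run 1): {correct_assessment_rate(run1, reference):.2%}")
print(f"CsAR:         {consistent_assessment_rate(run1, run2):.2%}")
print(f"Cohen kappa:  {cohen_kappa(run1, run2):.2f}")
```

The same functions apply unchanged to total scores or domain-level scores, since they only compare labels for equality. How the study pooled the two runs when reporting CrAR is not stated in the abstract, so the sketch scores a single run against the reference.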
