Nutrition risk screening (NRS) is a critical step in the early identification of malnutrition among hospitalized patients. Traditional methods, which rely on manual assessment of electronic health records (EHRs) with tools such as Nutrition Risk Screening 2002 (NRS-2002), are time-consuming and often inaccurate. Large language models (LLMs) could automate this process; however, their capabilities in this setting remain underexplored. A multidisciplinary expert group developed standardized scoring criteria and structured prompts using prompt engineering techniques, and optimized them on an 80-case prompt development cohort. Eight advanced LLMs with different architectures, parameter scales, and openness levels were evaluated on 592 real-world inpatient EHRs. Each model independently assessed every case twice with a uniform structured prompt, determined nutritional risk, and generated reasoning outputs decomposed into total and domain-level scores for nutritional status, disease severity, and age. Model performance was assessed across multiple dimensions, including accuracy (risk-specific, total-specific, and domain-specific correct assessment rate, CrAR), consistency (consistent assessment rate, CsAR), and efficiency (processing time). Using a structured prompt, five of eight LLMs achieved over 90% CrAR in binary nutritional risk classification, with the top-performing model (DeepSeek-R1-671B) reaching 99.16%. Total-score CrAR varied widely (54.73% to 95.60%), while domain-specific CrAR was highest for nutritional status, with serum albumin and age scored near-perfectly across models. Disease severity was more challenging to score and showed greater inter-model variability. LLMs with larger parameter scales demonstrated higher accuracy and repeatability, with Cohen’s κ up to 0.99, whereas smaller LLMs such as Qwen3-8B showed marked declines (κ = 0.69).
Domain-level consistency was particularly strong in structured subdomains (albumin and age), while subdomains requiring complex clinical inference (disease burden) yielded lower consistency. Qwen3-235B-A22B-Thinking-2507 was the slowest (60.7 s/case), whereas smaller LLMs responded faster but with lower accuracy. LLMs guided by structured prompts can effectively perform automated NRS, with larger-parameter-scale models achieving near-expert reliability. These findings support the integration of LLMs into clinical workflows, especially in settings with limited human resources. Future work should explore fine-tuning smaller LLMs for greater deployment efficiency while maintaining diagnostic robustness, as well as expanding applications of LLMs to broader clinical decision-making tasks in support of health equity.
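The evaluation metrics described above can be illustrated with a minimal Python sketch. It is not the study's code. The names `DomainScores`, `agreement_rate`, and `cohen_kappa` are hypothetical, and the toy labels are invented for illustration only. The sketch assumes the standard NRS-2002 convention that the total score is the sum of the nutritional status (0 to 3), disease severity (0 to 3), and age (1 if 70 years or older) scores, with a total of 3 or more indicating nutritional risk. It also assumes that CrAR is the fraction of cases matching the expert reference and that CsAR is the fraction of cases on which the two runs of one model agree; the study's exact definitions may differ.

```python
from dataclasses import dataclass

RISK_THRESHOLD = 3  # NRS-2002 convention: total score >= 3 means "at nutritional risk"


@dataclass(frozen=True)
class DomainScores:
    """Hypothetical container for one assessment's domain-level scores."""
    nutritional_status: int  # 0-3
    disease_severity: int    # 0-3
    age: int                 # 1 if the patient is >= 70 years old, else 0

    @property
    def total(self) -> int:
        return self.nutritional_status + self.disease_severity + self.age

    @property
    def at_risk(self) -> bool:
        return self.total >= RISK_THRESHOLD


def agreement_rate(a, b):
    """Fraction of positions where two equally long label sequences agree.

    Assumed use: CrAR when `a` is the expert reference, CsAR when `a` and `b`
    are two runs of the same model.
    """
    if len(a) != len(b) or not a:
        raise ValueError("sequences must be non-empty and of equal length")
    return sum(x == y for x, y in zip(a, b)) / len(a)


def cohen_kappa(a, b):
    """Cohen's kappa: (p_o - p_e) / (1 - p_e) for two label sequences."""
    n = len(a)
    p_o = agreement_rate(a, b)
    labels = set(a) | set(b)
    # Chance agreement from each sequence's marginal label frequencies.
    p_e = sum((list(a).count(l) / n) * (list(b).count(l) / n) for l in labels)
    if p_e == 1.0:  # both sequences use one identical label throughout
        return 1.0
    return (p_o - p_e) / (1 - p_e)


# Toy data (invented): binary risk labels for six cases.
expert = [True, True, False, False, True, False]
run1 = [True, True, False, True, True, False]
run2 = [True, True, False, False, True, False]

print("risk CrAR, run 1:", agreement_rate(expert, run1))
print("risk CsAR:", agreement_rate(run1, run2))
print("Cohen's kappa between runs:", cohen_kappa(run1, run2))
print("total score:", DomainScores(2, 1, 1).total)
```

The same `agreement_rate` helper can be applied to total scores or to a single domain's scores instead of the binary risk label, which corresponds to the total-specific and domain-specific variants of CrAR mentioned above.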
Gu et al. (Sat,) studied this question.