What question did this study set out to answer?

The study evaluates whether incorporating machine learning-derived risk data improves the quality of clinical support information generated by large language models.

April 23, 2026 | Open Access

Enhancing large language model clinical support information with machine learning risk and explainability: a feasibility study

Key Points

The study aims to evaluate if incorporating machine learning-derived risk data can improve the quality of clinical support information from large language models.
Analyzed retrospective data from MIMIC-IV v3.1 ICU admissions using an XGBoost model to assess mortality risk.
Derived SHAP values from the XGBoost model to explain how each clinical predictor contributed to the predicted risk.
Compared the performance of GPT-4o against seven other LLMs using the IMPACT framework for quality assessment.
Augmenting GPT-4o inputs with predicted mortality risk and SHAP values significantly increased its IMPACT scores across prompting strategies.
Claude 3.7 Sonnet, used as an automated evaluator, showed excellent agreement with human IMPACT ratings (ICC 0.979, 95% CI 0.973–0.984).
GPT-5 mini (96.0) and gpt-oss-120B (93.4) outperformed GPT-4o (90.4; both p < 0.001) for interpretability and quality.
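
Agreement between an automated evaluator and human raters, as in the key point on Claude 3.7 Sonnet, is typically summarized with an intraclass correlation coefficient. The summary does not say which ICC form the study used, so the sketch below assumes the two-way random-effects, absolute-agreement, single-rater form, often written ICC(2,1). The function name `icc_2_1` and the score matrix are illustrative only and are not taken from the study.

```python
def icc_2_1(ratings: list[list[float]]) -> float:
    """ICC(2,1): two-way random effects, absolute agreement, single rater.

    `ratings` has one row per rated item and one column per rater.
    Assumption: this ICC form is chosen for illustration; the study's
    exact form is not stated in the summary.
    """
    n = len(ratings)
    if n < 2:
        raise ValueError("need at least two rated items")
    k = len(ratings[0])
    if k < 2 or any(len(row) != k for row in ratings):
        raise ValueError("need a complete matrix with at least two raters")

    grand = sum(sum(row) for row in ratings) / (n * k)
    row_means = [sum(row) / k for row in ratings]
    col_means = [sum(row[j] for row in ratings) / n for j in range(k)]

    # Two-way ANOVA sums of squares: items (rows), raters (columns), residual.
    ss_rows = k * sum((m - grand) ** 2 for m in row_means)
    ss_cols = n * sum((m - grand) ** 2 for m in col_means)
    ss_total = sum((x - grand) ** 2 for row in ratings for x in row)
    ss_error = ss_total - ss_rows - ss_cols

    ms_rows = ss_rows / (n - 1)
    ms_cols = ss_cols / (k - 1)
    ms_error = ss_error / ((n - 1) * (k - 1))

    return (ms_rows - ms_error) / (
        ms_rows + (k - 1) * ms_error + k * (ms_cols - ms_error) / n
    )


# Illustrative scores only: 6 items rated by 4 raters.
example_scores = [
    [9, 2, 5, 8],
    [6, 1, 3, 2],
    [8, 4, 6, 8],
    [7, 1, 2, 6],
    [10, 5, 6, 9],
    [6, 2, 4, 7],
]
example_icc = icc_2_1(example_scores)
```

Because this form measures absolute agreement, a rater who is consistently stricter or more lenient than the others lowers the coefficient even when the ranking of items is identical.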

Abstract

Background: Current machine learning (ML) prediction models offer limited guidance for individualized actionable management. Large language models (LLMs) can transform ML model-predicted risk estimates with Shapley Additive Explanations (SHAP) into clinically meaningful support information, yet the added value of incorporating ML-derived data and the relative performance of different LLMs remain uncertain. To address these gaps, we used our previously developed IMPACT framework to evaluate the quality of LLM-generated outputs.

Methods: In this retrospective analysis of MIMIC-IV v3.1 intensive care unit (ICU) admissions, we applied a previously developed XGBoost model to estimate ICU mortality risk and derive corresponding SHAP values. GPT-4o transformed the predicted mortality risk, clinical predictors, and their SHAP values into risk interpretation, recommended examinations and management. The primary analysis examined whether augmenting LLM inputs with predicted mortality risk and SHAP values improved clinical response quality, as assessed by the IMPACT framework. We further compared GPT-4o with seven contemporary LLMs; all eight models generated clinical support responses that were scored by Claude 3.7 Sonnet to assess performance differences.

Results: Claude 3.7 Sonnet showed excellent agreement with human IMPACT ratings (intraclass correlation coefficient [ICC] 0.979, 95% CI 0.973–0.984) and o3-mini (ICC 0.971, 95% CI 0.964–0.980). In the primary analysis, adding predicted ICU mortality risk and SHAP values significantly increased GPT-4o IMPACT scores across prompting strategies. GPT-5 mini (96.0) and gpt-oss-120B (93.4) outperformed GPT-4o (90.4; both p < 0.001) for interpretability and quality.

Conclusions: Combining ML-derived risk, SHAP explanations and LLMs may modestly improve ICU clinical support information, while LLM-based evaluators demonstrated feasibility for scalable evaluation of generated clinical content.
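
The augmentation step described in the abstract, passing the predicted mortality risk, clinical predictors and their SHAP values to an LLM, can be sketched roughly as follows. This is a minimal illustration and not the authors' prompt: the `Predictor` structure, the `build_prompt` function, the prompt wording and the example values are all hypothetical. In practice the risk and SHAP values would come from the fitted XGBoost model rather than being typed in by hand.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Predictor:
    """One clinical predictor with its value and SHAP contribution (hypothetical structure)."""
    name: str
    value: float
    shap_value: float  # positive pushes predicted risk up, negative pushes it down


def build_prompt(risk: float, predictors: list[Predictor], top_k: int = 3) -> str:
    """Assemble an LLM prompt from a predicted risk and the top-k SHAP contributors."""
    if not 0.0 <= risk <= 1.0:
        raise ValueError("risk must be a probability in [0, 1]")

    # Rank predictors by absolute SHAP magnitude, largest contribution first.
    ranked = sorted(predictors, key=lambda p: abs(p.shap_value), reverse=True)[:top_k]

    lines = [
        f"Predicted ICU mortality risk: {risk:.1%}",
        "Top contributing predictors (value, SHAP):",
    ]
    for p in ranked:
        if p.shap_value > 0:
            direction = "raises"
        elif p.shap_value < 0:
            direction = "lowers"
        else:
            direction = "does not change"
        lines.append(f"- {p.name} = {p.value:g} (SHAP {p.shap_value:+.2f}, {direction} risk)")

    # Requested output sections mirror those named in the abstract.
    lines.append("Provide: (1) risk interpretation, (2) recommended examinations, (3) management.")
    return "\n".join(lines)


# Made-up example values for illustration; not data from the study.
example_prompt = build_prompt(
    0.42,
    [
        Predictor("lactate", 3.8, 0.31),
        Predictor("age", 71, 0.05),
        Predictor("GCS", 14, -0.12),
    ],
    top_k=2,
)
```

Ranking by absolute SHAP value keeps the prompt short while still surfacing predictors that lower risk, which a ranking by signed value would drop.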
