What question did this study set out to answer?

The aim is to evaluate few-shot prompting strategies for improving cardiovascular disease risk predictions using large language models.

April 18, 2026 | Open Access

Few-shot prompting strategies for improving large language model-based cardiovascular disease risk prediction

Key Points

The aim is to evaluate few-shot prompting strategies for improving cardiovascular disease risk predictions using large language models.
Used 352 de-identified MIMIC-III/IV records.
Compared outputs of GPT-4.1, GPT-4o, and Qwen3-4B with two established CVD risk calculators, the Framingham Risk Score (FRS) and the Atherosclerotic Cardiovascular Disease (ASCVD) equations.
Implemented zero-shot and few-shot prompting along with random and similarity-based exemplar selection.
Assessed the impact of chain-of-thought reasoning on model performance.
With 40 examples and chain-of-thought reasoning enabled, GPT-4.1 achieved an AUPRC of 0.951 and an F1-score of 0.85.
Within the white-cohort similarity analysis, five similarity-selected exemplars matched or outperformed 20 randomly selected examples across error and discrimination metrics.
Few-shot prompting substantially improved calculator alignment for GPT-4.1 and GPT-4o.
Qwen3-4B showed weaker gains in calculator alignment.
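
To make the similarity-based exemplar selection mentioned above concrete, the following is a minimal sketch, not the study's implementation. It assumes each record is a dictionary of numeric features, ranks the exemplar pool by Euclidean distance in standard-deviation-scaled feature space, and assembles a few-shot prompt from the nearest records. The feature names, the distance measure, the prompt wording, and the toy values are all hypothetical; the study's actual features, similarity measure, and prompt template are not described here.

```python
import math

# Hypothetical feature names, chosen only for illustration.
FEATURES = ["age", "systolic_bp", "total_chol", "hdl_chol"]


def feature_scales(pool):
    """Per-feature standard deviation over the pool, used to put features on a common scale."""
    scales = {}
    for f in FEATURES:
        vals = [r[f] for r in pool]
        mean = sum(vals) / len(vals)
        std = math.sqrt(sum((v - mean) ** 2 for v in vals) / len(vals))
        scales[f] = std or 1.0  # guard against zero variance
    return scales


def distance(a, b, scales):
    """Euclidean distance between two records after scaling each feature."""
    return math.sqrt(sum(((a[f] - b[f]) / scales[f]) ** 2 for f in FEATURES))


def select_exemplars(query, pool, k=5):
    """Return the k pool records most similar to the query, nearest first."""
    scales = feature_scales(pool)
    return sorted(pool, key=lambda r: distance(query, r, scales))[:k]


def describe(record):
    """Render a record's features as a short comma-separated string."""
    return ", ".join(f"{f}={record[f]}" for f in FEATURES)


def build_prompt(query, exemplars):
    """Assemble a few-shot prompt: labelled exemplars first, then the unlabelled query."""
    parts = ["Estimate the 10-year CVD risk (%) for the final patient."]
    for ex in exemplars:
        parts.append(f"Patient: {describe(ex)}\nRisk: {ex['risk']}%")
    parts.append(f"Patient: {describe(query)}\nRisk:")
    return "\n\n".join(parts)


# Synthetic toy records (not study data); "risk" stands in for a calculator output.
pool = [
    {"age": 45, "systolic_bp": 120, "total_chol": 180, "hdl_chol": 55, "risk": 4.0},
    {"age": 62, "systolic_bp": 150, "total_chol": 240, "hdl_chol": 38, "risk": 22.0},
    {"age": 60, "systolic_bp": 145, "total_chol": 230, "hdl_chol": 40, "risk": 19.0},
    {"age": 38, "systolic_bp": 115, "total_chol": 170, "hdl_chol": 60, "risk": 2.0},
]
query = {"age": 61, "systolic_bp": 148, "total_chol": 235, "hdl_chol": 39}

chosen = select_exemplars(query, pool, k=2)
prompt = build_prompt(query, chosen)
```

Random selection would replace the sorted ranking with a random sample of the pool, which is the baseline the similarity-based strategy is compared against.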

Abstract

Accurate prediction of cardiovascular disease (CVD) risk enables earlier prevention and better clinical decisions. Conventional models such as the Framingham Risk Score (FRS) and Atherosclerotic Cardiovascular Disease (ASCVD) equations may generalize poorly across diverse populations and incomplete electronic health records (EHRs). In this paper, we present a prompting-based approach that uses few-shot in-context learning to guide large language models (LLMs) in estimating 10-year CVD risk without retraining, offering a data-efficient and privacy-conscious alternative to fine-tuned medical LLM pipelines. Using 352 de-identified MIMIC-III/IV records, we evaluate GPT-4.1, GPT-4o, and Qwen3-4B against FRS and ASCVD outputs under zero-shot and few-shot prompting, random versus similarity-based exemplar selection, and with or without chain-of-thought reasoning. Few-shot prompting substantially improves calculator alignment for GPT-4.1 and GPT-4o, whereas Qwen3-4B shows weaker gains. With 40 examples and reasoning enabled, GPT-4.1 achieves AUPRC 0.951, mean absolute error about 7, root mean squared error about 9, and F1-score 0.85, while GPT-4o performs comparably. Within the white-cohort similarity analysis, five similarity-selected exemplars match or outperform 20 randomly selected examples across error and discrimination metrics, showing that exemplar quality can outweigh quantity under tight context budgets. Overall, these findings indicate that few-shot prompting can closely approximate validated clinical calculators in data-limited settings and can be adapted across institutions and patient populations through exemplar selection rather than retraining. However, clinical utility remains bounded by the strengths and weaknesses of the underlying calculators, and we do not evaluate prediction of observed cardiovascular events.
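
As a rough illustration of how the error and classification metrics reported above can be computed against calculator outputs, the sketch below implements mean absolute error, root mean squared error, and an F1-score for a high-risk class in plain Python. It is a sketch under stated assumptions: the abstract does not say how risk scores were binarized, so the 20% cut-off is hypothetical, the numbers are synthetic toy values rather than study results, and AUPRC is omitted for brevity.

```python
import math


def mae(pred, ref):
    """Mean absolute error between model estimates and calculator outputs."""
    return sum(abs(p - r) for p, r in zip(pred, ref)) / len(ref)


def rmse(pred, ref):
    """Root mean squared error between model estimates and calculator outputs."""
    return math.sqrt(sum((p - r) ** 2 for p, r in zip(pred, ref)) / len(ref))


def f1_at_threshold(pred, ref, threshold):
    """F1 for the high-risk class, where high risk means a score >= threshold."""
    pairs = list(zip(pred, ref))
    tp = sum(1 for p, r in pairs if p >= threshold and r >= threshold)
    fp = sum(1 for p, r in pairs if p >= threshold and r < threshold)
    fn = sum(1 for p, r in pairs if p < threshold and r >= threshold)
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


# Synthetic toy values in percent risk (not study data).
llm_risk = [5.0, 12.0, 25.0, 30.0]         # hypothetical LLM estimates
calculator_risk = [4.0, 10.0, 20.0, 18.0]  # hypothetical calculator outputs
HIGH_RISK_CUTOFF = 20.0                    # assumed cut-off, not taken from the study

toy_mae = mae(llm_risk, calculator_risk)
toy_rmse = rmse(llm_risk, calculator_risk)
toy_f1 = f1_at_threshold(llm_risk, calculator_risk, HIGH_RISK_CUTOFF)
```

Because the reference values here are calculator outputs rather than observed events, these metrics measure agreement with the calculators, which matches the limitation the abstract states.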
