The effects of various prompt engineering on Large Language Models (LLMs) performance in hypertension decision-making are not yet fully understood. We evaluate the impact of different prompt engineering on LLM performance in hypertension treatment decision-making. We conducted a two-stage validation study using 300 de-identified simulated hypertension cases based on real-world clinical scenarios. ChatGPT-4.1 with Guidance-Self-Consistency achieved optimal performance (91.3% accuracy), nearing expert-level competency, while zero-shot prompting yielded worst results (62.7% with DeepSeek-V3). Optimal LLM assistance consistently enhanced physicians’ average accuracy across all levels (community hospital: 73.4% to 82.5%; county hospital: 84.0% to 87.9%; teaching hospital: 91.5% to 92.0%) and reduced inappropriate regimen rates. The worst LLM configurations decreased physician performance below baseline, increasing inappropriate regimen rates from 26.6% to 35.2% across all levels. Effectively designed prompt strategies enable LLMs to provide reliable hypertension treatment recommendations, thereby supporting physicians’ clinical decisions. This study has been trial-registered (ChiCTR2500099307, March 21, 2025).
Li et al. (Wed,) studied this question.