What question did this study set out to answer?

The aim is to improve the evaluation of medical large language models through adaptive prompt selection using reinforcement learning.

February 5, 2026 | Open Access

Deep Reinforcement Learning-Driven Adaptive Prompting for Robust Medical LLM Evaluation

Key Points

Developed a reinforcement learning framework for multi-prompt selection in medical LLM evaluation.
Formulated prompt selection as a Markov Decision Process (MDP).
Implemented a Deep Q-Network (DQN) agent to maximize a multi-objective reward combining textual accuracy, domain terminology coverage, safety, and dialogue relevance.
Evaluated performance on three medical datasets: MKQA, MCQ, and Doctor-Patient Dialogue.
Achieved a 6.66% increase in composite reward on MKQA compared to the Random Baseline.
Reached a Safety score of 1.0000 on MKQA, a 5.26% increase over the Fixed Baseline.
Increased Medical Terminology Coverage by 74.61% on MKQA over the Fixed Baseline.
Observed an accuracy improvement on MKQA, with minor trade-offs in Accuracy and Contextual Relevance in some other contexts.
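
The percentage figures above read most naturally as relative increases over a baseline score. A minimal sketch of that arithmetic follows, assuming the standard definition (new minus baseline, divided by baseline); the function names are illustrative and the back-computed baseline is an inference from the reported numbers, not a value stated in the paper.

```python
def relative_increase(new_score, baseline_score):
    """Relative increase of new_score over baseline_score, in percent."""
    return (new_score - baseline_score) / baseline_score * 100.0


def implied_baseline(new_score, percent_increase):
    """Back-compute the baseline score implied by a reported relative increase."""
    return new_score / (1.0 + percent_increase / 100.0)


# Reported: Safety of 1.0000 on MKQA, a 5.26% increase over the Fixed Baseline.
# Under the relative-increase assumption, the implied Fixed Baseline is about 0.95.
fixed_safety = implied_baseline(1.0000, 5.26)
print(f"implied Fixed Baseline safety: {fixed_safety:.4f}")
print(f"check: {relative_increase(1.0000, 0.95):.2f}% increase")
```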

Abstract

The accurate and reliable evaluation of large language models (LLMs) in medical domains is critical for real-world clinical deployment, automated medical reasoning, and patient safety. However, evaluation results are highly sensitive to prompt design, and the prevalent reliance on fixed or randomly sampled prompt policies often fails to adapt to clinical context, question complexity, or evolving safety requirements. This article presents a novel reinforcement learning-based framework for multi-prompt selection that optimizes the prompt policy per input for medical LLM evaluation across three benchmarks: the Medical Knowledge Question-Answering dataset (MKQA), the Medical Multiple-Choice Question dataset (MCQ), and the Doctor-Patient Dialogue dataset. We formulate prompt selection as a Markov Decision Process (MDP) and employ a Deep Q-Network (DQN) agent to maximize a reward signal incorporating textual accuracy, domain terminology coverage, safety, and dialogue relevance. Experiments on the three benchmarks demonstrate consistent improvements in composite reward over baselines (e.g., a 6.66% increase on MKQA vs. the Random Baseline and a 2.41% increase on Dialogue vs. the Fixed Baseline). These gains were accompanied by improvements in Safety (e.g., a score of 1.0000 on MKQA, a 5.26% increase vs. the Fixed Baseline, and a 5.03% increase on Dialogue vs. the Fixed Baseline) and substantial gains in Medical Terminology Coverage (e.g., a 74.61% increase on MKQA and a 9.13% increase on MCQ, both vs. the Fixed Baseline). Accuracy effects varied across tasks: accuracy improved on MKQA, while minor trade-offs in Accuracy and Contextual Relevance were observed in some contexts, even as the framework effectively optimized the multi-objective reward function.
Our framework enables robust, context-aware, and adaptive evaluation, laying a foundation for safer and more reliable LLM application in healthcare.
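
To make the MDP formulation concrete, the sketch below treats each candidate prompt as an action, a small feature vector describing the input as the state, and a weighted sum of the four named metrics as the reward. It is an illustration under stated assumptions, not the authors' implementation: the prompt names, the equal weights, and the state features are hypothetical, and a linear Q-function stands in for the deep Q-network to keep the example dependency-free.

```python
import random

# Hypothetical prompt templates forming the action space (not from the paper).
PROMPTS = ["direct", "chain_of_thought", "safety_first"]

# Assumed equal weights over the four reward components named in the abstract.
WEIGHTS = {"accuracy": 0.25, "terminology": 0.25, "safety": 0.25, "relevance": 0.25}


def composite_reward(scores):
    """Weighted sum of per-metric scores, each assumed to lie in [0, 1]."""
    return sum(WEIGHTS[name] * scores[name] for name in WEIGHTS)


class LinearQAgent:
    """Epsilon-greedy agent with one linear Q-function per prompt (DQN stand-in)."""

    def __init__(self, n_features, n_actions, lr=0.1, gamma=0.9, epsilon=0.1, seed=0):
        self.w = [[0.0] * n_features for _ in range(n_actions)]
        self.lr, self.gamma, self.epsilon = lr, gamma, epsilon
        self.rng = random.Random(seed)

    def q_values(self, state):
        # Q(s, a) = w_a . s for every action a.
        return [sum(wi * si for wi, si in zip(w, state)) for w in self.w]

    def select(self, state):
        # Explore with probability epsilon, otherwise pick the greedy prompt.
        if self.rng.random() < self.epsilon:
            return self.rng.randrange(len(self.w))
        q = self.q_values(state)
        return q.index(max(q))

    def update(self, state, action, reward, next_state=None):
        # TD target: r if terminal, else r + gamma * max_a' Q(s', a').
        target = reward
        if next_state is not None:
            target += self.gamma * max(self.q_values(next_state))
        td_error = target - self.q_values(state)[action]
        for i, si in enumerate(state):
            self.w[action][i] += self.lr * td_error * si
        return td_error


# Usage: one evaluation step with hypothetical features [bias, normalized length].
agent = LinearQAgent(n_features=2, n_actions=len(PROMPTS))
state = [1.0, 0.5]
action = agent.select(state)
scores = {"accuracy": 0.8, "terminology": 0.6, "safety": 1.0, "relevance": 0.7}
agent.update(state, action, composite_reward(scores))
print(PROMPTS[action], round(composite_reward(scores), 3))
```

A full DQN would replace the per-action weight vectors with a neural network and typically add experience replay and a target network; the selection and temporal-difference update shown here follow the same pattern.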