What question did this study set out to answer?

April 19, 2026

DDRL: Dyna-Based Discriminative Reinforcement Learning for Optimizing Sepsis Treatment Pathways in Offline Environments

Key Points

The study aims to learn automated sepsis treatment policies with reinforcement learning while keeping them aligned with physician practice.
The authors developed the Dyna-Based Discriminative Reinforcement Learning (DDRL) algorithm.
Training used both electronic medical record (EMR) data and simulated treatment episodes.
A Discriminator suppresses the Q-values of out-of-distribution treatments.
DDRL achieved an expected return of 7.29 at Asan Medical Center and 4.55 at Ajou University Hospital.
DDRL outperformed Conservative Q-Learning (CQL) by 3.4% at Asan Medical Center and 5.6% at Ajou University Hospital.
DDRL surpassed the physicians' policies by 18.7% and 8.3% at the same two hospitals, respectively.
Cosine similarity with physician policies was 81.68% at Asan Medical Center and 90.90% at Ajou University Hospital, exceeding CQL by 0.73% and 26.11%, respectively.
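
The cosine-similarity figures above compare two policies as vectors. As a minimal sketch, assuming each policy is summarized as a vector of action frequencies over a discrete treatment space (the summary does not say how the study built these vectors), the computation could look like the following. The function name `policy_cosine_similarity` and the counts are hypothetical and for illustration only.

```python
import math


def policy_cosine_similarity(policy_a, policy_b):
    """Cosine similarity between two equal-length policy vectors."""
    if len(policy_a) != len(policy_b):
        raise ValueError("policy vectors must have the same length")
    dot = sum(x * y for x, y in zip(policy_a, policy_b))
    norm_a = math.sqrt(sum(x * x for x in policy_a))
    norm_b = math.sqrt(sum(y * y for y in policy_b))
    if norm_a == 0.0 or norm_b == 0.0:
        raise ValueError("cosine similarity is undefined for an all-zero vector")
    return dot / (norm_a * norm_b)


# Hypothetical action counts over a 5-action treatment space (not study data).
physician_counts = [120, 40, 25, 10, 5]
agent_counts = [110, 50, 20, 15, 5]
similarity = policy_cosine_similarity(physician_counts, agent_counts)
print(f"cosine similarity: {similarity:.2%}")
```

A value reported as 81.68% corresponds to a cosine similarity of 0.8168 on this scale, where 1.0 means the two vectors point in the same direction.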

Abstract

Optimizing automated sepsis treatment policies using Reinforcement Learning (RL) has gained attention for improving the quality of medical care and addressing physician shortages. However, in an offline setting, an RL agent cannot explore all possible treatment episodes, leading to an overestimation of Q-values for unexplored treatments. This causes a significant deviation from the physician's policy and results in the RL policy converging to a suboptimal policy. To address this problem, we propose Dyna-Based Discriminative Reinforcement Learning (DDRL), which aims to learn an optimal treatment policy that aligns with the physician's treatment policy. Our method utilizes both Electronic Medical Record (EMR) data and simulated treatment episodes to mitigate the limitations of restricted treatment exploration. Additionally, by leveraging a Discriminator, we suppress the Q-values of out-of-distribution treatments, preventing overestimation and reducing deviation from the physician's treatment policy. The method was evaluated using data from Ajou University Hospital and Asan Medical Center. The expected return of the DDRL policy was 7.29 for Asan Medical Center and 4.55 for Ajou University Hospital, outperforming the Conservative Q-Learning (CQL) method by 3.4% and 5.6%, and surpassing the physician's policy by 18.7% and 8.3%, respectively. The cosine similarity between DDRL and physician policies was 81.68% for Asan Medical Center and 90.90% for Ajou University Hospital, which is 0.73% and 26.11% higher, respectively, than the CQL method.
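
To make the two ideas in the abstract concrete, here is a toy tabular sketch of a Dyna-style loop that replays simulated transitions alongside logged ones and subtracts a penalty from the Q-values of actions judged out-of-distribution. This is not the authors' implementation. The "discriminator" is replaced by a simple count-based stand-in (1.0 if a state-action pair appears in the log, 0.0 otherwise), the "simulator" is a model memorized from the log, and every name and hyperparameter (`train_toy_dyna_q`, `greedy_action`, `penalty`, `planning_steps`) is an assumption made for illustration.

```python
import random
from collections import defaultdict


def train_toy_dyna_q(logged, n_actions, gamma=0.9, lr=0.1,
                     penalty=1.0, planning_steps=5, epochs=50, seed=0):
    """Toy tabular Dyna-style Q-learning on logged (s, a, r, s2, done) tuples."""
    rng = random.Random(seed)
    q = defaultdict(lambda: [0.0] * n_actions)

    # Count-based stand-in for a learned discriminator (an assumption):
    # 1.0 if the pair was observed in the log, 0.0 otherwise.
    seen = {(s, a) for s, a, _, _, _ in logged}

    def in_dist(s, a):
        return 1.0 if (s, a) in seen else 0.0

    # Deterministic model memorized from the log, used for simulated replay.
    model = {(s, a): (r, s2, done) for s, a, r, s2, done in logged}
    model_keys = list(model)

    def update(s, a, r, s2, done):
        if done:
            target = r
        else:
            # Next-state values are penalized for out-of-distribution actions.
            target = r + gamma * max(
                q[s2][b] - penalty * (1.0 - in_dist(s2, b))
                for b in range(n_actions)
            )
        q[s][a] += lr * (target - q[s][a])

    for _ in range(epochs):
        for s, a, r, s2, done in logged:
            update(s, a, r, s2, done)            # logged experience
            for _ in range(planning_steps):      # simulated experience
                ms, ma = rng.choice(model_keys)
                update(ms, ma, *model[(ms, ma)])
    return q, in_dist


def greedy_action(q, in_dist, state, n_actions, penalty=1.0):
    """Pick the action with the highest penalized Q-value."""
    return max(range(n_actions),
               key=lambda a: q[state][a] - penalty * (1.0 - in_dist(state, a)))


# Hypothetical two-step episode; action 1 is never observed in the log.
logged = [(0, 0, -1.0, 1, False), (1, 0, 1.0, 2, True)]
q, in_dist = train_toy_dyna_q(logged, n_actions=2)
print(greedy_action(q, in_dist, 0, 2))               # with the penalty
print(greedy_action(q, in_dist, 0, 2, penalty=0.0))  # without the penalty
```

In this toy log the observed action in state 0 has a slightly negative value, so an unpenalized greedy policy prefers the never-observed action simply because its Q-value was never updated from zero. The penalty removes that artifact and keeps the policy on the logged action, which is the kind of deviation the abstract describes the Discriminator as reducing.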
