Optimizing automated sepsis treatment policies using Reinforcement Learning (RL) has gained attention as a way to improve the quality of medical care and address physician shortages. However, in an offline setting, an RL agent cannot explore all possible treatment episodes, which leads to overestimated Q-values for unexplored treatments. This overestimation causes a significant deviation from the physician's policy and makes the RL policy converge to a suboptimal one. To address this problem, we propose Dyna-Based Discriminative Reinforcement Learning (DDRL), which aims to learn an optimal treatment policy that aligns with the physician's treatment policy. Our method uses both Electronic Medical Record (EMR) data and simulated treatment episodes to mitigate the limitations of restricted treatment exploration. Additionally, by leveraging a Discriminator, we suppress the Q-values of out-of-distribution treatments, preventing overestimation and reducing deviation from the physician's treatment policy. The method was evaluated using data from Ajou University Hospital and Asan Medical Center. The expected return of the DDRL policy was 7.29 for Asan Medical Center and 4.55 for Ajou University Hospital, outperforming the Conservative Q-Learning (CQL) method by 3.4% and 5.6%, and surpassing the physician's policy by 18.7% and 8.3%, respectively. The cosine similarity between the DDRL and physician policies was 81.68% for Asan Medical Center and 90.90% for Ajou University Hospital, which is 0.73% and 26.11% higher, respectively, than that of the CQL method.
Kim et al. (Thu,) studied this question.
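The abstract reports a cosine similarity between the learned and physician policies but does not say how it is computed. One plausible reading, offered here only as an illustrative sketch, is to compare the empirical distributions over discrete treatment actions chosen by each policy. The helper names (`action_distribution`, `cosine_similarity`), the four-action space, and the toy action lists below are all hypothetical and are not taken from the paper.

```python
import math
from collections import Counter


def action_distribution(actions, n_actions):
    """Return the empirical frequency of each discrete action index.

    Assumption: treatments are encoded as integer indices in [0, n_actions).
    """
    if not actions:
        raise ValueError("actions must not be empty")
    counts = Counter(actions)
    total = len(actions)
    return [counts.get(a, 0) / total for a in range(n_actions)]


def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors."""
    if len(u) != len(v):
        raise ValueError("vectors must have the same length")
    dot = sum(x * y for x, y in zip(u, v))
    norm_u = math.sqrt(sum(x * x for x in u))
    norm_v = math.sqrt(sum(y * y for y in v))
    if norm_u == 0 or norm_v == 0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_u * norm_v)


# Hypothetical toy data: the action chosen by each policy at the same states.
N_ACTIONS = 4
physician_actions = [0, 1, 1, 2, 2, 2, 3, 1]
agent_actions = [0, 1, 2, 2, 2, 2, 3, 1]

physician_dist = action_distribution(physician_actions, N_ACTIONS)
agent_dist = action_distribution(agent_actions, N_ACTIONS)
similarity = cosine_similarity(physician_dist, agent_dist)
print(f"cosine similarity: {similarity:.4f}")
```

Under this reading, a value near 1 means the two policies select treatments with nearly the same frequencies. It does not by itself show that they agree state by state, so a per-state comparison may be needed depending on how the metric is actually defined.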