What question did this study set out to answer?

The research aims to evaluate the effectiveness of risk-sensitive reinforcement learning algorithms on heavy-tailed return distributions.

March 5, 2026 | Open Access

From Risk-Neutral to Risk-Sensitive Reinforcement Learning: Actor–Critic vs REINFORCE with Tail-Based Risk Measures

Key Points

The research aims to evaluate the effectiveness of risk-sensitive reinforcement learning algorithms on heavy-tailed return distributions.
The study compared REINFORCE with baseline (REINFORCE-BL) and episodic batched actor–critic (A2C-B).
It used a simple portfolio environment with three discrete actions: market entry, market exit, and hold.
Both algorithms were trained under four risk scenarios: risk-neutral, VaR, CVaR, and EVaR.
A2C-B consistently outperformed REINFORCE-BL across all scenarios with higher average long-term rewards.
A2C-B showed faster convergence rates and more stable learning curves than REINFORCE-BL.
VaR and CVaR penalties significantly reduced rewards and increased learning volatility for REINFORCE-BL but caused only moderate reward reductions for A2C-B.
In the EVaR scenario, both algorithms achieved high rewards, but A2C-B remained slightly more stable.
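
To make the setup in the key points concrete, here is a minimal sketch of a portfolio environment with the three discrete actions. It is an illustration only, not the paper's environment: the class name `ToyPortfolioEnv`, the action encoding, the `(t, in_market)` state, and the reward rule (the asset's return while in the market, zero otherwise, with no transaction costs) are all assumptions.

```python
from dataclasses import dataclass

# Hypothetical action encoding; the paper's own encoding is not given here.
HOLD, ENTER, EXIT = 0, 1, 2


@dataclass
class ToyPortfolioEnv:
    """Illustrative single-asset environment (assumed design, not the paper's)."""
    returns: list            # per-step asset returns, assumed to be given
    t: int = 0               # current time index
    in_market: bool = False  # whether the agent currently holds the asset

    def reset(self):
        self.t = 0
        self.in_market = False
        return (self.t, self.in_market)

    def step(self, action):
        if action == ENTER:
            self.in_market = True
        elif action == EXIT:
            self.in_market = False
        # HOLD leaves the current position unchanged.

        # Assumed reward: earn the asset return only while in the market.
        reward = self.returns[self.t] if self.in_market else 0.0
        self.t += 1
        done = self.t >= len(self.returns)
        return (self.t, self.in_market), reward, done
```

An episode is a loop of `reset()` followed by repeated `step(action)` calls until `done` is true. The per-episode sum of rewards is the quantity to which a risk measure would then be applied.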

Abstract

This study investigates the application of risk-sensitive reinforcement learning on heavy-tailed return series by comparing two primary algorithms: REINFORCE with baseline (REINFORCE-BL) and episodic batched actor–critic (A2C-B). Initial exploratory analysis reveals an asymmetric return distribution with numerous extreme outliers, rendering variance-based risk measures inadequate and motivating the integration of tail-based risk measures, specifically Value at Risk (VaR), Conditional Value at Risk (CVaR), and Entropic Value at Risk (EVaR), into the RL objective function. This study constructs a simple portfolio environment with discrete actions (market entry, market exit, and hold) and trains both algorithms under four scenarios: risk-neutral, VaR, CVaR, and EVaR. Experimental results demonstrate that A2C-B consistently outperforms REINFORCE-BL across all scenarios, exhibiting higher average long-term rewards, faster convergence rates, and more stable learning curves. While VaR and CVaR penalties significantly reduce rewards and increase learning volatility for REINFORCE-BL, A2C-B experiences only moderate reward reductions while maintaining stability. In the EVaR scenario, both algorithms yield high rewards, yet A2C-B retains a slight advantage in terms of stability. These findings indicate that in environments with heavy-tailed returns, employing coherent risk measures (particularly CVaR and EVaR) within an actor–critic framework offers a more compelling trade-off between tail risk control and average performance, serving as a viable baseline for the development of risk-sensitive RL in finance and actuarial science.
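
As a rough illustration of the three tail-based measures named in the abstract, the sketch below estimates them from a sample of losses (negated returns). It is not the paper's implementation. The function names, the confidence level `alpha`, the synthetic heavy-tailed sample, and the coarse grid search used for EVaR are all assumptions. EVaR follows the standard definition inf over z > 0 of (1/z) ln(M(z) / (1 - alpha)), where M is the moment generating function of the losses.

```python
import math
import random


def var_cvar(losses, alpha=0.95):
    """Empirical VaR and CVaR of a loss sample at confidence level alpha."""
    ordered = sorted(losses)
    # Index of the empirical alpha-quantile (one of several common conventions).
    k = max(0, math.ceil(alpha * len(ordered)) - 1)
    var = ordered[k]
    tail = ordered[k:]  # losses at or beyond VaR
    return var, sum(tail) / len(tail)


def evar(losses, alpha=0.95):
    """Coarse grid-search approximation of EVaR (an upper bound on the infimum)."""
    n = len(losses)
    peak = max(losses)
    best = math.inf
    for k in range(-80, 81):
        z = 10.0 ** (k / 20)  # log-spaced grid from 1e-4 to 1e4 (assumed range)
        # Log of the empirical moment generating function, computed with the
        # log-sum-exp trick so that large losses do not overflow exp().
        log_mgf = z * peak + math.log(
            sum(math.exp(z * (x - peak)) for x in losses) / n
        )
        best = min(best, (log_mgf - math.log(1.0 - alpha)) / z)
    return best


if __name__ == "__main__":
    random.seed(0)
    # Synthetic returns: Gaussian noise plus rare large crashes (illustrative only).
    returns = [
        random.gauss(0.0005, 0.01)
        - (0.05 * random.paretovariate(3.0) if random.random() < 0.02 else 0.0)
        for _ in range(5000)
    ]
    losses = [-r for r in returns]
    var, cvar = var_cvar(losses)
    print(f"VaR={var:.4f}  CVaR={cvar:.4f}  EVaR={evar(losses):.4f}")
```

For any sample the three estimates satisfy VaR ≤ CVaR ≤ EVaR, which is why EVaR acts as the most conservative of the three measures. How each measure enters the training objective (for example, as a penalty subtracted from the expected return) is a separate design choice that this sketch does not cover.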
