What question did this study set out to answer?

The study asks how reinforcement learning agents can be evaluated on what they actually learned, decoupled from the cumulative reward they collect.

May 21, 2026 · Open Access

What Did Your Agent Actually Learn? Decoupling Learning from Reward in Reinforcement Learning Evaluation

Key points

Aim: evaluate reinforcement learning agents by what they learned, decoupled from cumulative reward.
Introduces LearnLens, a Python package that computes a Learning Quality Score (LQS) from four behavioral probes.
Runs a controlled three-agent experiment comparing how LQS and cumulative reward rank the agents.
Applies an LQS-inspired penalty in a GRPO experiment and measures its effect on agent metrics.
In the three-agent experiment, LQS ranked the agents by true quality; cumulative reward did not.
In the GRPO experiment, the penalty raised LQS from 0.000 to 0.848 and reduced Hack Index from 1.00 to 0.00.
Cumulative reward increased only 46.5% over the same GRPO experiment.

Abstract

Reinforcement learning agents are evaluated almost exclusively by cumulative reward, a proxy that Skalse et al. (NeurIPS 2022) prove is mathematically hackable for any non-constant objective. Gao et al. (ICML 2023) document the empirical consequence: proxy reward rises while true performance peaks and falls, a divergence invisible to any system tracking only the reward curve. No standard evaluation tool diagnoses this gap post hoc, per agent, without modifying the training pipeline. We introduce LearnLens, a Python package computing a Learning Quality Score (LQS), a composite behavioral metric decomposing agent behavior into four probes grounded in the Goodhart taxonomy: Generalization (G), Consistency (C), Hack Index (H), and Reasoning Alignment (R), combined as LQS = sqrt(G x C) x (1 - sqrt(H)) + 0.15 x R x (1 - sqrt(H)), capped at 1.0. In a controlled three-agent experiment, LQS correctly ranked agents by true quality where cumulative reward did not. In a GRPO experiment using Qwen2.5-3B-Instruct on a T4 GPU over 500 steps, an LQS-inspired penalty reduced Hack Index from 1.00 to 0.00 and raised LQS from 0.000 to 0.848, while cumulative reward increased only 46.5%. LearnLens is pip-installable (pip install learnlens-rl), compatible with Gymnasium, Stable-Baselines3, and the OpenEnv ecosystem, and fully open-source.
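
The LQS formula stated in the abstract can be sketched in a few lines of Python. This is an illustrative reimplementation of the published formula only, not the LearnLens API: the function name `learning_quality_score` is hypothetical, and the requirement that each probe value lie in [0, 1] is an assumption the abstract does not state explicitly.

```python
import math


def learning_quality_score(g: float, c: float, h: float, r: float) -> float:
    """Combine the four probe values as given in the abstract:

    LQS = sqrt(G * C) * (1 - sqrt(H)) + 0.15 * R * (1 - sqrt(H)), capped at 1.0.

    Hypothetical helper for illustration; not part of the LearnLens package.
    """
    # Assumption: every probe is a score in [0, 1].
    for name, value in (("g", g), ("c", c), ("h", h), ("r", r)):
        if not 0.0 <= value <= 1.0:
            raise ValueError(f"{name} must lie in [0, 1], got {value}")

    # The Hack Index discounts both terms; H = 1 drives the score to zero.
    hack_discount = 1.0 - math.sqrt(h)
    score = math.sqrt(g * c) * hack_discount + 0.15 * r * hack_discount

    # The abstract caps the composite score at 1.0.
    return min(score, 1.0)


if __name__ == "__main__":
    # A fully hacked agent scores zero regardless of the other probes.
    print(learning_quality_score(0.9, 0.9, 1.0, 0.9))
    # With no hacking, the cap applies once sqrt(G * C) + 0.15 * R exceeds 1.
    print(learning_quality_score(1.0, 1.0, 0.0, 1.0))
```

Because both terms are multiplied by (1 - sqrt(H)), any agent with H = 1.00 scores 0 whatever its G, C, and R values, which is consistent with the reported LQS of 0.000 at Hack Index 1.00.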
