August 1, 2025Open Access

Human Strategy Adaptation in Reinforcement Learning Resembles Policy Gradient Ascent

Key Points

MAIN FINDING: Humans refine their reinforcement learning strategies akin to policy gradient ascent methods.
KEY EVIDENCE: DynamicRL outperforms fixed-parameter models by tracking adaptive learning parameters in real-time.
APPROACH: The study used neural networks to analyze how learning rates and decision temperatures evolve.
SIGNIFICANCE: This method enhances understanding of meta-learning and adaptive behavior in both biological and artificial systems.

Abstract

A hallmark of intelligence is the ability to adapt behavior to changing environments, which requires adapting one's own learning strategies. This phenomenon is known as learning to learn in cognitive science and meta-learning in artificial intelligence. While this phenomenon is well-established in humans and animals, no quantitative framework exists for characterizing the trajectories through which biological agents adapt their learning strategies. Previous computational studies that either assume fixed strategies or use task-optimized neural networks do not explain how humans refine strategies through experience. Here we show that humans adjust their reinforcement learning strategies resembling principles of gradient-based online optimization. We introduce DynamicRL, a framework using neural networks to track how participants' learning parameters (e.g., learning rates and decision temperatures) evolve throughout experiments. Across four diverse bandit tasks, DynamicRL consistently outperforms traditional reinforcement learning models with fixed parameters, demonstrating that humans continuously adapt their strategies over time. These dynamically-estimated parameters reveal trajectories that systematically increase expected rewards, with updates significantly aligned with policy gradient ascent directions. Furthermore, this learning process operates across multiple timescales, with strategy parameters updating more slowly than behavioral choices, and update effectiveness correlates with local gradient strength in the reward landscape. Our work offers a generalizable approach for characterizing meta-learning trajectories, bridging theories of biological and artificial intelligence by providing a quantitative method for studying how adaptive behavior is optimized through experience.

Human Strategy Adaptation in Reinforcement Learning Resembles Policy Gradient Ascent

Key Points

Abstract

Cite This Study

Also Consider

Also Consider