March 3, 2026 (Open Access)

Learning Agency in the Terminal with Repository-Level Reinforcement Learning

Key Points

Operational competence improves substantially: on SWE-Bench-Verified, patch submission rates (non-empty patches) rise from 37% to 78%.
Training uses Group Sequence Policy Optimization (GSPO) with a light KL regularizer on a 1,000-task curriculum spanning ten programming languages.
Live weight synchronization via NCCL pushes policy updates to running inference servers during training.
Test-verified success stays roughly flat at 6–7%, so translating these gains into functional correctness likely requires longer training or alternative reward design.

Abstract

Software engineering agents have shown strong real-world debugging capabilities, yet a core mismatch persists between multi-step, interactive deployment and training that uses static datasets of isolated code changes. This thesis presents a fully open, end-to-end system for online, execution-free Reinforcement Learning (RL) that trains Large Language Models (LLMs) inside a coding agent scaffold (Nano). The system uses live weight synchronization via NCCL to push policy updates to running inference servers during training. While industry systems exhibit agentic coding capabilities consistent with agent-based training, their methods remain undisclosed. This work provides an open, reproducible recipe. We train using Group Sequence Policy Optimization (GSPO) with a light Kullback-Leibler (KL) regularizer on a 1,000-task curriculum spanning ten programming languages, completing in 144 Graphics Processing Unit (GPU)-hours on 3 A100s. On SWE-Bench-Verified, patch submission rates (non-empty patches) rise from 37% to 78% and mean patch-similarity rewards increase by 54%, while test-verified success remains approximately flat at 6–7%. These results establish that online, execution-free RL reliably improves agent operational competence within academic compute budgets. Translating these gains to functional correctness likely requires longer training or alternative reward design. We release all infrastructure, methodology, and evaluation protocols to enable reproducible study of online RL for interactive coding agents.
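To make the idea of an execution-free, patch-similarity reward concrete, the following is a minimal sketch and not the thesis's actual implementation. It assumes the reward is a character-level similarity ratio between the agent's patch and a reference patch, computed with Python's standard `difflib`, and that rewards are normalized within a group of rollouts for the same task, as group-based policy optimization methods commonly do. The function names `patch_similarity` and `group_advantages` are hypothetical.

```python
import difflib
from statistics import mean, pstdev


def patch_similarity(candidate: str, reference: str) -> float:
    """Hypothetical execution-free reward in [0, 1].

    No tests are run: the score depends only on textual overlap
    with a reference patch. An empty submission scores 0.
    """
    if not candidate.strip():
        return 0.0
    # autojunk=False avoids difflib's heuristic of ignoring frequent characters.
    matcher = difflib.SequenceMatcher(None, candidate, reference, autojunk=False)
    return matcher.ratio()


def group_advantages(rewards: list[float], eps: float = 1e-6) -> list[float]:
    """Normalize rewards within one group of rollouts for the same task.

    Assumption: advantages are (reward - group mean) / group std,
    with eps guarding against a zero standard deviation.
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


if __name__ == "__main__":
    reference = "-    return a - b\n+    return a + b\n"
    rollouts = [
        "",                                         # agent submitted nothing
        "-    return a - b\n+    return a * b\n",   # near miss
        reference,                                  # exact match
    ]
    rewards = [patch_similarity(p, reference) for p in rollouts]
    print(rewards)
    print(group_advantages(rewards))
```

Under this kind of reward, an agent that learns to submit any plausible non-empty patch already scores well above zero, which is consistent with submission rates and similarity rewards rising while test-verified success stays flat.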
