Software engineering agents have shown strong real-world debugging capabilities, yet a core mismatch persists between multi-step, interactive deployment and training that uses static datasets of isolated code changes. This thesis presents a fully open, end-to-end system for online, execution-free Reinforcement Learning (RL) that trains Large Language Models (LLMs) inside a coding agent scaffold (Nano). The system uses live weight synchronization via the NVIDIA Collective Communications Library (NCCL) to push policy updates to running inference servers during training. While industry systems exhibit agentic coding capabilities consistent with agent-based training, their methods remain undisclosed. This work provides an open, reproducible recipe. We train using Group Sequence Policy Optimization (GSPO) with a light Kullback-Leibler (KL) regularizer on a 1,000-task curriculum spanning ten programming languages, completing in 144 GPU-hours on three A100 Graphics Processing Units (GPUs). On SWE-Bench-Verified, the patch submission rate (the share of non-empty patches) rises from 37% to 78% and the mean patch-similarity reward increases by 54%, while test-verified success remains approximately flat at 6–7%. These results establish that online, execution-free RL reliably improves agent operational competence within academic compute budgets. Translating these gains into functional correctness likely requires longer training or an alternative reward design. We release all infrastructure, methodology, and evaluation protocols to enable reproducible study of online RL for interactive coding agents.
Bjarni Bjarnason (Wed,) studied this question.