May 24, 2026 · Open Access

Agentic multi-hunk repair on SWE-bench Verified

Key points

This thesis investigates the performance of AI coding agents in repairing multi-hunk defects in Python code.
Evaluated four production agents: Claude Code, Codex, Gemini-cli, and Qwen Code.
Used Hunk-SWE, a curated multi-hunk subset of SWE-bench Verified consisting of thirty-two GitHub issues with accompanying test suites.
Measured repair accuracy, file-level localization, and operational efficiency (tokens and runtime) with a per-instance Docker pipeline, and characterized each instance with the BIRCH structural metrics.
Codex resolves 90.62% of defects, while Claude Code resolves 87.50%.
Claude Code attains the highest file-level localization rate (90.62%) and the smallest localization–repair gap.
The runtime cost of a failed attempt varies widely across agents: Gemini-cli’s failures run about 5.2× longer than its successes, whereas Codex’s failures are slightly shorter than its successes.

Abstract

Most research on automated program repair (APR) addresses defects whose fix occupies a single contiguous code region. However, a substantial proportion of real-world bugs require coordinated edits at multiple disjoint locations. These multi-hunk defects pose a distinct challenge: the unit of correctness is the complete patch, so an edit that is locally correct but globally inconsistent invalidates the repair. This thesis investigates how contemporary tool-using coding agents perform in this setting on real-world Python code.

The empirical study introduces Hunk-SWE, a curated multi-hunk subset of SWE-bench Verified comprising thirty-two human-validated GitHub issues from twelve established Python projects. Each instance is accompanied by the buggy commit, the developer’s gold patch, and an executable test suite. Four production agents (Claude Code, Codex, Gemini-cli, and Qwen Code) are evaluated end-to-end on Hunk-SWE using a per-instance Docker pipeline that pins both the agent CLI and the SWE-bench evaluation image. Final pass/fail judgment is delegated to the official swebench.harness grader, ensuring verdicts identical to the public leaderboard. The thesis additionally reports two further families of metrics: file-level localization and operational efficiency, measured in tokens and runtime. The BIRCH structural metrics of hunk divergence and spatial proximity are generalized from a Java-only formulation to a multi-language formulation and applied to every instance of Hunk-SWE.

Repair accuracy varies substantially across agents on Hunk-SWE. Codex resolves 29 of 32 instances (90.62%), Claude Code 28 of 32 (87.50%), Gemini-cli 19 of 32 (59.38%), and Qwen Code 13 of 32 (40.62%). The two lower-tier agents are strictly nested into the upper tier: every bug resolved by Qwen Code is also resolved by Gemini-cli, and every bug resolved by Gemini-cli is also resolved by both Codex and Claude Code. Codex and Claude Code are not strictly nested with respect to one another, however; the two agents differ on three bugs, and their union covers the same 30 instances as the union of all four agents.

Codex and Gemini-cli attain comparable file-level localization rates (78.12% and 81.25% respectively), indicating that the 31-percentage-point gap in repair accuracy between them arises from synthesis rather than from navigation. Claude Code attains the highest localization rate at 90.62% and the smallest localization–repair gap. Repair success declines as edits become more divergent and more dispersed, consistent with prior findings on Java multi-hunk repair.

Operational behavior also differs across agents: Gemini-cli’s failed attempts are approximately 5.2× longer than its successful ones (950s vs. 182s), Qwen Code’s failures are 1.5× longer (422s vs. 284s), Claude Code’s failures are 1.3× longer (457s vs. 348s), and Codex’s failures are slightly shorter than its successes (224s vs. 264s). These results indicate that accuracy and cost-of-failure are largely independent dimensions and should be reported together.
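
The headline figures in the abstract can be re-derived from the reported counts alone. The following is a minimal sketch, not the thesis's own analysis code: the variable names are illustrative, and the only inputs are the numbers quoted above (resolved counts out of 32, the 30-instance union, and the failed/successful runtimes in seconds). It recomputes the resolution rates, uses inclusion-exclusion to recover how many bugs Codex and Claude Code both resolve, and computes the failure-to-success runtime ratio per agent.

```python
TOTAL = 32  # instances in Hunk-SWE, as stated in the abstract

# Resolved counts as reported in the abstract.
resolved = {"Codex": 29, "Claude Code": 28, "Gemini-cli": 19, "Qwen Code": 13}

# Resolution rate in percent for each agent.
rates = {agent: 100 * n / TOTAL for agent, n in resolved.items()}

# Accuracy gap between Codex and Gemini-cli, in percentage points.
gap_codex_gemini = rates["Codex"] - rates["Gemini-cli"]

# Inclusion-exclusion: |A and B| = |A| + |B| - |A or B|.
# The abstract reports that Codex and Claude Code jointly cover 30 instances.
union = 30
both = resolved["Codex"] + resolved["Claude Code"] - union
only_codex = resolved["Codex"] - both
only_claude = resolved["Claude Code"] - both
differ = only_codex + only_claude  # bugs resolved by exactly one of the two

# Reported runtimes in seconds: (failed attempts, successful attempts).
runtimes = {
    "Gemini-cli": (950, 182),
    "Qwen Code": (422, 284),
    "Claude Code": (457, 348),
    "Codex": (224, 264),
}
failure_ratio = {agent: failed / ok for agent, (failed, ok) in runtimes.items()}

if __name__ == "__main__":
    for agent, rate in rates.items():
        print(f"{agent}: {rate:.2f}% resolved")
    print(f"Codex vs. Gemini-cli gap: {gap_codex_gemini:.2f} points")
    print(f"Resolved by both Codex and Claude Code: {both}")
    print(f"Resolved by exactly one of them: {differ}")
    for agent, ratio in failure_ratio.items():
        print(f"{agent}: failed attempts take {ratio:.1f}x as long as successes")
```

A ratio below 1 (as for Codex) means failed attempts finish sooner than successful ones. This is why the abstract treats cost-of-failure as a dimension separate from accuracy.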
