What question did this study set out to answer?

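The study asks how well LLMs (StarCoder2, GPT-4o-mini, GPT-4o, LLaMA 3, and DeepSeek-v3) improve code quality through refactoring, how the types of refactoring they apply differ, and whether one-shot and chain-of-thought prompting improve the refactored code.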

March 13, 2026

An Empirical Study on the Code Refactoring Capability of Large Language Models

Key Points

Analyzed refactoring capabilities of LLMs on 30 open-source Java projects.
Evaluated improvements in static code quality metrics and unit test success rates.
Compared performance of models including StarCoder2, GPT-4o-mini, GPT-4o, LLaMA 3, and DeepSeek-v3.
Assessed impact of one-shot and chain-of-thought prompting on refactoring tasks.
GPT-4o and DeepSeek-v3 achieved pass@5 unit test success rates above 90% on multi-file refactorings.
LLaMA 3 achieved the highest overall code smell reduction, with a median reduction of 15.1%.
DeepSeek-v3 and GPT-4o achieved the greatest improvements in cohesion, coupling, and complexity.
StarCoder2 excelled in modularity and systematic refactorings.
Chain-of-thought prompting improved StarCoder2's test pass rate by 1.7% over zero-shot prompting and increased its code smell reduction.
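The prompting styles named in the key points differ mainly in what surrounds the code being refactored. The sketch below is a minimal illustration of that difference, not the study's actual prompts: the function names and all prompt wording are hypothetical.

```python
# Hypothetical prompt builders. The wording is illustrative only and is
# not taken from the study.

def zero_shot_prompt(code: str) -> str:
    """Ask for a refactoring with no example and no reasoning guidance."""
    return (
        "Refactor the following Java code to improve its quality "
        "without changing its external behavior.\n\n"
        f"{code}"
    )


def one_shot_prompt(code: str, example_before: str, example_after: str) -> str:
    """Prepend a single worked before/after refactoring example."""
    return (
        "Here is an example refactoring.\n"
        f"Before:\n{example_before}\n"
        f"After:\n{example_after}\n\n"
        + zero_shot_prompt(code)
    )


def chain_of_thought_prompt(code: str) -> str:
    """Ask the model to reason step by step before emitting code."""
    return (
        zero_shot_prompt(code)
        + "\n\nFirst list the code smells you see, then explain step by step "
        "which refactorings address them, then output the refactored code."
    )
```

Building the one-shot and chain-of-thought variants on top of the zero-shot text keeps the task statement identical across styles, so any difference in results comes from the added example or reasoning instruction.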

Abstract

Large Language Models (LLMs) aim to generate and understand human-like text by leveraging deep learning and natural language processing techniques. In software development, LLMs can enhance the coding experience through coding automation, reducing development time and improving code quality. Code refactoring is a technique used to enhance the internal quality of the code base without altering its external functionalities. Leveraging LLMs for code refactoring can help developers improve code quality with minimal effort. This paper presents an empirical study evaluating the quality of refactored code produced by StarCoder2, GPT-4o-mini, GPT-4o, LLaMA 3, and DeepSeek-v3. Specifically, we (1) evaluate whether the code refactored by the LLMs can improve code quality, (2) understand the differences between the types of refactoring applied by the different LLMs and compare their effectiveness, and (3) evaluate whether the quality of the refactored code generated by the LLM can be improved through one-shot prompting and chain-of-thought prompting. We analyze the refactoring capabilities of LLMs on 30 open-source Java projects. We evaluate StarCoder2, LLaMA 3, GPT-4o-mini, GPT-4o, and DeepSeek-v3 on their ability to improve static code quality metrics, pass unit tests, and reduce code smells. Our findings reveal that production-grade models such as GPT-4o and DeepSeek-v3 achieve pass@5 unit test success rates above 90% on multi-file refactorings. LLaMA 3 achieves the highest overall code smell reduction with a median reduction of 15.1%, while DeepSeek-v3 and GPT-4o achieve the greatest improvements in cohesion, coupling, and complexity. StarCoder2 demonstrates strengths in modularity improvements and systematic refactorings. Developers outperform LLMs in complex, context-sensitive refactorings such as attribute encapsulation. 
We also show that prompt engineering significantly affects LLM performance: chain-of-thought prompting improves StarCoder2's test pass rate by 1.7% and increases code smell reduction compared to zero-shot prompting. One-shot prompting also expands the variety of refactorings LLMs can perform. These results suggest that LLMs are effective for many refactoring tasks, especially when guided with tailored prompts, but benefit from integration with human expertise for architectural or semantically complex changes. By providing insights into the capabilities and best practices for integrating LLMs into the software development process, our study aims to enhance the effectiveness and efficiency of code refactoring in real-world applications.
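The abstract reports unit test success as pass@5 but does not say how it is computed. The sketch below assumes the widely used pass@k estimator: given n generated refactorings for a task, of which c pass all unit tests, it estimates the chance that at least one of k sampled attempts passes. The function names and the sample numbers are illustrative assumptions, not the study's data.

```python
from math import comb


def pass_at_k(n: int, c: int, k: int) -> float:
    """Estimate pass@k for one task.

    n: number of refactorings generated for the task
    c: how many of them pass all unit tests
    k: number of attempts considered (5 for pass@5)
    """
    if not 0 <= c <= n:
        raise ValueError("c must be between 0 and n")
    if not 1 <= k <= n:
        raise ValueError("k must be between 1 and n")
    # 1 minus the probability that all k sampled attempts are failures.
    return 1.0 - comb(n - c, k) / comb(n, k)


def mean_pass_at_k(results: list[tuple[int, int]], k: int) -> float:
    """Average pass@k over tasks, each given as an (n, c) pair."""
    return sum(pass_at_k(n, c, k) for n, c in results) / len(results)


# Made-up example: three tasks with five attempts each.
example = [(5, 0), (5, 2), (5, 5)]
score = mean_pass_at_k(example, k=5)
```

When n equals k, as in the example, the estimator reduces to a simple rule: a task counts as solved if any of its five attempts passes the tests.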
