Large Language Models (LLMs) aim to generate and understand human-like text by leveraging deep learning and natural language processing techniques. In software development, LLMs can automate coding tasks, reducing development time and improving code quality. Code refactoring is a technique for improving the internal quality of a code base without altering its external behavior. Leveraging LLMs for code refactoring can help developers improve code quality with minimal effort. This paper presents an empirical study evaluating the quality of refactored code produced by StarCoder2, GPT-4o-mini, GPT-4o, LLaMA 3, and DeepSeek-v3. Specifically, we (1) evaluate whether code refactored by the LLMs improves code quality, (2) examine how the types of refactoring applied by the different LLMs differ and compare their effectiveness, and (3) evaluate whether the quality of LLM-generated refactored code can be improved through one-shot prompting and chain-of-thought prompting. We analyze the refactoring capabilities of these LLMs on 30 open-source Java projects, assessing their ability to improve static code quality metrics, pass unit tests, and reduce code smells. Our findings reveal that production-grade models such as GPT-4o and DeepSeek-v3 achieve pass@5 unit test success rates above 90% on multi-file refactorings. LLaMA 3 achieves the highest overall code smell reduction, with a median reduction of 15.1%, while DeepSeek-v3 and GPT-4o achieve the greatest improvements in cohesion, coupling, and complexity. StarCoder2 demonstrates strengths in modularity improvements and systematic refactorings. Developers outperform LLMs in complex, context-sensitive refactorings such as attribute encapsulation.
We also show that prompt engineering significantly affects LLM performance: chain-of-thought prompting improves StarCoder2's test pass rate by 1.7% and increases code smell reduction compared to zero-shot prompting. One-shot prompting also expands the variety of refactorings LLMs can perform. These results suggest that LLMs are effective for many refactoring tasks, especially when guided with tailored prompts, but benefit from integration with human expertise for architectural or semantically complex changes. By providing insights into the capabilities and best practices for integrating LLMs into the software development process, our study aims to enhance the effectiveness and efficiency of code refactoring in real-world applications.
Cordeiro et al. studied this question.
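The abstract reports pass@5 unit test success rates without stating how the metric is computed. As a minimal sketch, assuming the unbiased pass@k estimator commonly used in code-generation evaluation (an assumption, not necessarily the procedure used in this study), the per-task value could be computed as follows. The function name `pass_at_k` and its parameters are hypothetical and chosen for illustration only.

```python
from math import comb


def pass_at_k(n: int, c: int, k: int) -> float:
    """Estimate pass@k for a single refactoring task.

    Assumed setup (not taken from the paper): n candidate refactorings
    were generated, and c of them passed the unit tests.
    The result is the probability that at least one of k candidates
    drawn without replacement from the n passes: 1 - C(n-c, k) / C(n, k).
    """
    if not 0 <= c <= n:
        raise ValueError("c must satisfy 0 <= c <= n")
    if not 1 <= k <= n:
        raise ValueError("k must satisfy 1 <= k <= n")
    # comb(n - c, k) is 0 when fewer than k candidates fail,
    # so the estimate is exactly 1.0 in that case.
    return 1.0 - comb(n - c, k) / comb(n, k)


# Hypothetical example: 10 candidates generated, 3 pass the tests.
print(round(pass_at_k(n=10, c=3, k=5), 4))
```

A model-level score would then be an aggregate, such as the mean, of these per-task values. When exactly k candidates are generated per task (n = k = 5), the estimate reduces to 1 if any candidate passes and 0 otherwise.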