Recent advances in Large Language Models (LLMs) have led to the rapid deployment of automated tools capable of generating source code. As these models transition from experimental tools to established elements of software development, a critical question arises: to what extent do the models and the code they generate satisfy, or can be made to satisfy, the rigorous, multifaceted quality standards required for professional, real-world engineering? This study addresses that question by surveying existing evaluation frameworks and enhancement strategies for LLMs and the code they generate. By examining how generated code quality is currently assessed and improved, we aim to determine whether current research methodologies provide balanced coverage of the software quality spectrum or whether significant disparities exist. We propose a taxonomy of code quality dimensions adapted from the ISO/IEC 25010 standard, encompassing four principal attributes: Functional Correctness (FC), Security (SE), Performance Efficiency (PE), and Maintainability (MA). Using this framework, we conduct a literature review analysing existing research on evaluation frameworks and enhancement strategies across these dimensions. Our analysis reveals a substantial imbalance in research focus. FC and, increasingly, SE have well-established evaluation frameworks and improvement strategies. In contrast, PE and MA remain significantly underexamined, with few standardised benchmarks and a lack of targeted fine-tuning approaches for these critical software quality dimensions. The survey identifies a pressing need for broader research into PE- and MA-oriented evaluation and enhancement.
We propose several promising directions: (i) the creation of formal benchmarks; (ii) the development of reinforcement learning techniques leveraging static and dynamic code feedback; and (iii) the use of multi-agent frameworks for iterative, critique-based improvement grounded in verifiable diagnostic artefacts.
Jacob Truong
Van Nguyen
Thanh Thi Nguyen
Information and Software Technology
Monash University
University of the Sunshine Coast
Truong et al. studied this question.
www.synapsesocial.com/papers/6a0808afa487c87a6a40af0d — DOI: https://doi.org/10.1016/j.infsof.2026.108185