May 16, 2026 · Open Access

An overview of evaluation and enhancement methods for code generation by large language models

Key Points

This study aims to explore evaluation frameworks and enhancement strategies for code generated by large language models (LLMs).
Conducted a literature review analyzing existing research on LLM code evaluation frameworks and enhancement strategies.
Developed a code quality dimension taxonomy based on the ISO/IEC 25010 standard, focusing on four attributes: Functional Correctness, Security, Performance Efficiency, and Maintainability.
Identified gaps in research focus, particularly regarding Performance Efficiency and Maintainability evaluation.
Established evaluation frameworks exist for Functional Correctness and Security, while Performance Efficiency and Maintainability are significantly underexamined.
Identified the need for formal benchmarks and refined strategies for Performance Efficiency and Maintainability.
Suggested directions for future research include reinforcement learning techniques and multi-agent frameworks for code improvement.

Abstract

Recent advances in Large Language Models (LLMs) have led to the rapid deployment of automated generation tools capable of producing source code. As these models increasingly transition from experimental tools to established elements of software development, a critical question arises: to what extent do the models and the code they generate satisfy, or can be made to satisfy, the rigorous, multifaceted quality standards required for professional, real-world engineering? The primary aim of this study is to answer this question by exploring existing evaluation frameworks and enhancement strategies for LLMs and the code they generate. By examining how generated code quality is currently assessed and improved, we hope to determine whether current research methodologies provide balanced coverage of the software quality spectrum or whether significant disparities exist. We propose a code quality dimension taxonomy adapted from the ISO/IEC 25010 standard, encompassing four principal attributes: Functional Correctness (FC), Security (SE), Performance Efficiency (PE), and Maintainability (MA). Using this framework, we conduct a literature review analysing existing research on evaluation frameworks and enhancement strategies across these dimensions. Our analysis reveals a substantial imbalance in research focus. FC and, increasingly, SE have well-established evaluation frameworks and improvement strategies. In contrast, PE and MA remain significantly underexamined, with few standardised benchmarks and a lack of targeted fine-tuning approaches for these critical software quality dimensions. The survey identifies a pressing need for broader research into PE- and MA-oriented evaluation and enhancement.
We propose several promising directions: (i) the creation of formal benchmarks; (ii) the development of reinforcement learning techniques leveraging static and dynamic code feedback; and (iii) the use of multi-agent frameworks for iterative, critique-based improvement grounded in verifiable diagnostic artefacts.
