What question did this study set out to answer?

The survey aims to analyze the evolution of code generation methods and their current capabilities using large language models.

April 12, 2026 · Open Access

Code generation with large language models: a survey from neural program synthesis to autonomous software development

Key Points

Analyzed multiple transformer-based models pretrained on extensive code datasets.
Reviewed performance across various programming languages and task complexities.
Examined security implications and ethical considerations in model outputs.
Identified architectural and training challenges in code generation.
Models exhibit varying effectiveness in function-level synthesis and code repair across tasks.
Vulnerability rates are inconsistent and model-dependent, highlighting security risks.
Significant gaps exist in repository-level context handling and long-session consistency.
Future directions include autonomous agents and hybrid verification approaches.

Abstract

Large language models have reshaped code generation, driving a transition from rule-based and statistical methods to transformer-based architectures pretrained on vast code corpora. This survey traces the intellectual lineage from classical program synthesis through pre-transformer neural approaches to contemporary large-scale models, examining code generation capabilities across model architectures, training strategies, task taxonomies, evaluation methodologies, security implications, and ethical considerations. Contemporary models show proficiency in function-level synthesis, program repair, and documentation generation, though performance varies across programming languages and task complexities. Models ranging from 125 M to hundreds of billions of parameters are analyzed (including CodeBERT, GraphCodeBERT, Codex, AlphaCode, CodeGen, StarCoder, CodeLlama, WizardCoder, DeepSeek-Coder-V2, Yi-Coder, and GPT-4) with pass@1 accuracies on HumanEval spanning a wide range across model generations; multi-agent approaches show promise on repository-level and complex benchmarks, though all figures require cautious interpretation given data contamination risks and evaluation protocol differences. Security concerns persist, as models consistently generate vulnerable code across a range of configurations, with vulnerability rates varying substantially depending on model generation, task type, and prompt design. The survey provides critical analysis of architectural design choices, scaling law behavior for code versus natural language, training data curation challenges including legal and ethical dimensions, and the gap between benchmark performance and real-world software engineering workflows. Critical gaps are identified in handling repository-level context, maintaining consistency across extended generation sessions, and providing reliability guarantees. 
Future trajectories point toward autonomous software engineering agents, hybrid neuro-symbolic verification approaches, and multi-faceted evaluation frameworks, though foundational challenges in correctness verification, security assurance, and trustworthy generation remain unresolved.
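The pass@1 figures mentioned in the abstract are usually reported with the unbiased pass@k estimator that is common practice for HumanEval: draw n samples per problem, count the c samples that pass every unit test, and estimate the chance that at least one of k samples is correct. The sketch below is a minimal illustration of that estimator. Whether every model cited in the survey was evaluated this way is an assumption, and the function names and sample counts are hypothetical.

```python
from math import comb


def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimate for one problem.

    n: samples generated, c: samples passing all tests, k: sampling budget.
    """
    if not 0 <= c <= n or not 1 <= k <= n:
        raise ValueError("require 0 <= c <= n and 1 <= k <= n")
    # Fewer than k failing samples: every size-k subset contains a pass.
    if n - c < k:
        return 1.0
    # 1 minus the probability that all k drawn samples are failures.
    return 1.0 - comb(n - c, k) / comb(n, k)


def benchmark_pass_at_k(results: list[tuple[int, int]], k: int) -> float:
    """Mean pass@k over problems; results holds one (n, c) pair per problem."""
    return sum(pass_at_k(n, c, k) for n, c in results) / len(results)


if __name__ == "__main__":
    # Hypothetical counts for two problems, 10 samples each.
    print(benchmark_pass_at_k([(10, 3), (10, 5)], k=1))
```

For k = 1 the estimator reduces to c / n, the fraction of samples that pass. This is one reason scores are hard to compare across papers: sampling temperature, the number of samples n, and the prompt format all shift the result.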


Burak Gülmez studied this question.

synapsesocial.com/papers/69db38274fe01fead37c65d0 https://doi.org/10.1007/s10489-026-07230-0
