Automated unit test generation is a fundamental yet challenging task in software engineering, playing a critical role in ensuring software correctness, reliability, and maintainability. While traditional approaches such as search-based software testing and symbolic execution have achieved notable success, they often suffer from limited semantic understanding, high configuration costs, and scalability constraints. Recent advances in Large Language Models (LLMs) have reshaped automated unit testing by enabling models to reason over source code semantics and generate executable, context-aware test cases. Despite the rapid growth of this research area, a comprehensive and task-oriented synthesis of existing work remains lacking. This paper presents a systematic literature review of LLM-based unit test generation. The review covers 116 papers in total: 69 papers published across 25 distinct venues, drawn from leading SE and AI conferences and journals, along with 47 high-quality preprints. It aims to answer three key research questions: (1) which unit testing tasks have been addressed using LLMs, (2) how LLMs are adapted and integrated into the unit test generation pipeline, and (3) what datasets, benchmarks, and evaluation practices are employed in existing studies. To this end, we organize the literature from a task-centric perspective, covering test generation, test input generation, test oracle generation, and test evolution, and from a methodological perspective, categorizing LLM adaptation strategies into fine-tuning, prompt engineering, and agent-based approaches. Our analysis reveals that current research predominantly focuses on function- and class-level test generation, with comparatively limited attention given to test input generation, oracle construction, and long-term test evolution. Decoder-only LLMs, particularly GPT-family and LLaMA-based models, dominate the field, while encoder-only and encoder–decoder models remain underexplored. We further observe substantial disparities in dataset characteristics, programming language coverage, and evaluation metrics, which hinder fair comparison and reproducibility across studies. Based on empirical evidence extracted from the surveyed literature, we identify key challenges facing LLM-based unit test generation. Building on these findings, we outline several promising research directions, such as dataset optimization, structure-aware context modeling, agent coordination mechanisms, and benchmark enhancement. This review provides a consolidated and evidence-driven foundation for future research, aiming to advance the development of scalable, reliable, and practically applicable LLM-driven unit testing techniques.
Junwei Zhang
Xing Hu
Cuiyun Gao
ACM Transactions on Software Engineering and Methodology
Zhejiang University
Harbin Institute of Technology
Zhejiang Sci-Tech University
Zhang et al. studied this question.
www.synapsesocial.com/papers/69c772938bbfbc51511e31ac — DOI: https://doi.org/10.1145/3802827