What question did this study set out to answer?

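This review synthesizes the literature on LLM-based automated unit test generation around three research questions: which unit testing tasks have been addressed using LLMs, how LLMs are adapted and integrated into the unit test generation pipeline, and what datasets, benchmarks, and evaluation practices existing studies employ.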

March 28, 2026

Enhancing Automated Unit Test Generation with Large Language Models: A Systematic Literature Review

Key Points

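The review covers 116 papers on LLM-based unit test generation: 69 published across 25 venues and 47 preprints.
It categorizes the research by testing task (test generation, test input generation, test oracle generation, and test evolution) and by LLM adaptation strategy (fine-tuning, prompt engineering, and agent-based approaches).
It analyzes the datasets, benchmarks, and evaluation practices used in existing studies.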
Current research mainly focuses on function- and class-level test generation.
Significant disparities exist in dataset characteristics and programming language coverage.
Promising future research directions include dataset optimization and benchmark enhancement.

Abstract

Automated unit test generation is a fundamental yet challenging task in software engineering, playing a critical role in ensuring software correctness, reliability, and maintainability. While traditional approaches such as search-based software testing and symbolic execution have achieved notable success, they often suffer from limited semantic understanding, high configuration costs, and scalability constraints. Recent advances in Large Language Models (LLMs) have fundamentally reshaped the landscape of automated unit testing by enabling models to reason over source code semantics and generate executable, context-aware test cases. Despite the rapid growth of this research area, a comprehensive and task-oriented synthesis of existing work remains lacking. This paper presents a systematic literature review of LLM-based unit test generation. This review draws on research from leading SE and AI conferences and journals, including 69 papers published across 25 distinct venues, along with 47 high-quality preprint papers, bringing the total to 116. Our review aims to answer three key research questions: (1) which unit testing tasks have been addressed using LLMs, (2) how LLMs are adapted and integrated into the unit test generation pipeline, and (3) what datasets, benchmarks, and evaluation practices are employed in existing studies. To this end, we organize the literature from a task-centric perspective, covering test generation, test input generation, test oracle generation, and test evolution, and from a methodological perspective, categorizing LLM adaptation strategies into fine-tuning, prompt engineering, and agent-based approaches. Our analysis reveals that current research predominantly focuses on function- and class-level test generation, with comparatively limited attention given to test input generation, oracle construction, and long-term test evolution. 
Decoder-only LLMs, particularly GPT-family and LLaMA-based models, dominate the field, while encoder-only and encoder–decoder models remain underexplored. We further observe substantial disparities in dataset characteristics, programming language coverage, and evaluation metrics, which hinder fair comparison and reproducibility across studies. Based on empirical evidence extracted from the surveyed literature, we identify key challenges facing LLM-based unit test generation. Building on these findings, we outline several promising research directions, such as dataset optimization, structure-aware context modeling, agent coordination mechanisms, and benchmark enhancement. This review provides a consolidated and evidence-driven foundation for future research, aiming to advance the development of scalable, reliable, and practically applicable LLM-driven unit testing techniques.
