March 30, 2026 | Open Access

Prompt engineering in LLMs for automated unit test generation: A large-scale study

Key Points

This research evaluates the effectiveness of large language models (LLMs) in generating unit tests.
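Conducted a large-scale evaluation of LLMs across 216,300 generated test cases targeting Defects4J, SF110, and CMD.
Analyzed the performance of GPT-3.5, GPT-4, Mistral 7B, and Mixtral 8x7B against EvoSuite.
Evaluated five prompting techniques (Zero-Shot Learning, Few-Shot Learning, Chain-of-Thought, Tree-of-Thought, and Guided Tree-of-Thought) and their impact on the generated tests.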
Found Few-Shot Learning enhances test generation effectiveness compared to traditional methods.
Highlighted high compilation failure rates in LLM-generated tests, up to 86%.
Identified recurring design issues in generated tests, such as Magic Number Tests and Assertion Roulette, that hinder maintainability.

Abstract

Unit testing is essential for software reliability, yet manual test creation is time-consuming and often neglected. Search-based software testing improves efficiency, but it produces tests with poor readability and maintainability. LLMs show promise for test generation, but existing research lacks comprehensive evaluation across execution-driven assessment, reasoning-based prompting, and real-world testing scenarios. This study presents the first large-scale empirical evaluation of LLM-generated unit tests at the full class level, systematically analyzing four state-of-the-art models (GPT-3.5, GPT-4, Mistral 7B, and Mixtral 8x7B) against EvoSuite across 216,300 generated test cases targeting Defects4J, SF110, and CMD (a dataset mitigating LLM training data leakage). We evaluate five prompting techniques, namely Zero-Shot Learning (ZSL), Few-Shot Learning (FSL), Chain-of-Thought (CoT), Tree-of-Thought (ToT), and Guided Tree-of-Thought (GToT), assessing syntactic correctness, compilability, hallucination-driven failures, readability, code coverage metrics, and test smells. Reasoning-based prompting, particularly GToT, significantly enhances test reliability, compilability, and structural adherence in general-purpose LLMs. However, hallucination-driven failures remain a persistent challenge, manifesting as non-existent symbol references, incorrect API calls, and fabricated dependencies, and resulting in high compilation failure rates (up to 86%). Moreover, test smell analysis reveals that while LLM-generated tests are generally more readable than those produced by traditional tools, they still suffer from recurring design issues such as Magic Number Tests and Assertion Roulette, which hinder maintainability.
Overall, our findings indicate that LLMs can serve as effective assistive tools for generating readable and maintainable test suites, but hybrid approaches that combine LLM-based generation with automated validation and search-based refinement are required to achieve reliable and production-ready test generation.
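
The two test smells named in the abstract can be illustrated with a minimal sketch. It assumes the commonly used definitions: a Magic Number Test asserts against unexplained numeric literals, and Assertion Roulette packs several message-less assertions into one test method, so a failure does not say which scenario broke. The study targets Java; Python's `unittest` is used here only for brevity. The function `apply_discount`, both test classes, and the helper `run_suite` are hypothetical and do not come from the paper.

```python
import unittest


def apply_discount(price: float, rate: float) -> float:
    """Hypothetical function under test: price after a fractional discount."""
    return round(price * (1 - rate), 2)


class SmellyDiscountTest(unittest.TestCase):
    def test_discount(self):
        # Magic Number Test: unexplained literals appear directly in assertions.
        # Assertion Roulette: several assertions share one test and carry no
        # message, so a failure report does not identify the broken scenario.
        self.assertEqual(apply_discount(200.0, 0.25), 150.0)
        self.assertEqual(apply_discount(80.0, 0.5), 40.0)
        self.assertEqual(apply_discount(19.99, 0.0), 19.99)


class CleanerDiscountTest(unittest.TestCase):
    # Named constants document what each number means.
    FULL_PRICE = 200.0
    QUARTER_OFF = 0.25
    EXPECTED_AFTER_QUARTER_OFF = 150.0

    def test_quarter_discount(self):
        # One scenario per test, with an explanatory failure message.
        self.assertEqual(
            apply_discount(self.FULL_PRICE, self.QUARTER_OFF),
            self.EXPECTED_AFTER_QUARTER_OFF,
            "a 25% discount on the full price should yield the expected total",
        )

    def test_zero_discount_leaves_price_unchanged(self):
        price = 19.99
        self.assertEqual(
            apply_discount(price, 0.0),
            price,
            "a 0% discount must not change the price",
        )


def run_suite(case: type) -> unittest.TestResult:
    """Run every test method of one TestCase class and return the result."""
    suite = unittest.defaultTestLoader.loadTestsFromTestCase(case)
    result = unittest.TestResult()
    suite.run(result)
    return result
```

Both classes pass against the same function. The difference is in maintainability: the second version names its constants and isolates each scenario, which is the kind of property a test smell analysis examines.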
