What question did this study set out to answer?

This study investigates how code context and different prompting strategies influence the quality of automated unit test generation using large language models.

March 3, 2026 · Open Access

Impact of Code Context and Prompting Strategies on Automated Unit Test Generation with Modern General-Purpose Large Language Models

Key Points

Exploratory study using 12 custom Python methods to avoid data leakage
Evaluation of various prompting strategies including sequential multi-turn and simple prompting
Comparison of generated unit tests from multiple large language models
Including docstrings improves branch coverage by 19.67 percentage points
Sequential multi-turn prompting achieves up to 96.3% branch coverage and a 57% average mutation score
Gemini 2.5 Pro achieves an 87% mutation score with full context, outperforming the other models and the 44% practitioner baseline
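
The mutation scores quoted above are the share of injected faults (mutants) that a test suite detects. A minimal sketch of that calculation follows, assuming three hand-written mutants of a hypothetical `clamp` function. The function, the mutants, and the tiny test suite are illustrative assumptions and are not taken from the study, whose actual tooling is not described here.

```python
from typing import Callable, List, Sequence, Tuple

TestCase = Tuple[Tuple[float, float, float], float]  # (arguments, expected result)


def clamp(x: float, lo: float, hi: float) -> float:
    """Hypothetical function under test: restrict x to the range [lo, hi]."""
    if x < lo:
        return lo
    if x > hi:
        return hi
    return x


def mutant_flipped_comparison(x: float, lo: float, hi: float) -> float:
    if x > lo:  # mutated: '<' became '>'
        return lo
    if x > hi:
        return hi
    return x


def mutant_wrong_upper_return(x: float, lo: float, hi: float) -> float:
    if x < lo:
        return lo
    if x > hi:
        return lo  # mutated: returns lo instead of hi
    return x


def mutant_missing_lower_check(x: float, lo: float, hi: float) -> float:
    # mutated: the lower-bound check was deleted
    if x > hi:
        return hi
    return x


def passes(func: Callable[..., float], tests: Sequence[TestCase]) -> bool:
    """True when every test case passes for the given implementation."""
    return all(func(*args) == expected for args, expected in tests)


def mutation_score(mutants: Sequence[Callable[..., float]],
                   tests: Sequence[TestCase]) -> float:
    """Fraction of mutants killed, i.e. failing at least one test."""
    killed = sum(1 for mutant in mutants if not passes(mutant, tests))
    return killed / len(mutants)


MUTANTS: List[Callable[..., float]] = [
    mutant_flipped_comparison,
    mutant_wrong_upper_return,
    mutant_missing_lower_check,
]

# This suite never exercises the upper bound, so one mutant survives.
TESTS: List[TestCase] = [((5, 0, 10), 5), ((-1, 0, 10), 0)]

# Adding one upper-bound case kills the remaining mutant.
STRONGER_TESTS: List[TestCase] = TESTS + [((11, 0, 10), 10)]
```

With these definitions, `mutation_score(MUTANTS, TESTS)` is 2/3 and `mutation_score(MUTANTS, STRONGER_TESTS)` is 1.0. A suite can therefore execute most of the code and still miss faults in a branch it never checks, which is why the summary reports mutation score alongside branch coverage.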

Abstract

• Docstring inclusion substantially improves test quality: +19.67pp branch coverage and +9.16pp compilation success compared to interface-only context, demonstrating that behavioural specifications are more critical than implementation details for effective test generation.
• Sequential multi-turn prompting achieves superior fault detection (96.3% branch coverage, 57% mutation score) compared to simple prompting, though at 187% increased computational cost, revealing a significant cost-quality trade-off.
• Gemini 2.5 Pro achieves 87% mutation score with full context, substantially exceeding both other LLMs and the 44% practitioner baseline, demonstrating that general-purpose LLMs can match or surpass human testers in fault detection for well-documented code.
• All evaluated LLMs systematically omit robustness tests for special values (None, inf, NaN), blind spots shared with human practitioners, indicating fundamental limitations in comprehensive scenario identification without explicit prompting.

Generative AI is gaining increasing attention in software engineering, where testing remains an indispensable reliability mechanism. According to the widely adopted testing pyramid, unit tests constitute the majority of test cases and are often schematic, requiring minimal domain expertise. Automatically generating such tests under the supervision of software engineers can significantly enhance productivity during the development phase of the software lifecycle. This exploratory study investigates the impact of code context and prompting strategies on the quality and adequacy of unit tests generated by various large language models (LLMs) from different providers, using 12 custom-developed Python methods designed specifically to avoid data leakage concerns. The results show that including docstrings notably improves code adequacy, while further extending context to the full implementation yields smaller incremental gains.
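
The three context levels compared here (interface only, interface plus docstring, full implementation) could be derived from one source file roughly as follows. This is a sketch under stated assumptions: the `apply_discount` example, the `build_context` helper, and the level names `"interface"`, `"docstring"`, and `"full"` are hypothetical, and the study's own prompt construction may differ. It relies on `ast.unparse`, so it needs Python 3.9 or later.

```python
import ast
import copy

# Hypothetical method under test; not one of the study's 12 methods.
SOURCE = '''
def apply_discount(price: float, percent: float) -> float:
    """Return price reduced by percent; percent must be within 0..100."""
    if not 0 <= percent <= 100:
        raise ValueError("percent out of range")
    return price * (1 - percent / 100)
'''


def build_context(source: str, level: str) -> str:
    """Return the code context for a prompt at the requested level.

    level is one of "interface", "docstring", or "full" (assumed names).
    """
    func = ast.parse(source).body[0]
    if not isinstance(func, ast.FunctionDef):
        raise ValueError("expected a single top-level function definition")
    if level == "full":
        return ast.unparse(func)
    if level not in ("interface", "docstring"):
        raise ValueError(f"unknown context level: {level!r}")

    # Keep the signature, drop the implementation.
    stub = copy.deepcopy(func)
    body = []
    doc = ast.get_docstring(func)
    if level == "docstring" and doc:
        body.append(ast.Expr(value=ast.Constant(value=doc)))
    body.append(ast.Expr(value=ast.Constant(value=...)))  # placeholder body
    stub.body = body
    ast.fix_missing_locations(stub)
    return ast.unparse(stub)


# One context string per level, ready to embed in a prompt.
CONTEXTS = {level: build_context(SOURCE, level)
            for level in ("interface", "docstring", "full")}
```

Only the `"docstring"` and `"full"` contexts carry the behavioural specification ("percent must be within 0..100"), which is the kind of information the reported docstring gains would depend on.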
The sequential multi-turn prompting strategy achieves the best results, with up to 96.3% branch coverage, a 57% average mutation score, and near-perfect compilation success rate. Among the six evaluated models, M5 (Gemini 2.5 Pro) demonstrated superior performance in both mutation score and branch coverage, while maintaining high compilation success rates. Notably, LLM-generated tests achieved comparable or superior mutation scores compared to tests written by a software practitioner, though with varying degrees of test redundancy. While limited in scope, this study provides initial insights into optimal configurations for LLM-based test generation. All custom code developed for this study and the resulting test suites are available at https://github.com/peetery/LLM-analysis.
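
The robustness tests that the abstract reports as commonly omitted, those for `None`, `inf`, and `NaN`, might look like the sketch below. The `normalize` function and its error-handling contract are assumptions made for illustration; they are not part of the study's code base.

```python
import math
import unittest


def normalize(value, lower, upper):
    """Hypothetical function under test: map value from [lower, upper] to [0, 1]."""
    if value is None:
        raise TypeError("value must be a number, not None")
    if math.isnan(value) or math.isinf(value):
        raise ValueError("value must be finite")
    if upper <= lower:
        raise ValueError("upper must be greater than lower")
    return (value - lower) / (upper - lower)


class NormalizeSpecialValues(unittest.TestCase):
    """Special-value cases that are easy to leave out of a generated suite."""

    def test_none_is_rejected(self):
        with self.assertRaises(TypeError):
            normalize(None, 0.0, 10.0)

    def test_nan_is_rejected(self):
        with self.assertRaises(ValueError):
            normalize(float("nan"), 0.0, 10.0)

    def test_infinities_are_rejected(self):
        for value in (float("inf"), float("-inf")):
            with self.subTest(value=value):
                with self.assertRaises(ValueError):
                    normalize(value, 0.0, 10.0)

    def test_typical_value(self):
        self.assertEqual(normalize(5.0, 0.0, 10.0), 0.5)
```

Saved as a module, the class can be run with `python -m unittest`. Since the abstract ties these omissions to the absence of explicit prompting, naming the special values in the prompt is one plausible way to elicit such tests.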


Cite This Study

Walczak et al. studied this question.

synapsesocial.com/papers/69a67f06f353c071a6f0adf0 https://doi.org/10.1016/j.jss.2026.112834
