• Docstring inclusion substantially improves test quality: +19.67pp branch coverage and +9.16pp compilation success compared to interface-only context, demonstrating that behavioural specifications are more critical than implementation details for effective test generation.
• Sequential multi-turn prompting achieves superior fault detection (96.3% branch coverage, 57% mutation score) compared to simple prompting, though at a 187% higher computational cost, revealing a significant cost-quality trade-off.
• Gemini 2.5 Pro achieves an 87% mutation score with full context, substantially exceeding both the other LLMs and the 44% practitioner baseline, demonstrating that general-purpose LLMs can match or surpass human testers in fault detection for well-documented code.
• All evaluated LLMs systematically omit robustness tests for special values (None, inf, NaN), blind spots shared with human practitioners, indicating fundamental limitations in comprehensive scenario identification without explicit prompting.

Generative AI is gaining increasing attention in software engineering, where testing remains an indispensable reliability mechanism. According to the widely adopted testing pyramid, unit tests constitute the majority of test cases and are often schematic, requiring minimal domain expertise. Automatically generating such tests under the supervision of software engineers can significantly enhance productivity during the development phase of the software lifecycle. This exploratory study investigates the impact of code context and prompting strategies on the quality and adequacy of unit tests generated by various large language models (LLMs) from different providers, using 12 custom-developed Python methods designed specifically to avoid data leakage concerns. The results show that including docstrings notably improves code adequacy, while further extending context to the full implementation yields smaller incremental gains.
The sequential multi-turn prompting strategy achieves the best results, with up to 96.3% branch coverage, a 57% average mutation score, and a near-perfect compilation success rate. Among the six evaluated models, M5 (Gemini 2.5 Pro) demonstrated superior performance in both mutation score and branch coverage while maintaining high compilation success rates. Notably, LLM-generated tests achieved mutation scores comparable or superior to those of tests written by a software practitioner, though with varying degrees of test redundancy. While limited in scope, this study provides initial insights into optimal configurations for LLM-based test generation. All custom code developed for this study and the resulting test suites are available at https://github.com/peetery/LLM-analysis.
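To make the special-value blind spot concrete, the following is a minimal sketch of the kind of robustness checks that the evaluated models and the practitioner tended to omit. The function `apply_discount` and the helper `rejects` are hypothetical and are not among the 12 methods used in the study; the documented error behaviour is an assumption chosen purely for illustration.

```python
import math


def apply_discount(price, rate):
    """Return price reduced by rate, where rate lies in [0.0, 1.0].

    Hypothetical example function (not from the study).
    Raises TypeError for None inputs and ValueError for
    non-finite or out-of-range inputs.
    """
    if price is None or rate is None:
        raise TypeError("price and rate must not be None")
    if not (math.isfinite(price) and math.isfinite(rate)):
        raise ValueError("price and rate must be finite")
    if price < 0 or not 0.0 <= rate <= 1.0:
        raise ValueError("price must be >= 0 and rate within [0, 1]")
    return price * (1.0 - rate)


def rejects(exc_type, *args):
    """Return True if apply_discount(*args) raises exc_type."""
    try:
        apply_discount(*args)
    except exc_type:
        return True
    return False


# Typical "happy path" check that generated suites usually include.
assert apply_discount(100.0, 0.25) == 75.0

# Robustness checks for special values: None, inf, NaN.
assert rejects(TypeError, None, 0.1)
assert rejects(ValueError, float("inf"), 0.1)
assert rejects(ValueError, float("nan"), 0.1)
assert rejects(ValueError, 100.0, float("nan"))
```

Checks of this kind only make sense when the docstring states how special values are handled, which is consistent with the finding that behavioural specifications matter more than implementation details.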
Walczak et al. studied this question.