October 11, 2025

LLMs as Test Generators: A Comparative Benchmarking Study

Key Points

Some small-scale open LLMs generate automated tests of a quality comparable to those from large-scale models.
The evaluation covered 12 small-scale and 6 large-scale models, comparing the number of tests generated, the mutation score, the cyclomatic complexity of the generated code, and the number of test smells.
EvoSuite served as the baseline for code quality and the number of methods tested.
xLan, Gemma2, and DeepSeekCoder gave the best overall results, producing as many tests as large-scale models with fewer smells and a better mutation score.

Abstract

Automated testing is a key practice adopted by the software industry to verify software quality. However, automated tests are costly to develop and maintain. Recently, the use of LLMs to generate automated tests has been explored as a viable alternative. Ongoing efforts focus on improving generation by providing richer context and by post-processing the output to correct errors. Small-scale open LLMs that can run on modest hardware, however, have received limited attention. This work compares large-scale LLMs (e.g., GPT and Gemini) with small-scale open-source models in terms of the number of tests generated and their quality, measured by the mutation score, the cyclomatic complexity of the generated code, and the number of test smells it contains. We evaluated 12 small-scale models against 6 large-scale ones and used EvoSuite to establish a baseline for code quality and the number of methods tested. Our results show that some small-scale LLMs perform well in test generation tasks. xLan, Gemma2, and DeepSeekCoder gave the best overall results, producing as many tests as large-scale models, with fewer smells and a better mutation score.
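
For readers unfamiliar with the metric, mutation score is conventionally the fraction of deliberately injected faults (mutants) that a test suite detects. The sketch below assumes that common definition (killed mutants divided by total mutants). The abstract does not say which mutation tool the study used or how it handled equivalent mutants. The function name and the numbers are hypothetical and are not results from the study.

```python
def mutation_score(killed: int, total: int) -> float:
    """Return the fraction of mutants killed by a test suite (0.0 to 1.0).

    Assumes the conventional definition: killed mutants / total mutants.
    """
    if total <= 0:
        raise ValueError("total must be a positive number of mutants")
    if not 0 <= killed <= total:
        raise ValueError("killed must be between 0 and total")
    return killed / total


# Hypothetical numbers for illustration only, not data from the study.
print(f"{mutation_score(42, 60):.0%}")  # prints "70%"
```

A higher score suggests the tests are more sensitive to behavioral changes in the code under test, which is why it is often preferred to plain coverage as a quality signal.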


Cite This Study

Silva et al. studied this question.

synapsesocial.com/papers/68e9b1c1ba7d64b6fc132194
https://doi.org/10.5753/sbes.2025.9618
