Automated tests are a key practice adopted by the software industry to verify software quality. However, they are costly to develop and maintain. Recently, the use of LLMs to generate automated tests has been explored as a viable alternative. Ongoing efforts focus on improving generation by providing richer context and by post-processing the output to correct errors and ensure accurate results. However, small-scale open LLMs, capable of running on modest hardware, have received limited attention. This work compares large-scale LLMs (e.g., GPT and Gemini) with small-scale open-source models in terms of the number of tests generated and their quality, measured by the mutation score, the cyclomatic complexity of the generated code, and the number of test smells it contains. We evaluated 12 small-scale models against 6 large-scale ones and used EvoSuite to establish a baseline for code quality and the number of methods tested. Our results show that some small-scale LLMs perform well in test generation tasks. xLan, Gemma2, and DeepSeekCoder gave the best overall results, producing as many tests as large-scale models, with fewer smells and a better mutation score.
E. C. Silva
Roberta Coelho
Lyrene Fernandes da Silva
Silva et al. studied this question.
www.synapsesocial.com/papers/68e9b1c1ba7d64b6fc132194 — DOI: https://doi.org/10.5753/sbes.2025.9618