March 3, 2026 | Open Access

Towards Automated Software Testing - Applying TextGrad to Test Suite Generation using LLMs

Key Points

LLMs can autonomously generate and refine comprehensive test suites for software systems, improving overall quality.
Results show that LIFT can achieve near-complete behavioral coverage and extend test suites effectively across iterations.
Evaluation of the test suite generation was performed on the Python library simplejson over multiple trials and iterations.
The findings highlight the potential and current limitations of LLM-driven quality assurance in software engineering.

Abstract

The rapid progress in Artificial Intelligence (AI) and particularly Large Language Models (LLMs) has fundamentally reshaped the field of Software Engineering (SE). Recent advances demonstrate that LLMs are not only capable of generating isolated code snippets but can perform coherent reasoning across software projects, including testing, validation, and maintenance. This thesis explores the potential of LLMs to autonomously generate, execute, and iteratively refine comprehensive test suites for real-world software systems. Building upon established SE principles and recent developments in LLM-based optimization frameworks such as TextGrad and TestART, the thesis introduces LIFT (LLM-based Iterative Feedback-driven Test suite generation) - a novel, automated, and feedback-driven approach to test suite generation. Within LIFT, specialized LLM agents act as generator, debugger, and evaluator components that cooperatively evolve test suites through textual gradients, self-assessment, and refinement loops. A case study on the Python library simplejson evaluates LIFT over multiple trials and iterations, analysing the evolution of test count, coverage, and mutation score. Results indicate that LLMs can consistently construct executable test suites, extend them meaningfully across iterations, and approach near-complete behavioral coverage when given sufficient computation power and context. These findings suggest that LLMs are capable of performing complex reasoning about software structure and behavior, moving beyond traditional unit test generation toward a holistic understanding of program correctness. 
Yet, to date, assertion quality still lags behind, highlighting the continued impact of the Oracle Problem within LLM-based test generation. The implemented framework is available at: Sarius32/LIFT. The thesis contributes an empirical foundation for autonomous, LLM-driven quality assurance and discusses implications for scalability, tool integration, and future research. By aligning generative AI with core SE practices, it highlights a promising path toward self-improving software testing systems that bridge human expertise and machine reasoning.
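To make the gap between coverage and assertion quality concrete, the following is a minimal, self-contained sketch of how a mutation score can separate two tests that execute the same code. It is not taken from LIFT or from the simplejson case study: the function `clamp`, the three hand-written mutants, and the helpers `weak_test`, `strong_test`, and `mutation_score` are all hypothetical names introduced for illustration, and a real study would typically rely on a mutation testing tool rather than hand-written mutants.

```python
from typing import Callable, List

# Hypothetical function under test (not part of simplejson or LIFT).
def clamp(x: int, lo: int, hi: int) -> int:
    return max(lo, min(x, hi))

# Hand-written mutants: small, deliberate faults injected into clamp.
def mutant_swapped(x: int, lo: int, hi: int) -> int:
    return min(lo, max(x, hi))          # min and max swapped

def mutant_no_upper(x: int, lo: int, hi: int) -> int:
    return max(lo, x)                   # upper bound dropped

def mutant_off_by_one(x: int, lo: int, hi: int) -> int:
    return max(lo, min(x, hi - 1))      # upper bound shifted by one

MUTANTS: List[Callable[[int, int, int], int]] = [
    mutant_swapped,
    mutant_no_upper,
    mutant_off_by_one,
]

def weak_test(f: Callable[[int, int, int], int]) -> bool:
    # Executes the function (full line coverage) but only checks the type,
    # so it acts as a very weak oracle.
    return isinstance(f(5, 0, 10), int)

def strong_test(f: Callable[[int, int, int], int]) -> bool:
    # Checks concrete expected values inside and at both bounds.
    return f(5, 0, 10) == 5 and f(-3, 0, 10) == 0 and f(42, 0, 10) == 10

def mutation_score(test, original, mutants) -> float:
    # A test must pass on the original before mutants are evaluated.
    assert test(original), "test fails on the unmodified code"
    killed = sum(1 for mutant in mutants if not test(mutant))
    return killed / len(mutants)

if __name__ == "__main__":
    print("weak test:  ", mutation_score(weak_test, clamp, MUTANTS))
    print("strong test:", mutation_score(strong_test, clamp, MUTANTS))
```

Both tests run every line of `clamp`, yet only the one with value-checking assertions detects the injected faults. This is one way to read the abstract's observation that high coverage does not by itself imply strong assertions.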

Cite This Study

Moritz Gärtner studied this question.

synapsesocial.com/papers/69a75b0fc6e9836116a21aad https://doi.org/10.5281/zenodo.18394393
