The rapid progress in Artificial Intelligence (AI) and particularly Large Language Models (LLMs) has fundamentally reshaped the field of Software Engineering (SE). Recent advances demonstrate that LLMs are not only capable of generating isolated code snippets but can also perform coherent reasoning across software projects, including testing, validation, and maintenance. This thesis explores the potential of LLMs to autonomously generate, execute, and iteratively refine comprehensive test suites for real-world software systems. Building upon established SE principles and recent developments in LLM-based optimization frameworks such as TextGrad and TestART, the thesis introduces LIFT (LLM-based Iterative Feedback-driven Test suite generation), a novel, automated, and feedback-driven approach to test suite generation. Within LIFT, specialized LLM agents act as generator, debugger, and evaluator components that cooperatively evolve test suites through textual gradients, self-assessment, and refinement loops. A case study on the Python library simplejson evaluates LIFT over multiple trials and iterations, analysing the evolution of test count, coverage, and mutation score. Results indicate that LLMs can consistently construct executable test suites, extend them meaningfully across iterations, and approach near-complete behavioral coverage when given sufficient computational power and context. These findings suggest that LLMs are capable of performing complex reasoning about software structure and behavior, moving beyond traditional unit test generation toward a holistic understanding of program correctness.
Yet, to date, assertion quality is not on par, highlighting the continued impact of the Oracle Problem within LLM-based test generation. The implemented framework is made available under: Sarius32/LIFT. The thesis contributes an empirical foundation for autonomous, LLM-driven quality assurance and discusses implications for scalability, tool integration, and future research. By aligning generative AI with core SE practices, it highlights a promising path toward self-improving software testing systems that bridge human expertise and machine reasoning.
Moritz Gärtner studied this question.
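
The generator, debugger, and evaluator roles described above can be illustrated with a minimal Python sketch of an iterative, feedback-driven refinement loop. This is not the LIFT implementation: the names `Evaluation`, `refine_test_suite`, the stub agents, the scalar `score`, and the stopping threshold are all hypothetical and chosen only for illustration. In an actual system each callable would wrap an LLM call and the evaluator would draw on measurements such as coverage and mutation score.

```python
from dataclasses import dataclass
from typing import Callable, List, Tuple


@dataclass
class Evaluation:
    """Hypothetical evaluator output: a quality score plus textual feedback."""
    score: float    # assumed to lie in [0, 1]
    feedback: str   # natural-language critique fed into the next iteration


# Hypothetical agent signatures; real agents would be LLM-backed.
Generator = Callable[[List[str], str], List[str]]
Debugger = Callable[[List[str]], List[str]]
Evaluator = Callable[[List[str]], Evaluation]


def refine_test_suite(
    generate: Generator,
    debug: Debugger,
    evaluate: Evaluator,
    max_iterations: int = 5,
    target_score: float = 0.95,
) -> Tuple[List[str], List[Evaluation]]:
    """Evolve a test suite until the score target or iteration budget is hit."""
    suite: List[str] = []
    feedback = ""
    history: List[Evaluation] = []
    for _ in range(max_iterations):
        suite = generate(suite, feedback)   # extend the suite using feedback
        suite = debug(suite)                # repair or drop failing tests
        result = evaluate(suite)            # assess the current suite
        history.append(result)
        if result.score >= target_score:
            break
        feedback = result.feedback          # textual signal for the next round
    return suite, history


# Deterministic stand-ins so the sketch runs without any LLM access.
def stub_generate(suite: List[str], feedback: str) -> List[str]:
    return suite + [f"test_case_{len(suite)}"]


def stub_debug(suite: List[str]) -> List[str]:
    return [test for test in suite if test]  # keeps every non-empty test


def stub_evaluate(suite: List[str]) -> Evaluation:
    score = min(1.0, len(suite) / 4)
    return Evaluation(score, f"{len(suite)} tests so far; cover more edge cases")


if __name__ == "__main__":
    final_suite, evaluations = refine_test_suite(
        stub_generate, stub_debug, stub_evaluate
    )
    print(final_suite)
    print([e.score for e in evaluations])
```

Passing the agents in as plain callables keeps the loop independent of any particular model or prompt format, so the same control flow can be exercised with deterministic stubs as shown here.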