What question did this study set out to answer?

This study evaluates the performance of large language models in generating mutants for mutation testing in software engineering.

March 30, 2026

A Comprehensive Study on Large Language Models for Mutation Testing

Key Points

Conducted empirical evaluation of traditional and LLM-based mutation testing approaches.
Assessed BugFarm, LLMorpheus, and a new prompt for mutant generation.
Analyzed 851 real bugs from two Java real-world bug benchmarks.
Compared detection rates and characteristics of mutants generated by LLMs versus rule-based techniques.
LLM-based approaches achieved a 1.8× improvement in real bug detection over rule-based techniques, reaching a detection rate of 77.4%.
Traditional rule-based techniques reached a detection rate of 41.6%.
LLM-generated mutants displayed greater behavioral diversity, but their non-compilability, duplication, and equivalent mutant rates were worse than those of rule-based mutants by 25.9, 7.1, and 2.6 percentage points, respectively.

Abstract

Large Language Models (LLMs) have recently been used to generate mutants both in the research community and in industrial practice. However, there has been no comprehensive empirical study of their performance for this increasingly important LLM-based Software Engineering application. To address this, we conduct a comprehensive empirical study evaluating both a wide range of traditional approaches and LLM-based approaches. Particularly, we evaluate BugFarm, LLMorpheus, and our newly designed prompt for mutant generation. The experiments cover both leading open- and closed-source LLMs, on 851 real bugs from two Java real-world bug benchmarks. Our results reveal that, compared to existing traditional approaches, LLMs generate more diverse mutants that are behaviorally closer to real bugs and, most importantly, achieve a 1.8× improvement in real bug detection, defined as the proportion of real bugs whose faulty behaviors can be mimicked by at least one generated mutant. Specifically, LLM-based approaches reach a detection rate of 77.4%, compared to 41.6% for rule-based techniques, representing an absolute gain of 35.8 percentage points. Nevertheless, our results also reveal that these impressive improvements in effectiveness come at a cost: LLM-generated mutants exhibit worse non-compilability, duplication, and equivalent mutant rates than rule-based approaches by 25.9, 7.1, and 2.6 percentage points, respectively. These findings provide actionable insights for both research and practice. They allow practitioners to have greater confidence in deploying LLM-based mutation, while researchers now have a baseline for the state-of-the-art, with which they can research techniques to further improve effectiveness and reduce cost.
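
The detection-rate definition in the abstract can be illustrated with a minimal sketch. It assumes each real bug has already been mapped to the set of generated mutants that mimic its faulty behavior; how that mapping is established is not described here, and the function name, bug IDs, and mutant IDs below are hypothetical, not taken from the study.

```python
from typing import Dict, Set


def real_bug_detection_rate(mimicking_mutants: Dict[str, Set[str]]) -> float:
    """Proportion of real bugs mimicked by at least one generated mutant.

    `mimicking_mutants` maps a bug ID to the IDs of the mutants that mimic
    that bug's faulty behavior (assumed to be computed elsewhere).
    """
    if not mimicking_mutants:
        raise ValueError("at least one real bug is required")
    detected = sum(1 for mutants in mimicking_mutants.values() if mutants)
    return detected / len(mimicking_mutants)


# Hypothetical toy data, for illustration only.
toy_bugs = {
    "bug-1": {"m3", "m7"},  # mimicked by two mutants
    "bug-2": set(),         # mimicked by none
    "bug-3": {"m1"},
    "bug-4": set(),
}
rate = real_bug_detection_rate(toy_bugs)  # 2 of 4 bugs are detected

# Absolute gain from the two rates reported in the abstract, in percentage points.
llm_rate, rule_based_rate = 77.4, 41.6
absolute_gain = round(llm_rate - rule_based_rate, 1)
```

A bug counts as detected as soon as one mutant mimics it, so generating additional mutants for an already detected bug does not raise the rate.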

Authors

Baolong Wang

China Agricultural University

Mingda Chen

Beijing Jiaotong University

Ming Deng

Journals

ACM Transactions on Software Engineering and Methodology

Institutions

University College London

King's College London

University of Luxembourg
