Large Language Models (LLMs) have recently been used to generate mutants both in the research community and in industrial practice. However, there has been no comprehensive empirical study of their performance for this increasingly important LLM-based Software Engineering application. To address this, we conduct a comprehensive empirical study evaluating a wide range of both traditional and LLM-based approaches. In particular, we evaluate BugFarm, LLMorpheus, and our newly designed prompt for mutant generation. The experiments cover leading open- and closed-source LLMs on 851 real bugs from two real-world Java bug benchmarks. Our results reveal that, compared to existing traditional approaches, LLMs generate more diverse mutants that are behaviorally closer to real bugs and, most importantly, achieve a 1.8× improvement in real bug detection, defined as the proportion of real bugs whose faulty behaviors can be mimicked by at least one generated mutant. Specifically, LLM-based approaches reach a detection rate of 77.4%, compared to 41.6% for rule-based techniques, an absolute gain of 35.8 percentage points. Nevertheless, our results also reveal that these improvements in effectiveness come at a cost: LLM-generated mutants exhibit worse non-compilability, duplication, and equivalent mutant rates than rule-based approaches, by 25.9, 7.1, and 2.6 percentage points, respectively. These findings provide actionable insights for both research and practice. They give practitioners greater confidence in deploying LLM-based mutation, while researchers now have a state-of-the-art baseline from which to develop techniques that further improve effectiveness and reduce cost.
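The detection metric defined in the abstract can be illustrated with a minimal sketch. This is not the authors' evaluation code: the function names, the dictionary layout, and the toy bug identifiers are all hypothetical, and only the two percentages in the final comparison come from the abstract.

```python
def detection_rate(mimicked_by_bug):
    """Proportion of real bugs mimicked by at least one generated mutant.

    `mimicked_by_bug` is a hypothetical mapping from a bug identifier to a
    list of booleans, one per mutant, saying whether that mutant reproduces
    the bug's faulty behavior.
    """
    if not mimicked_by_bug:
        raise ValueError("need at least one bug")
    detected = sum(1 for flags in mimicked_by_bug.values() if any(flags))
    return detected / len(mimicked_by_bug)


def compare(llm_rate, rule_rate):
    """Return (absolute gain in percentage points, relative improvement)."""
    return (llm_rate - rule_rate) * 100, llm_rate / rule_rate


# Toy data (made up): two of four bugs are mimicked by some mutant.
toy = {
    "bug-1": [False, True],
    "bug-2": [False, False],
    "bug-3": [True],
    "bug-4": [],
}
toy_rate = detection_rate(toy)

# Rates reported in the abstract: 77.4% (LLM-based) vs. 41.6% (rule-based).
gain_pp, ratio = compare(0.774, 0.416)
```

With the reported rates, the absolute gain works out to 35.8 percentage points and the ratio to roughly 1.86, in line with the improvement of about 1.8× stated in the abstract.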
Baolong Wang
China Agricultural University
Mingda Chen
Beijing Jiaotong University
Ming Deng
ACM Transactions on Software Engineering and Methodology
University College London
King's College London
University of Luxembourg
Wang et al. studied this question.
synapsesocial.com/papers/69ca1369883daed6ee095498 — DOI: https://doi.org/10.1145/3805038