What question did this study set out to answer?

The study aims to benchmark large language models for unit test generation against new datasets that reflect real-world complexity.

March 30, 2026

Benchmarking LLMs for Unit Test Generation from Real-World Functions

Key Points

The study aims to benchmark large language models for unit test generation against new datasets that reflect real-world complexity.
Developed the ULT benchmark, built from real-world Python functions selected for high cyclomatic complexity.
Introduced PLT, a paired benchmark with leaked tests, for controlled analysis of LLM memorization versus reasoning.
Conducted a large-scale empirical study of 12 state-of-the-art LLMs, comparing their performance against established benchmarks.
On ULT, LLM-generated tests averaged 41.32% accuracy, 45.10% statement coverage, 30.22% branch coverage, and a 40.21% mutation score, substantially lower than the corresponding results on TestEval and PLT.
Test generation performance on ULT correlated more strongly with code generation ability (0.79, p = 0.002) than on TestEval (0.56) or PLT (0.52), suggesting that ULT better measures the generalization ability of LLMs.
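
For readers unfamiliar with cyclomatic complexity, the metric counts the independent paths through a function: one, plus one for each decision point. The sketch below approximates it with Python's standard `ast` module. It is an illustration only, not the tooling the authors used to build ULT; the function name `cyclomatic_complexity`, the set of counted node types, and the `EXAMPLE` snippet are all assumptions made for this example.

```python
import ast

# Node types that each add one decision point (a simplification of McCabe's metric).
_BRANCH_NODES = (ast.If, ast.For, ast.AsyncFor, ast.While,
                 ast.ExceptHandler, ast.IfExp)


def cyclomatic_complexity(source: str) -> int:
    """Approximate cyclomatic complexity of a snippet: 1 + number of decision points."""
    tree = ast.parse(source)
    complexity = 1
    for node in ast.walk(tree):
        if isinstance(node, _BRANCH_NODES):
            complexity += 1
        elif isinstance(node, ast.BoolOp):
            # `a and b and c` adds two short-circuit branches.
            complexity += len(node.values) - 1
        elif isinstance(node, ast.comprehension):
            # One for the comprehension loop, plus one per filter condition.
            complexity += 1 + len(node.ifs)
    return complexity


# Hypothetical function used only to exercise the counter.
EXAMPLE = '''
def classify(n):
    if n < 0:
        return "negative"
    for d in range(2, n):
        if n % d == 0 and d > 1:
            return "composite"
    return "other"
'''

if __name__ == "__main__":
    print(cyclomatic_complexity(EXAMPLE))
```

A benchmark that filters for high values of such a score keeps functions with many branches, which is what makes full branch coverage harder for generated tests to reach.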

Abstract

Recently, large language models (LLMs) have shown great promise in automating unit test generation, significantly reducing the manual effort required by developers. To effectively evaluate the capabilities of LLMs in this domain, it is crucial to have a well-designed benchmark that accurately reflects real-world scenarios and mitigates common pitfalls. Existing LLM test generation benchmarks are limited by two critical drawbacks: data contamination and structurally simple function code. As a result, scientific conclusions drawn from empirical studies using these benchmarks may be compromised. The evidence presented may be biased due to contamination and may fail to generalize beyond toy programs due to structural simplicity. To address these problems, we introduce ULT (UnLeakedTestbench), a new benchmark specifically designed for function-level unit test generation from real-world Python functions. ULT is constructed through a multi-stage curation process that ensures high cyclomatic complexity and mitigates test case contamination. With 3,909 carefully selected function-level tasks, ULT provides a more realistic and challenging evaluation of LLMs’ test generation capabilities. We also provide PLT (PreLeakedTestbench), a pair benchmark of ULT with leaked tests designed to enable a controlled analysis of memorization versus reasoning in test generation. Based on the two datasets, we conduct a large-scale empirical study involving 12 state-of-the-art LLMs, comparing their performance against established benchmarks. Our evaluation results demonstrate that ULT is significantly more challenging. For example, test cases generated by LLMs only achieve 41.32%, 45.10%, 30.22%, and 40.21% for accuracy, statement coverage, branch coverage, and mutation score on average for all LLMs, respectively. These results are substantially lower than the corresponding metrics on TestEval (91.79%, 92.18%, 82.04%, and 49.69%) and PLT (47.07%, 55.13%, 40.07%, and 50.80%). 
In addition, different from existing benchmarks, ULT shows a strong correlation between test generation performance and code generation performance. For example, the correlation coefficient between the coding ability and test generation performance (Pass@1) on ULT is 0.79 (p = 0.002), while it is only 0.56 (p = 0.059) and 0.52 (p = 0.080) on TestEval and PLT, respectively. This indicates that ULT more effectively measures the generalization ability of LLMs. We also make ULT and evaluation results publicly available to foster further research. ULT is available at https://github.com/huangd1999/UnLeakedTestBench.
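
To make the correlation claim concrete, here is a minimal sketch of how a correlation coefficient between two per-model score lists could be computed. The abstract does not say which coefficient was used; Pearson's r is assumed here, and the helper name `pearson_r` and the input values are placeholders, not data from the study. The p-value would need a t-distribution, which this standard-library sketch leaves out.

```python
import math


def pearson_r(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    # Sum of co-deviations and of squared deviations from the means.
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant sequence")
    return cov / math.sqrt(var_x * var_y)


if __name__ == "__main__":
    # Placeholder scores (one entry per hypothetical model), not study results.
    code_generation_scores = [1.0, 2.0, 3.0]
    test_generation_scores = [1.0, 3.0, 2.0]
    print(pearson_r(code_generation_scores, test_generation_scores))
```

A value near 1 means models that rank higher on code generation also rank higher on test generation; the reported 0.79 on ULT versus 0.56 and 0.52 elsewhere is read in that sense.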
