What type of study is this?

This is a quantitative study.

October 16, 2025 | Open Access

Disproving Program Equivalence with LLMs

Key Points

ProbeGen disproves the equivalence of 18% of code samples that the benchmark's unit tests treat as equivalent to the ground truth, exposing gaps in those tests.
With execution feedback, large language models perform well at finding counterexamples to code equivalence.
Semantically clustering LLM samples with ProbeGen improves pass@1 by 10% by unifying syntactically distinct but semantically similar samples.
The results point to the need for stronger tests when evaluating and validating generated code.
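
To make the idea of a counterexample to equivalence concrete, here is a minimal Python sketch. It is not the authors' implementation: in ProbeGen an LLM proposes the probing inputs using execution feedback, whereas here a hand-written list of probes stands in for that step. The function names (`run`, `find_counterexample`, `reference_median`, `candidate_median`) and the example inputs are all hypothetical.

```python
from typing import Any, Callable, Iterable, Optional, Tuple

Outcome = Tuple[str, Any]


def run(fn: Callable[..., Any], args: Tuple[Any, ...]) -> Outcome:
    """Execute fn(*args) and record its return value or its exception type."""
    try:
        return ("ok", fn(*args))
    except Exception as exc:
        return ("error", type(exc).__name__)


def find_counterexample(
    f: Callable[..., Any],
    g: Callable[..., Any],
    probes: Iterable[Tuple[Any, ...]],
) -> Optional[Tuple[Any, ...]]:
    """Return the first probe on which f and g behave differently, else None."""
    for args in probes:
        if run(f, args) != run(g, args):
            return args
    return None


def reference_median(xs):
    """Hypothetical ground-truth solution."""
    s = sorted(xs)
    n = len(s)
    mid = n // 2
    return s[mid] if n % 2 else (s[mid - 1] + s[mid]) / 2


def candidate_median(xs):
    """Hypothetical model sample: ignores the even-length case."""
    s = sorted(xs)
    return s[len(s) // 2]


# Assumed benchmark tests: only odd-length lists, so both versions agree.
weak_tests = [([1, 2, 3],), ([5],)]
# Assumed probes: stand-ins for inputs an LLM might propose.
probes = [([1, 2, 3],), ([1, 2],), ([],)]

passes_weak_tests = find_counterexample(reference_median, candidate_median, weak_tests) is None
counterexample = find_counterexample(reference_median, candidate_median, probes)
```

With these assumed inputs the candidate passes the weak tests, yet the probe list still separates the two functions on an even-length list, which is the kind of gap the key points describe.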

Abstract

To evaluate large language models (LLMs) for code, research has used manually created unit test-based benchmarks. However, these tests are often inadequate, missing corner cases and other implementation-specific oddities. This work introduces ProbeGen, a whitebox method that takes two or more executable pieces of code and searches for counterexamples to their equivalence. Comparing code semantics requires a deep understanding of code. We demonstrate that LLMs with execution feedback perform well at this task. In a common code synthesis benchmark, ProbeGen disproves 18% of samples considered equivalent to the ground truth by the benchmark-provided unit tests. Additionally, using ProbeGen, we can semantically cluster LLM samples for semantic self-consistency, improving pass@1 by 10% by unifying syntactically distinct but semantically similar samples.
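
The semantic clustering step mentioned at the end of the abstract can be sketched as follows. This is an illustrative approximation under stated assumptions, not the paper's algorithm: samples are grouped by their observed behavior on a fixed, hand-written probe list (ProbeGen would search for distinguishing probes instead), and the representative of the largest group is returned. All names (`behavior_signature`, `cluster_by_behavior`, `pick_self_consistent`, the `abs_*` samples) are hypothetical.

```python
from collections import defaultdict
from typing import Any, Callable, Dict, List, Sequence, Tuple

Probe = Tuple[Any, ...]


def behavior_signature(fn: Callable[..., Any], probes: Sequence[Probe]) -> tuple:
    """Summarize fn by its outputs (or exception types) on every probe."""
    signature = []
    for args in probes:
        try:
            # repr() keeps the signature hashable even for list or dict outputs.
            signature.append(("ok", repr(fn(*args))))
        except Exception as exc:
            signature.append(("error", type(exc).__name__))
    return tuple(signature)


def cluster_by_behavior(
    samples: Sequence[Callable[..., Any]], probes: Sequence[Probe]
) -> List[List[Callable[..., Any]]]:
    """Group samples that no probe can tell apart."""
    clusters: Dict[tuple, List[Callable[..., Any]]] = defaultdict(list)
    for fn in samples:
        clusters[behavior_signature(fn, probes)].append(fn)
    return list(clusters.values())


def pick_self_consistent(
    samples: Sequence[Callable[..., Any]], probes: Sequence[Probe]
) -> Callable[..., Any]:
    """Return one representative of the largest behavioral cluster."""
    return max(cluster_by_behavior(samples, probes), key=len)[0]


# Three hypothetical samples for "absolute value"; two differ only in syntax.
def abs_branch(x):
    return x if x >= 0 else -x


def abs_max(x):
    return max(x, -x)


def abs_wrong(x):
    return x  # wrong for negative inputs


samples = [abs_wrong, abs_branch, abs_max]
probes = [(3,), (-2,), (0,)]  # assumed probes

clusters = cluster_by_behavior(samples, probes)
chosen = pick_self_consistent(samples, probes)
```

Here `abs_branch` and `abs_max` land in the same cluster despite their different source text, so the majority vote selects one of them over `abs_wrong`. Two samples in the same cluster are only indistinguishable on the probes tried, not proven equivalent.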

Cite this study

Allamanis et al. studied this question.

www.synapsesocial.com/papers/68f0f51d8dd8ea469b1d6e44 — DOI: https://doi.org/10.48550/arxiv.2502.18473

Authors

Miltiadis Allamanis

Pengcheng Yin
