Whether a counterexample is genuine or spurious fundamentally influences the effectiveness and completeness of loop invariant inference, which is a core component of automated program verification. However, reliably determining the validity of a counterexample remains a challenging task. In this paper, we present a systematic evaluation of large language models (LLMs) on this problem. We construct a benchmark of program states that serve as counterexamples, categorized into three representative types: (i) pre-states of inductive counterexamples derived from LLM-proposed invariants and (ii–iii) boundary states derived from correct inductive invariants, where the states themselves either violate (ii) or satisfy (iii) the program’s precondition. Ground-truth labels are established using a state-of-the-art program verifier. We evaluate multiple LLMs under diverse prompting strategies. Our results show that LLMs perform well on the first two types of counterexamples in the benchmark but poorly on the third. Moreover, LLMs are substantially more accurate in classifying spurious counterexamples than genuine ones. These findings offer valuable guidance for future research on LLM-assisted loop invariant inference.
Fan et al. (Tue,) studied this question.
Synapse has enriched 5 closely related papers on similar clinical questions. Consider them for comparative context: