What type of study is this?

This is a quantitative study.

September 29, 2025 | Open Access

VerifyThisBench: Generating Code, Specifications, and Proofs All at Once

Key Points

State-of-the-art (SOTA) models achieve a pass rate below 4% on VerifyThisBench, indicating that the benchmark poses significant challenges.
Evaluations reveal that many outputs fail to compile, demonstrating limitations in reasoning and verification.
VerifyThisBenchXS offers reduced task complexity by providing partial implementations or proofs for ease of assessment.
Systematic assessment of SOTA models on both benchmarks reveals strengths and limitations in their formal reasoning and verification capabilities.

Abstract

Large language models (LLMs) have demonstrated remarkable progress in code generation, but many existing benchmarks are approaching saturation and offer little guarantee on the trustworthiness of the generated programs, offering limited insight into deeper reasoning capabilities. We introduce VerifyThisBench, a new benchmark designed to evaluate LLMs on end-to-end program verification tasks that require interpreting natural language problem descriptions, formulating formal specifications, generating code, and constructing correctness proofs. Our evaluation reveals that even state-of-the-art (SOTA) models, such as o3-mini, achieve a pass rate of less than 4%, with many outputs failing to compile. To reduce task complexity, we further propose VerifyThisBenchXS, a variant in which partial implementations or proofs are provided. We systematically assess SOTA models on both benchmarks, uncovering key strengths and limitations in their formal reasoning and verification capabilities.
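
To make the task shape concrete, the following is a minimal sketch of what "code, specification, and proof" can look like for a toy problem. It is written in Lean 4 purely for illustration; the abstract does not say which verification tools the benchmark targets, and the names `myMax` and `myMax_ge_left` are hypothetical. The theorem statement plays the role of a (partial) specification, and the tactic script is the correctness proof that the checker must accept.

```lean
-- Code: the implementation to be verified (hypothetical toy example).
def myMax (a b : Nat) : Nat :=
  if a ≤ b then b else a

-- Specification (partial): the result is at least the first argument.
-- Proof: case split on the `if` condition.
theorem myMax_ge_left (a b : Nat) : a ≤ myMax a b := by
  unfold myMax
  split
  · assumption            -- case a ≤ b: the goal a ≤ b is the hypothesis
  · exact Nat.le_refl a   -- case ¬ a ≤ b: the goal reduces to a ≤ a
```

A full specification would also state that the result is at least `b` and equals one of the two inputs. Even in this toy case, an error in any one of the three artifacts makes the whole submission fail to check, which may help explain why end-to-end pass rates are so low.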
