REPOEXEC: Evaluate Code Generation with a Repository-Level Executable Benchmark

Abstract

The ability of CodeLLMs to generate executable and functionally correct code at the repository level remains largely unexplored. We introduce REPOEXEC, a novel benchmark for evaluating repository-level code generation that emphasizes executability and correctness. REPOEXEC provides an automated system that verifies requirements and incorporates a mechanism for dynamically generating high-coverage test cases to assess the functionality of generated code. Our work explores a controlled scenario in which developers specify the necessary code dependencies, challenging the model to integrate them accurately. Experiments show that while pretrained LLMs outperform instruction-tuned models in correctness, the latter excel at using the provided dependencies and at debugging. REPOEXEC aims to provide a comprehensive evaluation of code functionality and alignment with developer intent, paving the way for more reliable and applicable CodeLLMs in real-world scenarios.
