Code generation with Large Language Models (LLMs) has been extensively studied and has achieved remarkable progress. As a complement to code generation, test case generation is crucial for ensuring the quality and reliability of code. However, using LLMs as test case generators has been much less explored. Current research along this line primarily focuses on enhancing code generation with the help of LLM-generated test cases, while the performance of LLMs in test case generation alone has not been comprehensively examined. To bridge this gap, we conduct extensive experiments to study how well LLMs can generate high-quality test cases. We find that as problem difficulty increases, state-of-the-art LLMs struggle to generate correct test cases, largely due to their inherent limitations in computation and reasoning. To mitigate this issue, we further propose a multi-agent framework called TestChain that decouples the generation of test inputs and test outputs. Notably, TestChain uses a ReAct-format conversation chain in which the LLM interacts with a Python interpreter to produce more accurate test outputs. Our results indicate that TestChain outperforms the baseline by a large margin. In particular, in terms of test case accuracy, TestChain with GPT-4 as the backbone achieves a 13.84% improvement over the baseline on the LeetCode-hard dataset.
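The decoupling described in the abstract can be illustrated with a small sketch. This is not the authors' implementation: the abstract gives no prompts, agent names, or message format, so the `Action:` / `Observation:` / `Final Answer:` labels, the function names, and the `scripted_model` stand-in for an LLM are all assumptions made for illustration. The idea shown is only that the expected output of a test case is obtained by running code in a Python interpreter rather than by asking the model to compute it mentally.

```python
import contextlib
import io


def run_python(code: str) -> str:
    """Run a snippet and return what it printed (the 'Observation')."""
    buffer = io.StringIO()
    with contextlib.redirect_stdout(buffer):
        # Illustration only: real systems should sandbox model-written code.
        exec(code, {})
    return buffer.getvalue().strip()


def scripted_model(history: list[str]) -> str:
    """Hypothetical stand-in for an LLM call (assumption, not a real API).

    First turn: propose code to run. After an observation: give the answer.
    """
    observations = [t for t in history if t.startswith("Observation:")]
    if not observations:
        return "Action:\nnums = [3, 1, 2]\nprint(sorted(nums)[-1])"
    return "Final Answer: " + observations[-1].removeprefix("Observation:").strip()


def compute_expected_output(test_input: str, model=scripted_model,
                            max_turns: int = 4) -> str:
    """ReAct-style loop: the model acts, the interpreter observes."""
    history = [f"Test input: {test_input}"]
    for _ in range(max_turns):
        reply = model(history)
        history.append(reply)
        if reply.startswith("Final Answer:"):
            return reply.removeprefix("Final Answer:").strip()
        if reply.startswith("Action:"):
            code = reply.removeprefix("Action:").strip()
            history.append("Observation: " + run_python(code))
    raise RuntimeError("no final answer within the turn budget")


# The test input is produced separately; only the output is computed here.
expected = compute_expected_output("nums = [3, 1, 2]")
test_case = f"assert solution([3, 1, 2]) == {expected}"
```

In a real setting, `scripted_model` would be replaced by a call to the chosen LLM, and `run_python` by a sandboxed interpreter with time and resource limits.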
Kefan Li
Nanjing University of Chinese Medicine
Yuan Yuan
Hong Kong University of Science and Technology
Li et al. (Sat,) studied this question.
synapsesocial.com/papers/68e6e4f3b6db64358766014d — DOI: https://doi.org/10.48550/arxiv.2404.13340