Large Language Models (LLMs), such as GPT (OpenAI), Claude (Anthropic), and Llama (Meta), achieve strong performance on standard code generation benchmarks. However, it remains challenging to determine whether these models truly reason about code or simply memorize patterns from training data. In this work, we conduct an initial study of LLM robustness by introducing minor variations in code, such as changes in style, variable names, or structure, to investigate whether models generalize to semantically equivalent programs or rely on memorized solutions. Our goal is to systematically investigate LLM robustness through code paraphrasing, applying semantics-preserving transformations to widely used benchmarks (HumanEval, MBPP, QuixBugs, LiveCodeBench, and CodeLingua). In this paper, we outline the steps toward this goal: we first establish baseline model performance on the original datasets (RQ1), then evaluate the capabilities of LLMs for code paraphrasing (RQ2), and finally examine how different paraphrasing strategies affect model performance on the benchmark code generation tasks (RQ3). Our preliminary results show that larger models generally maintain higher correctness across paraphrased variants, with Qwen2.5-Coder models demonstrating the strongest robustness, whereas Llama models are more sensitive to paraphrasing and show larger drops in accuracy. These findings highlight the importance of evaluating robustness beyond standard benchmarks and provide initial steps toward designing more reliable evaluation methods for code generation models.
Machacek et al. studied this question.