February 11, 2026 | Open Access

Can LLMs Fool Themselves? A Preliminary Study on the Robustness of Code Generation with LLMs

Key Points

This work investigates whether Large Language Models (LLMs) reason about code or rely on memorization by testing their robustness against minor code variations.
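Establishes baseline LLM performance on the original code generation benchmarks (RQ1).
Examines how semantic-preserving transformations (changes in style, variable names, or structure) affect model performance on the benchmark tasks (RQ3).
Evaluates the capabilities of LLMs for code paraphrasing (RQ2), working with widely used benchmarks (HumanEval, MBPP, QuixBugs, LiveCodeBench, and CodeLingua).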
Larger LLMs generally maintain higher correctness across paraphrased code variants.
Qwen2.5-Coder models showed the strongest robustness in handling paraphrasing.
Llama models demonstrated sensitivity to paraphrasing, resulting in larger drops in accuracy.

Abstract

Large Language Models (LLMs), such as GPT (OpenAI), Claude (Anthropic), and Llama (Meta), achieve strong performance on standard code generation benchmarks. However, it remains challenging to determine whether these models truly reason about code or simply memorize patterns from training data. In this work, we conduct an initial study of LLM robustness by introducing minor variations in code—such as changes in style, variable names, or structure—to investigate whether models generalize to semantically equivalent programs or rely on memorized solutions. Our goal is to systematically investigate LLM robustness through code paraphrasing, applying semantic-preserving transformations to widely used benchmarks (HumanEval, MBPP, QuixBugs, LiveCodeBench, and CodeLingua). In this paper, we outline the steps towards reaching this goal: We first establish baseline model performance on the original datasets (RQ1), then evaluate the capabilities of LLMs for code paraphrasing (RQ2), and finally, we examine how different paraphrasing strategies affect model performance on the benchmark code generation tasks (RQ3). Our preliminary results show that larger models generally maintain higher correctness across paraphrased variants, with Qwen2.5-Coder models demonstrating the strongest robustness. Llama models are more sensitive to paraphrasing, showing larger drops in accuracy. These findings highlight the importance of evaluating robustness beyond standard benchmarks and provide initial steps for designing more reliable evaluation methods for code generation models.
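
To make "semantic-preserving transformation" concrete, here is a minimal Python sketch of one such paraphrase, variable renaming, followed by a check that the original and renamed programs behave identically. It is an illustration only, not the paper's pipeline: the function `running_max`, the helper names `paraphrase` and `load`, and the rename mapping are all hypothetical, and the sketch assumes Python 3.9+ for `ast.unparse`.

```python
import ast


class RenameIdentifiers(ast.NodeTransformer):
    """Rename variables and parameters according to a fixed mapping."""

    def __init__(self, mapping):
        self.mapping = mapping

    def visit_Name(self, node):
        # Covers variable reads and writes in the function body.
        node.id = self.mapping.get(node.id, node.id)
        return node

    def visit_arg(self, node):
        # Covers function parameters.
        node.arg = self.mapping.get(node.arg, node.arg)
        return node


def paraphrase(source, mapping):
    """Return source with identifiers renamed; behavior should be unchanged."""
    tree = RenameIdentifiers(mapping).visit(ast.parse(source))
    return ast.unparse(tree)  # requires Python 3.9+


def load(source, name):
    """Execute source in a fresh namespace and return the named function."""
    namespace = {}
    exec(source, namespace)
    return namespace[name]


# Hypothetical benchmark-style solution (not taken from any dataset).
ORIGINAL = '''
def running_max(numbers):
    result = []
    best = None
    for n in numbers:
        if best is None or n > best:
            best = n
        result.append(best)
    return result
'''

# Paraphrased variant: same logic, different local names.
VARIANT = paraphrase(
    ORIGINAL,
    {"numbers": "xs", "result": "out", "best": "cur", "n": "x"},
)

if __name__ == "__main__":
    print(VARIANT)
    f, g = load(ORIGINAL, "running_max"), load(VARIANT, "running_max")
    sample = [1, 3, 2, 5, 4]
    print(f(sample) == g(sample))
```

A robustness study in this spirit would present a model with the variant instead of the original and compare correctness on the same tests. Renaming is only the simplest case: the sketch does not handle callers that pass renamed parameters by keyword, and it does not cover the style or structural changes the abstract also mentions.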
