What question did this study set out to answer?

To investigate how well large language models can identify semantically equivalent and inequivalent programs.

March 29, 2026 · Open Access

Understanding code semantics: a benchmark study of LLMs

Key Points

Investigated how well large language models can identify semantically equivalent and inequivalent programs.
Used a benchmark of 11 Python functions with equivalent and non-equivalent variants.
Evaluated seven state-of-the-art LLMs under zero-shot prompting conditions.
Perturbed the program text with semantics-preserving code transformations, namely copy propagation and constant folding.
Models misclassified 41% of equivalent cases without context and 29% with context.
Prompting improved performance but did not address the models' underlying limitations.
Targeted fine-tuning and contrastive learning were suggested as pathways to improve understanding.
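
As a rough illustration of the two transformations named above, the sketch below shows a toy function, a variant produced by copy propagation and constant folding, and a look-alike variant that is not equivalent. The functions and the `agree_on_samples` helper are hypothetical and are not taken from the study's benchmark. Note that sampling inputs can only refute equivalence; agreement on samples is evidence, not proof.

```python
import random


def original(values):
    total = sum(values)
    t = total            # plain copy of `total`
    scale = 2 * 3 + 4    # constant expression, evaluates to 10
    return t * scale


def transformed(values):
    # Copy propagation: `t` replaced by `total`.
    # Constant folding: `2 * 3 + 4` replaced by `10`.
    total = sum(values)
    return total * 10


def non_equivalent(values):
    # Looks similar, but the constant was folded with the wrong
    # precedence, as if it were 2 * (3 + 4).
    total = sum(values)
    return total * 14


def agree_on_samples(f, g, trials=200, seed=0):
    """Return True if f and g agree on every sampled input (hypothetical helper)."""
    rng = random.Random(seed)
    samples = [[1, 2, 3]]  # fixed case with a non-zero sum
    for _ in range(trials):
        n = rng.randint(0, 8)
        samples.append([rng.randint(-50, 50) for _ in range(n)])
    return all(f(s) == g(s) for s in samples)


print(agree_on_samples(original, transformed))     # True
print(agree_on_samples(original, non_equivalent))  # False
```

The classification task in the study asks a model to reach the same verdicts from the program text alone, without running the code.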

Abstract

We present an empirical study on the ability of Large Language Models (LLMs) to understand code by detecting semantically equivalent and inequivalent programs, that is, whether they compute the same result given the same input or not. To probe this, we deliberately perturb the program text by introducing semantics-preserving code transformations, namely copy propagation and constant folding. Using a benchmark of 11 Python functions with both equivalent and non-equivalent variants, we evaluate seven state-of-the-art LLMs (including ChatGPT, Claude, Gemini, and DeepSeek) under zero-shot prompting, with and without minimal context. Despite strong performance in code generation tasks, the models often fail in this deeper reasoning challenge, misclassifying 41% of equivalent cases without context and 29% with context. Although prompting can improve performance, it does not address the underlying limitations of the models. We argue that improving LLMs themselves, through targeted fine-tuning, contrastive learning on equivalent and non-equivalent implementations, or training on transformation-invariant code, will be necessary for robust semantic understanding. Meanwhile, practitioners can achieve better results by selecting stronger models, carefully engineering prompts, or writing code with tools that normalize low-level differences before inference.
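
One way to picture the last suggestion, normalizing low-level differences before inference, is a small constant-folding pass over the program's syntax tree. The sketch below uses Python's standard `ast` module; the `ConstantFolder` class and `normalize` function are illustrative names, not tools described in the paper. It assumes Python 3.9 or later (for `ast.unparse`), folds only integer `+`, `-`, and `*`, and does not attempt copy propagation.

```python
import ast
import operator

# Integer operators this sketch is willing to fold.
_FOLDABLE = {
    ast.Add: operator.add,
    ast.Sub: operator.sub,
    ast.Mult: operator.mul,
}


class ConstantFolder(ast.NodeTransformer):
    """Replace integer-only binary expressions with their computed value."""

    def visit_BinOp(self, node):
        self.generic_visit(node)  # fold nested expressions first
        op = _FOLDABLE.get(type(node.op))
        if (
            op is not None
            and isinstance(node.left, ast.Constant)
            and isinstance(node.right, ast.Constant)
            and type(node.left.value) is int
            and type(node.right.value) is int
        ):
            folded = ast.Constant(value=op(node.left.value, node.right.value))
            return ast.copy_location(folded, node)
        return node


def normalize(source):
    """Return the source text with foldable integer constants folded."""
    tree = ConstantFolder().visit(ast.parse(source))
    ast.fix_missing_locations(tree)
    return ast.unparse(tree)


variant_a = "def f(v):\n    return sum(v) * (2 * 3 + 4)\n"
variant_b = "def f(v):\n    return sum(v) * 10\n"

# After normalization the two variants have identical text.
print(normalize(variant_a) == normalize(variant_b))  # True
```

Identical text after normalization is a sufficient signal of equivalence for this narrow class of differences, not a necessary one: two programs can still be equivalent while normalizing to different text.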
