Abstract. We present an empirical study of how well Large Language Models (LLMs) understand code, measured by their ability to distinguish semantically equivalent programs from inequivalent ones, that is, to decide whether two programs compute the same result given the same input. To probe this, we deliberately perturb the program text with semantics-preserving code transformations, namely copy propagation and constant folding. Using a benchmark of 11 Python functions with both equivalent and non-equivalent variants, we evaluate seven state-of-the-art LLMs (including ChatGPT, Claude, Gemini, and DeepSeek) under zero-shot prompting, with and without minimal context. Despite strong performance in code generation tasks, the models often fail at this deeper reasoning challenge, misclassifying 41% of equivalent cases without context and 29% with context. Although prompting can improve performance, it does not address the underlying limitations of the models. We argue that improving the LLMs themselves, through targeted fine-tuning, contrastive learning on equivalent and non-equivalent implementations, or training on transformation-invariant code, will be necessary for robust semantic understanding. Meanwhile, practitioners can achieve better results by selecting stronger models, carefully engineering prompts, or writing code with tools that normalize low-level differences before inference.
Laneve et al. (Fri,) studied this question.
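To make the two transformations named in the abstract concrete, the following minimal sketch shows a function, a variant obtained by copy propagation and constant folding, and a deliberately non-equivalent variant. The functions and the `agree_on_samples` helper are hypothetical illustrations and are not taken from the benchmark; the helper only tests agreement on sampled inputs, which is evidence of equivalence rather than a proof.

```python
import random


def original(x):
    # Baseline: contains a copy (y = x) and a constant expression (2 * 3).
    y = x
    k = 2 * 3
    return y * k + 1


def equivalent(x):
    # Copy propagation replaces y with x; constant folding replaces 2 * 3 with 6.
    return x * 6 + 1


def non_equivalent(x):
    # Looks similar, but the constant is wrong (5 instead of 6).
    return x * 5 + 1


def agree_on_samples(f, g, trials=1000, seed=0):
    """Return True if f and g agree on a few fixed and many random integers.

    Agreement on samples does not prove semantic equivalence; a single
    disagreement, however, proves inequivalence.
    """
    rng = random.Random(seed)
    samples = [-1, 0, 1] + [rng.randint(-10**6, 10**6) for _ in range(trials)]
    return all(f(v) == g(v) for v in samples)


if __name__ == "__main__":
    print(agree_on_samples(original, equivalent))
    print(agree_on_samples(original, non_equivalent))
```

A model that truly understands the code should judge `original` and `equivalent` to be the same function, and should spot that `non_equivalent` differs for every non-zero input.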