What question did this study set out to answer?

This research aims to evaluate the effectiveness of various deep learning models in foundational program analysis tasks, specifically alias and equivalence prediction.

March 19, 2026

Evaluating the Effectiveness of Deep Learning Models for Foundational Program Analysis Tasks

Key Points

Evaluated four models of code: CuBERT, CodeBERT, GGNN, and Graph Sandwiches.
Also evaluated four large language models: GPT3.5, GPT-4o mini, Qwen2.5-Coder, and DeepSeek Coder.
Tested all models on two tasks using the CodeSem dataset: alias prediction (whether two pointers must alias, may alias, or must not alias) and equivalence prediction (whether two programs are semantically equivalent).
Analyzed the results of all models in both tasks in depth.
All models were accurate in both the alias and equivalence prediction tasks.
Deep learning models are generally capable of foundational program analysis tasks, although their weaknesses are evident in specific cases.
The findings indicate the potential of deep learning for programming language tasks.

Abstract

While deep neural networks provide state-of-the-art solutions to a wide range of programming language tasks, their effectiveness in dealing with foundational program analysis tasks remains underexplored. In this paper, we present an empirical study that evaluates four prominent models of code (i.e., CuBERT, CodeBERT, GGNN, and Graph Sandwiches), plus four popular large language models (i.e., GPT3.5, GPT-4o mini, Qwen2.5-Coder, and DeepSeek Coder), in two such foundational tasks: (1) alias prediction, in which models predict whether two pointers must alias, may alias, or must not alias; and (2) equivalence prediction, in which models predict whether or not two programs are semantically equivalent. At the core of this study is CodeSem, a dataset built upon the source code of real-world flagship software (e.g., Linux Kernel, GCC, MySQL) and manually validated for the two prediction tasks. Results show that all models are accurate in both prediction tasks. We also conduct a comprehensive, in-depth analysis of the results of all models in both tasks, concluding that deep learning models are generally capable of performing foundational tasks in program analysis even though in specific cases their weaknesses are also evident. Our code and evaluation data are publicly available at https://github.com/CodeSemDataset/CodeSem.
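To make the two tasks in the abstract concrete, the following is a minimal illustrative sketch in C. It is not taken from CodeSem or from the paper; all function names (must_alias_demo, may_alias_demo, no_alias_demo, sum_loop, sum_recursive) are hypothetical and chosen only for this example. The first three functions show pointer pairs that would carry the labels must alias, may alias, and must not alias. The last two are syntactically different programs that compute the same result, which is the kind of pair an equivalence predictor would be expected to label as semantically equivalent.

```c
#include <stddef.h>

/* Must alias: q is copied from p on every path, so both pointers
   always refer to the same object. */
int must_alias_demo(void) {
    int x = 1;
    int *p = &x;
    int *q = p;
    *q = 42;       /* the write through q is visible through p */
    return *p;
}

/* May alias: whether q refers to x depends on a runtime value,
   so p and q alias on some paths but not on others. */
int may_alias_demo(int flag) {
    int x = 1, y = 2;
    int *p = &x;
    int *q = flag ? &x : &y;
    *q = 42;       /* changes x only when flag is nonzero */
    return *p;
}

/* Must not alias: p and q refer to distinct objects on every path. */
int no_alias_demo(void) {
    int x = 1, y = 2;
    int *p = &x;
    int *q = &y;
    *q = 42;       /* never changes x */
    return *p;
}

/* Equivalence: an iterative sum and a recursive sum. Unsigned
   arithmetic wraps modulo 2^N, so the order of additions does not
   affect the result and the two functions agree on every input. */
unsigned sum_loop(const unsigned *a, size_t n) {
    unsigned s = 0;
    for (size_t i = 0; i < n; i++) {
        s += a[i];
    }
    return s;
}

unsigned sum_recursive(const unsigned *a, size_t n) {
    return n == 0 ? 0u : a[n - 1] + sum_recursive(a, n - 1);
}
```

In this sketch the alias label is a property of the pointer pair (p, q) at the point of the write through q. The returned value only makes the effect observable: it changes exactly when the write through q reaches the object p points to.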

Authors

Quan Chen

Rouyi Chen

Gang Yu

Journals

ACM Transactions on Software Engineering and Methodology

Institutions

Nanjing University
