What does this research mean for the field?

Contrary to common assumptions, models trained with Token-based code representation consistently outperform those trained with AST-based code representation overall across multiple code-related tasks, although AST-based models excel in specific sample subsets.

What question did this study set out to answer?

This research aims to systematically evaluate the effectiveness of different AST methods in code representation tasks.

June 8, 2026

Abstract Syntax Tree for Programming Language Understanding and Representation

Key Points

This research aims to systematically evaluate the effectiveness of different AST methods in code representation tasks.
Conducted experiments comparing four AST parsing methods, six AST preprocessing methods, and four AST encoding methods.
Evaluated the impact of these methods on three tasks: code clone detection, code search, and code summarization.
Assessed the performance of models using AST-based versus token-based code representations.
Models using token-based representations outperformed those using AST-based representations overall across all three tasks.
AST-based models outperformed token-based models on specific subsets of samples, which make up as much as 39% of all samples in code summarization and 28% in code search.
The impact of AST methods varies by stage and by task, which calls for further research on context-specific AST-based code representation learning.
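The subset figures above are shares of samples, not performance gains. As a rough illustration of how such a share could be computed from per-sample scores, here is a minimal Python sketch. The function name `ast_win_share`, the strict "higher score wins" rule, and the toy scores are all assumptions made for illustration; they are not taken from the study, which may define the subsets differently.

```python
def ast_win_share(token_scores, ast_scores):
    """Return the fraction of samples where the AST-based model scores
    strictly higher than the Token-based model (hypothetical criterion)."""
    if len(token_scores) != len(ast_scores):
        raise ValueError("score lists must have the same length")
    if not token_scores:
        raise ValueError("score lists must not be empty")
    wins = sum(a > t for t, a in zip(token_scores, ast_scores))
    return wins / len(token_scores)


# Hypothetical per-sample scores, NOT results from the study.
token_scores = [0.62, 0.48, 0.71, 0.55, 0.80]
ast_scores = [0.58, 0.53, 0.69, 0.61, 0.74]

share = ast_win_share(token_scores, ast_scores)
print(f"AST-based model wins on {share:.0%} of samples")
```

With these toy numbers the AST-based model wins on two of five samples, even though its scores are lower on the other three. This is the same pattern the key points describe: lower overall, but better on a sizable subset.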

Abstract

Programming language understanding and representation (a.k.a. code representation learning) has always been a hot and challenging task in the field of software engineering. It aims to apply deep learning techniques to produce numerical representations of the source code features while preserving its semantics. These representations can facilitate subsequent code-related tasks, e.g., code summarization. The abstract syntax tree (AST), a fundamental code feature, illustrates the syntactic information of the source code and has been widely used in code representation learning. It is commonly acknowledged that AST-based code representation is critical to solving code-related tasks. However, there is still a lack of systematic and quantitative evaluation of how well AST-based code representation facilitates subsequent code-related tasks. Additionally, learning an AST-based code representation is an extremely complex endeavor involving three intertwining stages, including AST parsing, AST preprocessing, and AST encoding. The solutions available in each stage are diverse. There is currently a lack of guidance on selecting solutions at each stage to get the most out of AST. In this paper, we first conduct comprehensive experiments to reveal the impact of the choice of AST parsing/preprocessing/encoding methods on AST-based code representation across three popular types of code-related tasks, including code clone detection, code search, and code summarization. The experiments involve four AST parsing methods, six AST preprocessing methods, and four AST encoding methods, all of which are widely utilized in existing AST-based code representation research. The experimental results showcase that the impact of different methods at different stages varies for different code-related tasks. Based on these, we further explore the practical influence of the AST-based code representation in facilitating follow-up code-related tasks.
To do so, we compare the performance of models trained with code token sequence (Token for short) based code representation and AST-based code representation on the three code-related tasks. Surprisingly, the overall quantitative statistical results demonstrate that models trained with Token-based code representation consistently perform better across all three tasks compared to models trained with AST-based code representation. Our further quantitative analysis reveals that models trained with AST-based code representation outperform models trained with Token-based code representation in certain subsets of samples across all three tasks. For instance, in the code summarization task, such samples constitute as much as 39% of the total, while in the code search task, they account for 28%. As ASTs are now being used in practice under various contexts (a.k.a., code-related tasks), the results in this paper call for more research on context-specific AST-based code representation learning in the future. Our study provides future researchers with detailed guidance on how to select solutions at each stage to fully exploit AST.
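To make the two kinds of input concrete, the sketch below derives a token sequence and a linearized AST from the same snippet using only Python's standard library. It is an illustration under stated assumptions, not the pipeline used in the study: Python's built-in `ast` module stands in for an AST parser, a pre-order traversal of node type names stands in for a preprocessing step, and the helper names `token_sequence` and `ast_node_sequence` are hypothetical. The encoding stage (a neural model over either sequence) is left out.

```python
import ast
import io
import tokenize

SOURCE = "def add(a, b):\n    return a + b\n"


def token_sequence(source):
    """Token-based view: lexical tokens in source order, layout tokens dropped."""
    skip = {
        tokenize.NEWLINE, tokenize.NL, tokenize.INDENT,
        tokenize.DEDENT, tokenize.ENDMARKER, tokenize.COMMENT,
    }
    tokens = tokenize.generate_tokens(io.StringIO(source).readline)
    return [tok.string for tok in tokens if tok.type not in skip]


def ast_node_sequence(source):
    """AST-based view: node type names in pre-order (one simple linearization)."""
    def visit(node):
        yield type(node).__name__
        for child in ast.iter_child_nodes(node):
            yield from visit(child)

    return list(visit(ast.parse(source)))


print(token_sequence(SOURCE))
print(ast_node_sequence(SOURCE))
```

The token sequence keeps identifiers and punctuation exactly as written, while the node sequence exposes syntactic structure such as `FunctionDef`, `Return`, and `BinOp` but, in this simple linearization, drops identifier names. Which information survives is the kind of trade-off that the choice of parsing and preprocessing methods controls.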
