Programming language understanding and representation (a.k.a. code representation learning) has long been a popular and challenging topic in software engineering. It aims to apply deep learning techniques to produce numerical representations of source code features while preserving the code's semantics. These representations can facilitate subsequent code-related tasks, e.g., code summarization. The abstract syntax tree (AST), a fundamental code feature, captures the syntactic information of source code and has been widely used in code representation learning. It is commonly acknowledged that AST-based code representation is critical to solving code-related tasks. However, there is still a lack of systematic and quantitative evaluation of how well AST-based code representation facilitates subsequent code-related tasks. Additionally, learning an AST-based code representation is a complex endeavor involving three intertwined stages: AST parsing, AST preprocessing, and AST encoding. The solutions available at each stage are diverse, and there is currently little guidance on selecting solutions at each stage to get the most out of the AST. In this paper, we first conduct comprehensive experiments to reveal the impact of the choice of AST parsing/preprocessing/encoding methods on AST-based code representation across three popular types of code-related tasks: code clone detection, code search, and code summarization. The experiments involve four AST parsing methods, six AST preprocessing methods, and four AST encoding methods, all of which are widely used in existing AST-based code representation research. The experimental results show that the impact of different methods at different stages varies across code-related tasks. Based on these results, we further explore the practical influence of AST-based code representation in facilitating follow-up code-related tasks.
To do so, we compare the performance of models trained with code token sequence (Token for short) based code representation and AST-based code representation on the three code-related tasks. Surprisingly, the overall quantitative results demonstrate that models trained with Token-based code representation consistently perform better across all three tasks than models trained with AST-based code representation. However, our further quantitative analysis reveals that models trained with AST-based code representation outperform models trained with Token-based code representation on certain subsets of samples in all three tasks. For instance, in the code summarization task, such samples constitute as much as 39% of the total, while in the code search task, they account for 28%. As ASTs are now being used in practice in various contexts (i.e., code-related tasks), the results in this paper call for more research on context-specific AST-based code representation learning. Our study provides future researchers with detailed guidance on how to select solutions at each stage to fully exploit the AST.
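To make the contrast between the two input forms concrete, the following minimal sketch derives a Token view and a linearized AST view from the same snippet. It uses Python's standard `tokenize` and `ast` modules purely for illustration; the function names `token_sequence` and `ast_preorder` are hypothetical, and the pre-order traversal is only one simple stand-in for a preprocessing step. The parsers, preprocessing methods, and encoders actually evaluated in the study are not specified here.

```python
import ast
import io
import tokenize

# Example input; any small, syntactically valid Python snippet would do.
SNIPPET = "def add(a, b):\n    return a + b\n"

# Layout-only token types that carry no lexical content of their own.
_LAYOUT = {
    tokenize.NEWLINE, tokenize.NL, tokenize.INDENT,
    tokenize.DEDENT, tokenize.ENDMARKER, tokenize.COMMENT,
}


def token_sequence(source):
    """Token view: lexical tokens in source order, layout tokens dropped."""
    readline = io.StringIO(source).readline
    return [
        tok.string
        for tok in tokenize.generate_tokens(readline)
        if tok.type not in _LAYOUT
    ]


def ast_preorder(source):
    """AST view: node type names in pre-order (one simple linearization)."""
    def visit(node):
        yield type(node).__name__
        for child in ast.iter_child_nodes(node):
            yield from visit(child)

    return list(visit(ast.parse(source)))


if __name__ == "__main__":
    print(token_sequence(SNIPPET))
    print(ast_preorder(SNIPPET))
```

In this sketch the Token view keeps identifiers and operators exactly as written, while the AST view exposes structural node types such as `FunctionDef` and `BinOp` that never appear literally in the source text. Either sequence could then be passed to an encoder to obtain a numerical representation.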
Sun et al. (Sat,) studied this question.