March 3, 2026Open Access

BMGANet: A deep learning model for source code vulnerability detection by integrating token-level and function-level features

Key Points

BMGANet achieves enhanced vulnerability detection accuracy through combined token-level and function-level features.
The model utilizes program dependence graphs and abstract syntax trees to systematically extract features for analysis.
Experimental results demonstrate that BMGANet consistently outperforms existing state-of-the-art methods on various datasets.
The approach highlights the potential of deep learning techniques in addressing semantic gaps in source code analysis.

Abstract

Deep learning is widely used in vulnerability detection due to its high accuracy. However, existing models often fail to capture both token-level and function-level features. To address this limitation, a BERT-based Multi-Granularity Attention Network (BMGANet) is proposed. In the BMGANet model, Program Dependence Graphs (PDGs) are first constructed using the Joern tool, and Abstract Syntax Trees (ASTs) are extracted according to predefined vulnerability rules. Cross-user-defined-function program slicing and code normalization are then applied to enhance analysis efficiency. Processed code slices are fed into a BERT network to extract initial token-level and function-level features. To overcome BERT’s limitation in modeling temporal dependencies, an LSTM network and a multi-head attention mechanism are sequentially employed to refine token-level features. The refined token-level features are then fused with function-level features for accurate vulnerability detection. Two pretraining tasks, namely the dynamic masked token prediction and the inter-code-line logical correlation prediction, are introduced to strengthen the model’s ability to handle semantic gaps and weak logical connections. Experimental results on both synthetic and real-world datasets show that BMGANet outperforms state-of-the-art methods.

BMGANet: A deep learning model for source code vulnerability detection by integrating token-level and function-level features

Key Points

Abstract

Cite This Study