August 7, 2025 · Open Access

Text adversarial attacks using policy gradients against deep learning classifiers

Key Points

The model generates adversarial text samples using a Seq2Seq encoder to map discrete data into hidden space vectors.
Rewards are computed from the cosine similarity or BERT-based semantic similarity between the original and adversarial texts, so that perturbations preserve meaning.
On the DBpedia dataset, the BERT-based method reduces classification accuracy by 51.77% while maintaining a semantic similarity above 0.8.
The cosine similarity-based method demonstrates improved efficiency, requiring one-third to one-half the runtime of the baseline approach.
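
As a rough illustration of a similarity-based reward, the sketch below computes cosine similarity between two texts and turns it into a reward signal. It is not the paper's implementation: the bag-of-words vectors stand in for whatever representation the authors use, and the function names, the `fooled` flag, and the gating on a 0.8 threshold are assumptions made for this example.

```python
import math
from collections import Counter


def cosine_similarity(text_a: str, text_b: str) -> float:
    """Cosine similarity between bag-of-words count vectors of two texts.

    Assumption: whitespace tokenization and raw counts, used here only as a
    simple stand-in for learned sentence representations.
    """
    counts_a = Counter(text_a.lower().split())
    counts_b = Counter(text_b.lower().split())
    shared = counts_a.keys() & counts_b.keys()
    dot = sum(counts_a[w] * counts_b[w] for w in shared)
    norm_a = math.sqrt(sum(c * c for c in counts_a.values()))
    norm_b = math.sqrt(sum(c * c for c in counts_b.values()))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def semantic_reward(original: str, adversarial: str, fooled: bool,
                    min_similarity: float = 0.8) -> float:
    """Hypothetical reward: the similarity score, granted only when the
    classifier was fooled and the similarity stays at or above the threshold."""
    similarity = cosine_similarity(original, adversarial)
    if not fooled or similarity < min_similarity:
        return 0.0
    return similarity


sim = cosine_similarity("the film was great", "the film was good")
reward = semantic_reward("the film was great", "the film was good", fooled=True)
```

Here three of the four words are shared, so the similarity is 0.75 and the example reward is 0.0 because it falls below the 0.8 threshold. A BERT-based variant would swap the count vectors for sentence embeddings, at a higher computational cost.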

Abstract

Text data is widely used in natural language processing, but such applications are vulnerable to adversarial attacks. Existing approaches artificially add semantically meaningless word-, character-, or sentence-level perturbations, which compromise the syntax and consistency of the text and therefore fail to ensure high-quality outputs. We propose an attack model that generates adversarial samples using policy gradients and a generative adversarial network. First, a Seq2Seq encoder is used to generate sentences: it maps discrete text data into continuous hidden space vectors, which are then transformed into adversarial text samples. Second, to emphasize semantics, the reward is calculated from the cosine similarity or BERT-based semantic similarity between the original and adversarial texts. Finally, a policy gradient is applied to optimize the parameters. Experiments show that, while maintaining a semantic similarity above 0.8, our BERT-based method reduces classification accuracy by 51.77% on the DBpedia dataset. Our cosine similarity-based method requires only one-third to one-half the runtime of the baseline approach.
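
To make the policy gradient step concrete, here is a minimal REINFORCE sketch on a toy problem. It is an illustration under stated assumptions, not the authors' model: the policy is a plain softmax over three hypothetical candidate rewrites rather than a Seq2Seq generator, the reward values are invented for the example, and a fixed baseline is subtracted to reduce variance.

```python
import math
import random


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def reinforce_step(logits, reward_fn, lr=0.5, baseline=0.0, rng=random):
    """One REINFORCE update for a categorical policy over discrete actions."""
    probs = softmax(logits)
    # Sample an action from the current policy.
    action = rng.choices(range(len(logits)), weights=probs, k=1)[0]
    advantage = reward_fn(action) - baseline
    # Gradient of log pi(action) w.r.t. logit j is 1[j == action] - probs[j].
    return [
        logit + lr * advantage * ((1.0 if j == action else 0.0) - probs[j])
        for j, logit in enumerate(logits)
    ]


# Toy setup (hypothetical numbers): each action is a candidate adversarial
# rewrite, and its reward stands in for a similarity-based attack reward.
rewards = [0.0, 0.2, 0.9]
baseline = 0.4  # assumed fixed baseline, lying between the rewards

rng = random.Random(0)
logits = [0.0, 0.0, 0.0]
for _ in range(500):
    logits = reinforce_step(logits, lambda a: rewards[a],
                            baseline=baseline, rng=rng)

final_probs = softmax(logits)
```

After training, `final_probs` concentrates on the highest-reward rewrite. In a sequence model the same update applies per generated token, with the sentence-level reward shared across the tokens of the sampled sentence.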
