Abstract Text-based applications are widely used in natural language processing, but they are vulnerable to adversarial attacks. Existing attacks artificially add semantically meaningless word-, character-, or sentence-level perturbations, which compromise the syntax and consistency of the text and fail to ensure high-quality outputs. We therefore propose an attack model that generates adversarial samples using policy gradients and a generative adversarial network. First, a Seq2Seq encoder is used to generate sentences: it maps discrete text data into continuous hidden-space vectors, which are then transformed into adversarial text samples. Second, to emphasize semantics, we compute either the cosine similarity or a BERT-based semantic similarity between the original and adversarial texts and use it to calculate the reward. Finally, a policy gradient is applied to optimize the parameters. Experiments show that, while maintaining a semantic similarity above 0.8, our BERT-based method reduces classification accuracy by 51.77% on the DBpedia dataset. Our cosine similarity-based method requires only one-third to one-half of the runtime of the baseline approach.
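As a rough illustration of the similarity-based reward and the policy-gradient update described above, the following minimal sketch may help. It is not the authors' implementation. The function names, the way the reward combines similarity with the classifier's confidence, the use of 0.8 as a hard gate, and the toy two-action policy are all assumptions made for illustration only.

```python
import math


def cosine_similarity(u, v):
    """Cosine similarity between two equal-length embedding vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0  # convention for a zero vector
    return dot / (norm_u * norm_v)


def reward(orig_vec, adv_vec, p_true_label, sim_threshold=0.8):
    """Hypothetical reward: high when the adversarial text stays close to
    the original AND the classifier loses confidence in the true label.
    The product form and the hard threshold are assumptions."""
    sim = cosine_similarity(orig_vec, adv_vec)
    if sim < sim_threshold:
        return 0.0  # semantics drifted too far: no reward
    return sim * (1.0 - p_true_label)


def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def reinforce_step(logits, action, r, baseline=0.0, lr=0.1):
    """One REINFORCE update on a toy softmax policy.
    For a softmax policy, d log pi(action) / d logit_i = 1[i == action] - pi_i,
    so each logit moves by lr * (r - baseline) * that gradient."""
    probs = softmax(logits)
    return [
        logit + lr * (r - baseline) * ((1.0 if i == action else 0.0) - p)
        for i, (logit, p) in enumerate(zip(logits, probs))
    ]


# Toy usage: the sampled action (index 0) produced a close paraphrase
# that dropped the classifier's true-label probability to 0.25.
r = reward([1.0, 0.0], [1.0, 0.0], p_true_label=0.25)
new_logits = reinforce_step([0.0, 0.0], action=0, r=r)
```

In this toy setting a positive reward raises the probability of the sampled action. In a full model the logits would come from the sequence generator, and the vectors from sentence embeddings such as BERT outputs.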
Zeng et al. (Thu,) studied this question.