June 30, 2024 | Open Access

Need Text Data Augmentation? Just One Insertion Is Enough

Abstract

Data augmentation generates additional samples to expand a dataset. Modification-based methods are a commonly used augmentation technique because of their simplicity: they modify the words in sentences using simple rules. They have low complexity and cost, because they only need to scan a dataset sequentially, without complex computations. Despite this simplicity, they have a drawback: they use only the training-dataset corpus, which leads to repeated learning of the same words and limited diversity. In this study, we propose STOP-SYM, which is simpler and more effective than previous methods while addressing this drawback. Previous simple data-augmentation methods used various operations, such as delete, insert, replace, and swap, to inject diverse noise. The proposed method, STOP-SYM, generates sentences by simply inserting words. STOP-SYM uses the intersection of out-of-vocabulary (OOV) words and stopword synonyms. OOV words enable the use of a corpus beyond the training dataset, and synonyms of stopwords act as white noise, minimizing their impact on training. By inserting these words into sentences, augmented samples that increase the diversity of the dataset can be easily obtained. Compared with recent simple data-augmentation methods, our approach demonstrates superior performance. We also conducted comparative experiments on various text-classification datasets and a GPT-based model to demonstrate its superiority.
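
The insertion idea described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the function names (`candidate_words`, `insert_one`), the toy synonym table, the whitespace tokenization, and the choice of a uniformly random insertion position with exactly one inserted word per sentence are all assumptions made for illustration. A real setup would presumably draw stopwords and synonyms from a lexical resource rather than a hand-written dictionary.

```python
import random


def candidate_words(train_sentences, stopword_synonyms):
    """Return stopword synonyms that never occur in the training data.

    This is the intersection of "OOV words" and "stopword synonyms":
    a word qualifies only if it is a synonym of some stopword AND is
    absent from the training vocabulary.
    """
    vocab = {w.lower() for s in train_sentences for w in s.split()}
    synonyms = {syn.lower() for syns in stopword_synonyms.values() for syn in syns}
    return sorted(synonyms - vocab)


def insert_one(sentence, candidates, rng):
    """Insert one randomly chosen candidate at a random position (assumed policy)."""
    if not candidates:
        return sentence  # nothing to insert; leave the sample unchanged
    tokens = sentence.split()
    pos = rng.randint(0, len(tokens))  # inclusive bounds: may prepend or append
    word = rng.choice(candidates)
    return " ".join(tokens[:pos] + [word] + tokens[pos:])


# Hypothetical toy data, for illustration only.
train = ["the movie was great", "the plot was dull nevertheless"]
synonyms = {
    "but": ["however", "nevertheless", "yet"],
    "also": ["likewise", "too"],
}

cands = candidate_words(train, synonyms)  # "nevertheless" is excluded: it is in-vocabulary
augmented = insert_one(train[0], cands, random.Random(0))
print(cands)
print(augmented)
```

Because the sketch only builds a vocabulary set once and then scans each sentence, it keeps the low cost that the abstract attributes to modification-based methods, while the inserted words come from outside the training corpus.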
