Large language models (LLMs) have significantly advanced data augmentation research. However, existing approaches largely overlook two issues. First, there is insufficient empirical evidence that LLM natural language (LLMNL) aligns with human natural language (HNL), which is a foundational question. Second, current methodologies often neglect the variability among LLM-generated texts, potentially constraining the effectiveness of data augmentation. To address these gaps, we introduce a comprehensive scaling-law-based framework for examining the congruence between LLMNL and HNL. Through extensive experiments, we uncover a progression of findings: LLMNL fails to achieve congruence with HNL; the discrepancy is consistent, with Mandelbrot exponents for LLMNL approximately 0.2 lower than those of HNL; and LLMNL exhibits reduced fractal complexity, a finding corroborated by our analysis of stylistic factors such as readability, sentiment, and semantics. Furthermore, we propose a new data augmentation approach for text classification that leverages scaling laws to make decisions on LLM-generated texts. Extensive experiments under real-world scenarios demonstrate the competitiveness and robustness of the approach, which outperforms recent methods and consistently maintains its performance advantage across varying LLMs and prompts.
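To make the notion of a Mandelbrot exponent concrete, the following is a minimal sketch of how such an exponent could be estimated from a token list, assuming the standard Zipf-Mandelbrot form f(r) ≈ C / (r + β)^α. The function names, the grid search over β, and the log-space least-squares fit are illustrative assumptions, not the authors' procedure.

```python
import math
from collections import Counter


def rank_frequencies(tokens):
    """Return token counts sorted from most to least frequent."""
    return sorted(Counter(tokens).values(), reverse=True)


def fit_zipf_mandelbrot(freqs, betas=None):
    """Fit f(r) ~ C / (r + beta)**alpha by least squares in log space.

    For each candidate beta (a simple grid search, chosen for illustration),
    alpha and log C come from ordinary linear regression of log f on
    log(r + beta). Returns the (alpha, beta, log_c) with the smallest
    squared error. Requires at least two frequencies.
    """
    if betas is None:
        betas = [i * 0.1 for i in range(0, 101)]  # candidates 0.0 .. 10.0
    ys = [math.log(f) for f in freqs]
    n = len(ys)
    y_mean = sum(ys) / n
    best = None
    for beta in betas:
        xs = [math.log(r + beta) for r in range(1, n + 1)]
        x_mean = sum(xs) / n
        sxx = sum((x - x_mean) ** 2 for x in xs)
        sxy = sum((x - x_mean) * (y - y_mean) for x, y in zip(xs, ys))
        slope = sxy / sxx
        intercept = y_mean - slope * x_mean
        sse = sum((y - (intercept + slope * x)) ** 2 for x, y in zip(xs, ys))
        if best is None or sse < best[0]:
            best = (sse, -slope, beta, intercept)
    _, alpha, beta, log_c = best
    return alpha, beta, log_c
```

Under these assumptions, one would compute `fit_zipf_mandelbrot(rank_frequencies(tokens))` separately for a human-written corpus and an LLM-generated corpus and compare the two α values. The abstract reports a gap of roughly 0.2 between them.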
Zhenhua Wang
G. F. Xu
Ming Ren
ACM Transactions on Knowledge Discovery from Data
Nankai University
Renmin University of China
www.synapsesocial.com/papers/695d8e503483e917927a543d — DOI: https://doi.org/10.1145/3787100