What question did this study set out to answer?

This research aims to examine congruence between LLM-generated and human natural language through scaling laws.

January 6, 2026

Data Augmentation with Large Language Models: A Scaling Law-Guided Approach

Key Points

Introduced a scaling-law-based framework for evaluation.
Conducted extensive experiments comparing LLM natural language (LLMNL) with human natural language (HNL).
Analyzed LLM-generated texts' variability and its methodological implications.
Proposed a new data augmentation approach for text classification.
LLMNL fails to achieve congruence with HNL.
Mandelbrot exponents show a consistent discrepancy: approximately 0.2 lower for LLMNL than for HNL.
LLMNL displays reduced fractal complexity related to stylistic factors.
New approach outperforms recent methods, ensuring competitiveness and robustness across LLMs.

Abstract

Large language models (LLMs) significantly advance data augmentation research. However, existing approaches largely overlook two issues. First, empirical evidence that LLM natural language (LLMNL) aligns with human natural language (HNL) remains insufficient, which is a foundational question. Second, current methodologies often neglect the variability among LLM-generated texts, potentially constraining the effectiveness of data augmentation. To address these gaps, we introduce a comprehensive scaling-law-based framework for examining the congruence between LLMNL and HNL. Through extensive experiments, we uncover a progression of findings: LLMNL fails to achieve congruence with HNL; there is a consistent discrepancy, with Mandelbrot exponents for LLMNL being approximately 0.2 lower than those of HNL; and LLMNL exhibits reduced fractal complexity, a result corroborated by our analysis of stylistic factors such as readability, sentiment, and semantics. Furthermore, we propose a new data augmentation approach for text classification, which leverages scaling laws to make decisions on LLM-generated texts. Extensive experiments under real-world scenarios demonstrate the competitiveness and robustness of the approach, which outperforms recent methods and consistently maintains its performance advantage across varying LLMs and prompts.
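To make the "Mandelbrot exponent" mentioned in the abstract concrete, here is a minimal sketch of how such an exponent could be estimated from a token list. It assumes the standard Zipf-Mandelbrot form f(r) = C / (r + b)^a, where r is a word's frequency rank and a is the exponent. The function names, the grid search over b, and the log-space least-squares fit are illustrative assumptions, not the procedure used in the study, which may rely on a different estimator.

```python
import math
from collections import Counter


def rank_frequencies(tokens):
    """Return word frequencies sorted from most to least frequent."""
    return sorted(Counter(tokens).values(), reverse=True)


def fit_zipf_mandelbrot(freqs, b_grid=None):
    """Fit log f(r) = log C - a * log(r + b) by least squares.

    Hypothetical helper: for each candidate shift b the model is linear
    in log(r + b), so a and log C come from ordinary linear regression.
    The b with the smallest squared error is kept.
    Requires at least two frequencies, all positive.
    """
    if b_grid is None:
        b_grid = [i * 0.1 for i in range(0, 101)]  # assumed range: b in [0, 10]
    ys = [math.log(f) for f in freqs]
    n = len(ys)
    y_mean = sum(ys) / n
    best = None
    for b in b_grid:
        xs = [math.log(r + b) for r in range(1, n + 1)]
        x_mean = sum(xs) / n
        sxx = sum((x - x_mean) ** 2 for x in xs)
        sxy = sum((x - x_mean) * (y - y_mean) for x, y in zip(xs, ys))
        slope = sxy / sxx
        intercept = y_mean - slope * x_mean
        sse = sum((intercept + slope * x - y) ** 2 for x, y in zip(xs, ys))
        if best is None or sse < best[0]:
            # The exponent a is the negated slope of the log-log line.
            best = (sse, -slope, b, math.exp(intercept))
    _, a, b, c = best
    return a, b, c
```

Under these assumptions, one would call fit_zipf_mandelbrot(rank_frequencies(tokens)) once on a human-written corpus and once on an LLM-generated corpus of comparable size, then compare the two values of a. The abstract reports that the LLMNL exponent is approximately 0.2 lower than the HNL exponent.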
