Large language models (LLMs) increasingly mediate real-world tasks, yet we lack systematic ways to quantify how their performance degrades when the meaning of their inputs is eroded. To bridge this gap, we developed a framework for eroding the meaning of text and quantifying the intensity of that erosion. The framework is grounded in discourse analysis, psycholinguistics, and software engineering, and comprises five theoretically motivated methods: omission of key information and context, lexical substitution with near-synonyms, increased abstraction, structural obfuscation and renaming, and injection of logical errors. We applied these erosion operators across five domains and quantified their effects on model performance using a publicly available language model. A two-way analysis of variance (ANOVA) revealed significant main effects of both domain and erosion method, as well as a significant interaction, indicating that the impact of semantic degradation depends jointly on how text is eroded and how domain-specific information is encoded. Logical error erosions proved especially damaging for code generation, whereas structural obfuscation most strongly impaired news and instruction tasks. Epistasis analysis of pairwise erosion unions showed that some combinations produced super-additive degradation while others exhibited compensatory effects. These domain-by-erosion profiles provide diagnostic insight into where multi-step LLM pipelines are most likely to fail and suggest that robustness benchmarks should probe models along domain-specific vulnerability dimensions rather than relying on generic perturbations. Semantic erosion thus offers a principled tool for turning model failure into evidence about how language models structure and degrade meaning.
Astrom et al. studied this question.
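The epistasis analysis mentioned in the abstract can be illustrated with a minimal sketch. The abstract does not give the exact formula, so the code below assumes a simple additive null model: the expected degradation of a pair of erosions is the sum of their individual degradations, and the epistasis term is the observed combined degradation minus that sum. The function names, the tolerance, and all scores are hypothetical and chosen only for illustration; they are not results from the paper.

```python
from itertools import combinations


def degradation(baseline, score):
    """Performance lost relative to the un-eroded baseline."""
    return baseline - score


def pairwise_epistasis(baseline, single, combined):
    """Return {(a, b): epsilon} for each erosion pair with a combined score.

    Assumed additive null model (not taken from the paper):
        epsilon = d(a and b) - (d(a) + d(b))
    Keys of `combined` must be alphabetically sorted name pairs.
    """
    result = {}
    for a, b in combinations(sorted(single), 2):
        if (a, b) not in combined:
            continue  # this pair was not measured
        expected = degradation(baseline, single[a]) + degradation(baseline, single[b])
        observed = degradation(baseline, combined[(a, b)])
        result[(a, b)] = observed - expected
    return result


def classify(epsilon, tol=0.01):
    """Label an epistasis value; `tol` is an arbitrary illustrative threshold."""
    if epsilon > tol:
        return "super-additive"
    if epsilon < -tol:
        return "compensatory"
    return "additive"


# Made-up accuracy scores for one domain, for illustration only.
baseline = 0.90
single = {"omission": 0.80, "logical_error": 0.70, "obfuscation": 0.85}
combined = {
    ("logical_error", "omission"): 0.50,
    ("obfuscation", "omission"): 0.80,
}

epistasis = pairwise_epistasis(baseline, single, combined)
labels = {pair: classify(eps) for pair, eps in epistasis.items()}
for pair, eps in epistasis.items():
    print(pair, round(eps, 3), labels[pair])
```

Under this assumed definition, a positive value means the pair degrades performance more than the two erosions would independently (super-additive), and a negative value means the combined loss is smaller than the sum (compensatory). Other null models, such as a multiplicative one on retained performance, would give different values for the same scores.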