Abstract: As Large Language Models (LLMs) become widely adopted, understanding how they learn from, and memorize, training data becomes crucial. Memorization in LLMs is widely assumed to occur only when sequences are repeated in the training data. We show instead that LLMs memorize by assembling information from similar sequences, a phenomenon we call mosaic memory. Major LLMs exhibit mosaic memory: a fuzzy duplicate can contribute to memorization as much as 0.8 of an exact duplicate, and even heavily modified sequences contribute substantially. Although models display significant reasoning capabilities, we find, somewhat surprisingly, that memorization is predominantly syntactic rather than semantic. Finally, we show that fuzzy duplicates are ubiquitous in real-world data and are left untouched by deduplication techniques. Together, these results show memorization to be a complex, mosaic process, with real-world implications for privacy, confidentiality, model utility and evaluation.
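To make the notion of a fuzzy duplicate concrete, the sketch below builds one by swapping a few tokens in a reference sequence, then shows that an exact-match fingerprint treats the two sequences as unrelated even though most positions still agree. This is an illustration only and not the authors' experimental setup: the token-replacement scheme, the whitespace tokenization, and the helper names `make_fuzzy_duplicate`, `exact_fingerprint`, and `positional_overlap` are all assumptions introduced here.

```python
import hashlib
import random


def make_fuzzy_duplicate(tokens, n_replacements, vocabulary, seed=0):
    """Copy `tokens`, swapping `n_replacements` positions for a different
    token drawn from `vocabulary` (hypothetical helper, for illustration)."""
    rng = random.Random(seed)
    result = list(tokens)
    for position in rng.sample(range(len(result)), n_replacements):
        choices = [tok for tok in vocabulary if tok != result[position]]
        result[position] = rng.choice(choices)
    return result


def exact_fingerprint(tokens):
    """Hash of the whole sequence, as an exact-match deduplicator might use."""
    return hashlib.sha256(" ".join(tokens).encode("utf-8")).hexdigest()


def positional_overlap(a, b):
    """Fraction of aligned positions that hold the same token."""
    if len(a) != len(b):
        raise ValueError("sequences must have equal length")
    return sum(x == y for x, y in zip(a, b)) / len(a)


# Toy data (assumed): whitespace tokens stand in for model tokens.
reference = "the quick brown fox jumps over the lazy dog near the river".split()
vocabulary = ["cat", "slow", "red", "walks", "under", "hill", "a", "big"]
fuzzy = make_fuzzy_duplicate(reference, n_replacements=2, vocabulary=vocabulary)

same_hash = exact_fingerprint(reference) == exact_fingerprint(fuzzy)
overlap = positional_overlap(reference, fuzzy)
print(f"exact match: {same_hash}, positional overlap: {overlap:.2f}")
```

The two sequences share ten of twelve positions, yet their fingerprints differ, which is one way a deduplication pass based on exact matching could leave such near-copies in a corpus.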
Igor Shilov
Yves-Alexandre de Montjoye
Nature Communications
Imperial College London
Shilov et al. studied this question.
www.synapsesocial.com/papers/6980fbe1c1c9540dea80da6f — DOI: https://doi.org/10.1038/s41467-026-68603-0