What question did this study set out to answer?

The study aims to explore how large language models memorize training data, focusing on the concept of mosaic memory.

February 2, 2026 | Open Access

The mosaic memory of large language models

Key Points

Analysis of memorization processes in major large language models
Evaluation of the role of fuzzy duplicates and modified sequences in memorization
Comparison of syntactic versus semantic memorization
Assessment of the prevalence of fuzzy duplicates in real-world data
LLMs demonstrated mosaic memory by assembling information from similar sequences.
Fuzzy duplicates contributed substantially to memorization, comparable to exact duplicates in effect.
Memorization was found to be primarily syntactic rather than semantic.
Fuzzy duplicates were prevalent in real-world data, even after deduplication efforts.

Abstract

As Large Language Models (LLMs) become widely adopted, understanding how they learn from, and memorize, training data becomes crucial. Memorization in LLMs is widely assumed to only occur as a result of sequences being repeated in the training data. Instead, we show that LLMs memorize by assembling information from similar sequences, a phenomenon we call mosaic memory. We show major LLMs to exhibit mosaic memory, with fuzzy duplicates contributing to memorization as much as 0.8 of an exact duplicate and even heavily modified sequences contributing substantially to memorization. Despite models displaying significant reasoning capabilities, we somewhat surprisingly show memorization to be predominantly syntactic rather than semantic. We finally show fuzzy duplicates to be ubiquitous in real-world data, untouched by deduplication techniques. In this work, we show memorization to be a complex, mosaic process, with real-world implications for privacy, confidentiality, model utility and evaluation.
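To make the idea of a fuzzy duplicate concrete, the following is a minimal sketch of why exact-match deduplication can leave near-identical sequences in a corpus. It is an illustration only, not the method used in the paper: the whitespace tokenization, the `difflib`-based similarity measure, the helper names (`token_similarity`, `exact_dedup`, `fuzzy_pairs`), and the 0.75 threshold are all assumptions chosen for the example. The threshold in particular is unrelated to the 0.8 figure in the abstract, which describes a contribution to memorization rather than a similarity score.

```python
from difflib import SequenceMatcher


def token_similarity(a: str, b: str) -> float:
    """Similarity in [0, 1] between two whitespace-tokenized strings.

    Assumption: this measure is a stand-in for whatever notion of
    closeness a real fuzzy-duplicate analysis would use.
    """
    return SequenceMatcher(None, a.split(), b.split()).ratio()


def exact_dedup(docs):
    """Drop only identical repeats, keeping first occurrences in order."""
    return list(dict.fromkeys(docs))


def fuzzy_pairs(docs, threshold=0.75):
    """Index pairs of non-identical docs whose similarity meets the threshold.

    The threshold value is hypothetical, picked for this toy corpus.
    """
    pairs = []
    for i in range(len(docs)):
        for j in range(i + 1, len(docs)):
            if docs[i] != docs[j] and token_similarity(docs[i], docs[j]) >= threshold:
                pairs.append((i, j))
    return pairs


# Toy corpus: one exact repeat, one sequence with two tokens replaced,
# and one unrelated sequence.
docs = [
    "the quick brown fox jumps over the lazy dog near the river",
    "the quick brown fox jumps over the lazy dog near the river",
    "the quick brown fox leaps over the lazy dog near the lake",
    "an entirely different sentence about training data",
]

kept = exact_dedup(docs)
print(len(kept))          # the exact repeat is removed
print(fuzzy_pairs(kept))  # the token-replaced variant survives as a fuzzy duplicate
```

In this toy corpus, exact deduplication removes the verbatim repeat but keeps the variant with two replaced tokens, which is the kind of sequence the abstract describes as untouched by deduplication. The pairwise comparison is quadratic in the number of documents, so it is suitable only for small illustrations.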
