How models memorize from one pass #substack #shorts
Key Moments
Even a small fraction of the data can be memorized; training must balance pretraining, continued pretraining, and forgetting.
Key Insights
Memorization is a natural byproduct of next-token prediction, not a flaw.
A tiny exposure (as little as 1–5% of the training corpus) can lead to durable memorization of content.
There is a delicate balance between pretraining, continued pretraining, and forgetting.
Memorized content can persist across training stages, but old data must be revisited to stay stable.
Measuring memory is complex; a single metric cannot fully capture the memory landscape.
Training design should account for how memorization interacts with learning new information.
MEMORIZATION AS A DESIGN CONSEQUENCE
Memorization is a natural outcome of the next-token prediction objective. When a model predicts the next token, it implicitly encodes patterns, sequences, and even exact strings it has encountered. With immense capacity and exposure to vast data, the model can store information from training data even after a single pass through parts of the corpus. This isn't a flaw but a feature of how language models learn and encode statistical structure.
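To make this concrete, below is a minimal sketch of a verbatim-memorization probe (an illustration, not a procedure described in the video): prompt the model with the opening tokens of a training passage and check whether greedy decoding reproduces the rest exactly. The model name "gpt2", the function name, and the 32-token prefix length are placeholder choices.

```python
# Minimal sketch of a verbatim-memorization probe (illustrative, not the video's method).
# Prompt the model with the start of a training passage and check whether greedy
# decoding reproduces the continuation exactly.
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "gpt2"  # placeholder stand-in for whatever checkpoint is being probed
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)

def is_memorized(text: str, prefix_tokens: int = 32) -> bool:
    """True if greedy decoding from the prefix reproduces the original continuation."""
    ids = tokenizer(text, return_tensors="pt").input_ids[0]
    prefix, target = ids[:prefix_tokens], ids[prefix_tokens:]
    out = model.generate(
        prefix.unsqueeze(0),
        max_new_tokens=len(target),
        do_sample=False,  # greedy decoding: memorized text should win at every step
        pad_token_id=tokenizer.eos_token_id,
    )
    continuation = out[0][len(prefix):]
    return continuation.tolist() == target.tolist()
```

Counting how many training passages pass this check gives a crude extraction rate for a given checkpoint.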
LIMITED EXPOSURE, SIGNIFICANT MEMORIZATION
Even when a piece of content makes up a tiny fraction of the data, often just 1% to 5% of the corpus, the model can memorize enough to recall or reproduce it when prompted. Because next-token training rewards patterns that recur or align closely with training text, even a small number of exposures can leave a durable imprint. The result is a surprising ability to produce near-verbatim renditions or faithful summaries of memorized material, illustrating how memorization scales with data and capacity.
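As a rough illustration of such an exposure level (an assumed setup, not one prescribed in the video), the sketch below duplicates a "canary" document until it makes up a chosen fraction of a toy corpus, so its retention can later be tested with a probe like the one above. The function name and the 2% fraction are placeholders.

```python
# Sketch: build a training stream where a canary document occupies roughly a target
# fraction of all documents (e.g. 2%), for studying durable memorization.
import random

def mix_with_canary(corpus: list[str], canary: str,
                    fraction: float = 0.02, seed: int = 0) -> list[str]:
    """Duplicate `canary` so it accounts for about `fraction` of the documents."""
    n_copies = max(1, round(fraction * len(corpus) / (1.0 - fraction)))
    mixed = corpus + [canary] * n_copies
    random.Random(seed).shuffle(mixed)  # spread the copies through the stream
    return mixed

# 98 ordinary documents plus ~2 canary copies -> roughly 2% exposure
stream = mix_with_canary([f"doc {i}" for i in range(98)], "the canary passage", fraction=0.02)
```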
DUPLICATION LEVEL AND FORGETTING ACROSS TRAINING STAGES
Training involves two faces of memory: memorized content and generalizable knowledge. There appears to be an optimal 'duplication' level at each stage: too much pre-training can cause the model to forget basic facts, while post-training adjustments can help recover or reinforce capabilities learned earlier. As training proceeds, the balance shifts: early data is at risk of being forgotten, whereas later fine-tuning can help preserve or enhance it, creating a dynamic memory landscape.
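One way to chart this dynamic (my assumption about methodology; the talk does not specify one) is to score the same set of early-data passages at successive checkpoints and watch whether their loss creeps up, i.e. a forgetting curve. The checkpoint paths and probe strings below are hypothetical placeholders.

```python
# Sketch of a forgetting curve: measure mean next-token loss on a fixed set of
# early-data passages at successive checkpoints; rising loss signals forgetting.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

checkpoints = ["ckpt-pretrain-early", "ckpt-pretrain-late", "ckpt-posttrain"]  # placeholders
probes = ["a passage seen early in pre-training ...", "another early passage ..."]

for path in checkpoints:
    tokenizer = AutoTokenizer.from_pretrained(path)
    model = AutoModelForCausalLM.from_pretrained(path).eval()
    with torch.no_grad():
        losses = []
        for text in probes:
            ids = tokenizer(text, return_tensors="pt").input_ids
            # passing labels=input_ids makes the model return the mean next-token loss
            losses.append(model(ids, labels=ids).loss.item())
    print(f"{path}: mean loss on early passages = {sum(losses) / len(losses):.3f}")
```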
CONTINUED PRE-TRAINING REQUIRES REVISITING OLD DATA
Continued pre-training typically requires revisiting old material to avoid forgetting. Even when the majority of the data is new, periodic re-encounters with earlier content reinforce and stabilize it. The practice underscores that learning in large models is not a one-way street: memory must be refreshed so the model does not drift away from established facts, while still adapting to new information and patterns.
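A common way to implement this kind of refresh is a replay-style data mixture; the sketch below shows the generic technique, not necessarily the speaker's exact recipe. Each batch draws mostly new documents but re-samples a small share of old data; the 10% replay rate and function name are placeholders.

```python
# Sketch of replay during continued pre-training: each batch is mostly new data plus
# a small re-sampled slice of old data, so earlier knowledge keeps being refreshed.
import random

def replay_batches(new_docs: list[str], old_docs: list[str],
                   batch_size: int = 8, replay_fraction: float = 0.1, seed: int = 0):
    """Yield batches of mostly new documents with a small replayed slice of old ones."""
    rng = random.Random(seed)
    n_replay = max(1, int(batch_size * replay_fraction))
    n_new = batch_size - n_replay
    for start in range(0, len(new_docs), n_new):
        batch = new_docs[start:start + n_new] + rng.sample(old_docs, n_replay)
        rng.shuffle(batch)  # interleave old and new within the batch
        yield batch
```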
MEMORY MEASUREMENT CHALLENGES: THE KNOWLEDGE TANK
There is no single metric that fully captures model memory. The idea of a 'knowledge tank', a reservoir of memorized information sitting alongside general ability, conveys the picture, but the tank's contents can be invisible to standard evaluations. This makes it hard to quantify how much is memorized, what is retained after further training, and how much originality remains when the model recalls from memory.
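In that spirit, a report that combines several complementary signals is more informative than any single number; the sketch below (my assumption of one such bundle, not a metric from the video) pairs a verbatim extraction rate with a seen-versus-unseen loss gap and an n-gram copy rate. All field names are illustrative.

```python
# Sketch: bundle complementary memorization signals rather than relying on one number.
def memory_report(extraction_hits: int, n_probes: int,
                  loss_seen: float, loss_unseen: float,
                  copied_ngrams: int, total_ngrams: int) -> dict:
    """Summarize memory with several signals; no single field is 'the' answer."""
    return {
        "extraction_rate": extraction_hits / n_probes,  # share of probes reproduced verbatim
        "loss_gap": loss_unseen - loss_seen,            # larger gap suggests more memorization
        "copy_rate": copied_ngrams / total_ngrams,      # overlap of generations with training text
    }
```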
IMPLICATIONS FOR TRAINING PRACTICES
These observations push practitioners to design training schedules that balance memorization and forgetting. Dataset curation, the pacing of pre-training versus fine-tuning, and careful monitoring of recall versus novelty become essential. Knowing that small data fractions can seed long-term memory motivates deliberate data-reuse strategies and safeguards against unintended memorization; it also shapes how we evaluate models, mitigate leakage, and guide development toward robust, reliable behavior.
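For the recall-versus-novelty monitoring mentioned above, one simple check (an assumed example, not something prescribed in the video) is the fraction of a generation's word n-grams that appear verbatim in training text; high values flag recalled rather than novel output. The 8-word n-gram size and function name are placeholders.

```python
# Sketch: flag generations whose word n-grams overlap heavily with training text,
# separating recalled (memorized) output from novel output during evaluation.
def copy_rate(generation: str, training_texts: list[str], n: int = 8) -> float:
    """Fraction of the generation's word n-grams found verbatim in the training texts."""
    words = generation.split()
    grams = [" ".join(words[i:i + n]) for i in range(len(words) - n + 1)]
    if not grams:
        return 0.0
    joined = "\n".join(training_texts)
    return sum(g in joined for g in grams) / len(grams)

# Example: treat outputs above a threshold as likely recalled rather than novel.
# if copy_rate(output, corpus) > 0.5: ...
```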
Common Questions
Can a model memorize content after seeing it only once or twice?
Yes. The speaker notes that a model can memorize and render a perfect recap of information after seeing it only once or twice in the training corpus, due to the way next-token prediction works. This is discussed in the early part of the video (timestamp ~1s to ~34s).