Key Moments

The Magic of LLM Distillation — Rishabh Agarwal, Google DeepMind

Latent Space Podcast | Mar 23, 2025 | 47 min video | 4 min read
TL;DR

LLM distillation: From logits to synthetic data and RL-inspired methods for efficient model deployment.

Key Insights

1. Distillation's primary goal is transferring knowledge from a larger 'teacher' model to a smaller 'student' model to reduce cost and enhance deployability.

2. Traditional distillation (matching logits) is effective but has limitations, leading to methods like synthetic data generation and RL-inspired approaches.

3. Synthetic data distillation involves generating outputs from a teacher model and fine-tuning the student on this data, offering simplicity and API accessibility.

4. RL-inspired distillation, particularly on-policy methods, addresses the train-test mismatch inherent in autoregressive generation by sampling from the student.

5. Speculative decoding can be significantly sped up by using a distilled student model that closely mimics the teacher's behavior, enabling faster inference.

6. The choice of distillation method depends on task requirements, available resources (such as logits access), and computational budget, with a trade-off between performance, diversity, and complexity.

THE EVOLVING LANDSCAPE OF DISTILLATION

The concept of model distillation, originating from Geoffrey Hinton's work in 2015, has evolved significantly beyond its initial application to classifiers. The core idea remains transferring knowledge from a larger, more capable 'teacher' model to a smaller, more efficient 'student' model. This is crucial for practical deployment, especially in resource-constrained environments like smartphones, and addresses the cost-performance trade-off: achieving high performance at a low computational cost. The field's progress, particularly with the advent of large language models (LLMs), has spurred new distillation techniques beyond the traditional methods.

TRADITIONAL LOGIT MATCHING AND ITS LIMITATIONS

The foundational distillation method matches the output distributions (logits) of the teacher and student models on a given input. This generalizes next-token prediction by training on 'soft' probability distributions over the entire vocabulary rather than a single 'hard' target token. While principled and effective, this method requires access to the teacher's logits, which may not be available for black-box models served through an API. Furthermore, the training distribution (fixed sequences scored by the teacher) differs from the student's own autoregressive generations at inference time, creating a potential train-test mismatch.
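As a concrete illustration, here is a minimal, dependency-free sketch of the soft-label objective at a single token position: the forward KL divergence between the teacher's and student's temperature-softened distributions. The function names and toy logits are illustrative, not taken from the episode.

```python
import math

def softmax(logits, temperature=1.0):
    """Convert raw logits into a probability distribution."""
    exps = [math.exp(l / temperature) for l in logits]
    total = sum(exps)
    return [e / total for e in exps]

def distillation_loss(teacher_logits, student_logits, temperature=2.0):
    """Forward KL(teacher || student) over the vocabulary at one token
    position -- the classic 'soft label' matching objective."""
    p = softmax(teacher_logits, temperature)  # soft teacher targets
    q = softmax(student_logits, temperature)  # student predictions
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

# Identical distributions give zero loss; diverging ones give positive loss.
same = distillation_loss([2.0, 1.0, 0.1], [2.0, 1.0, 0.1])
diff = distillation_loss([2.0, 1.0, 0.1], [0.1, 1.0, 2.0])
```

In practice the loss is summed over every position of a teacher-forced sequence, which is exactly why the next section's train-test mismatch arises: the student is never scored on its own generations.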

SYNTHETIC DATA DISTILLATION: SIMPLICITY AND ACCESSIBILITY

A widely adopted distillation strategy involves generating synthetic data. This method uses a teacher model to produce outputs for a set of prompts, and then the student model is fine-tuned on this generated data, often using supervised fine-tuning (SFT). A notable enhancement is 'best of N,' where multiple outputs are generated and the best ones are selected, filtering for correctness or quality. This approach is highly practical as it primarily requires API access to the teacher model and does not necessitate access to logits, making it accessible for distilling from various models, including those with different tokenizers.
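A minimal sketch of the best-of-N recipe described above, with toy stand-ins (`teacher`, `scorer`) for the real teacher API and quality filter; all names here are hypothetical:

```python
import random

def best_of_n(prompt, sample_fn, score_fn, n=4):
    """Draw n candidate outputs from the teacher for one prompt and
    keep the highest-scoring one (correctness / quality filtering)."""
    candidates = [sample_fn(prompt) for _ in range(n)]
    return max(candidates, key=score_fn)

def build_sft_dataset(prompts, sample_fn, score_fn, n=4):
    """Build (prompt, best teacher output) pairs for student fine-tuning."""
    return [(p, best_of_n(p, sample_fn, score_fn, n)) for p in prompts]

# Toy stand-ins: the 'teacher' returns a random integer answer, and the
# scorer rewards answers close to the true value 42.
random.seed(0)
teacher = lambda prompt: random.randint(0, 100)
scorer = lambda answer: -abs(answer - 42)
dataset = build_sft_dataset(["q1", "q2"], teacher, scorer, n=8)
```

Note that only `sample_fn` touches the teacher, so any black-box API works; no logits or shared tokenizer are needed, which is the accessibility the section describes.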

RL-INSPIRED AND ON-POLICY DISTILLATION

To address the train-test mismatch inherent in autoregressive generation, Reinforcement Learning (RL) inspired distillation methods have emerged. On-policy distillation, in the spirit of DAgger from imitation learning, samples sequences from the student model itself and then uses the teacher model to provide feedback or corrections. Because training (sampling from the student) now matches testing (the student's own autoregressive generation), the mismatch shrinks. Mathematically, this can be derived from the KL divergence objective by swapping the roles of the teacher and student distributions, which shows the approach is principled and mitigates cases where the student drifts far from the teacher's learned distribution.
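The role swap can be sketched at a single token position: tokens are sampled from the *student's* distribution (matching how it generates at test time), and the teacher's log-probabilities supply the correction signal, giving a Monte-Carlo estimate of the reverse KL(student || teacher). All names below are illustrative.

```python
import math
import random

def softmax(logits):
    exps = [math.exp(l) for l in logits]
    z = sum(exps)
    return [e / z for e in exps]

def on_policy_kl(student_logits, teacher_logits, n_samples=1000, seed=0):
    """Monte-Carlo estimate of reverse KL(student || teacher): tokens
    are drawn from the student's own distribution, so training sees the
    same states the student visits at inference time."""
    rng = random.Random(seed)
    q = softmax(student_logits)   # student distribution (sampled from)
    p = softmax(teacher_logits)   # teacher distribution (scores samples)
    tokens = rng.choices(range(len(q)), weights=q, k=n_samples)
    return sum(math.log(q[t] / p[t]) for t in tokens) / n_samples

# A student that already matches the teacher incurs zero penalty.
est = on_policy_kl([2.0, 1.0, 0.1], [2.0, 1.0, 0.1])
```

Contrast this with the forward-KL sketch in the logit-matching section: there the expectation is taken under the teacher; here it is taken under the student, which is exactly the teacher/student role swap the derivation refers to.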

IMPROVING EFFICIENCY WITH SPECULATIVE DECODING

Distillation plays a crucial role in enhancing the efficiency of large models, particularly through speculative decoding. In this technique, a smaller, distilled 'student' model generates multiple token proposals, which a larger 'teacher' model then verifies. If the teacher accepts the proposals, inference can be significantly accelerated as multiple tokens are generated in parallel. The effectiveness hinges on the student's ability to mimic the teacher; distillation precisely improves this mimicry, allowing for faster speculative decoding. This dual benefit—creating a better student and speeding up the teacher—makes distillation a valuable strategy for deployment.
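A toy sketch of one draft-and-verify round under greedy decoding, with simple lookup functions standing in for the real student and teacher models (a simplification: production implementations verify all draft positions in a single batched teacher pass and use probabilistic acceptance for sampled decoding):

```python
def speculative_step(draft_next, verify_next, context, k=4):
    """One round of greedy speculative decoding: the small student
    drafts k tokens, the teacher checks them, and the longest agreeing
    prefix is accepted, plus one token from the teacher itself."""
    # Student drafts k tokens autoregressively (cheap).
    draft, ctx = [], list(context)
    for _ in range(k):
        token = draft_next(ctx)
        draft.append(token)
        ctx.append(token)
    # Teacher verifies the draft; accept until the first disagreement.
    accepted, ctx = [], list(context)
    for token in draft:
        if verify_next(ctx) == token:
            accepted.append(token)
            ctx.append(token)
        else:
            break
    accepted.append(verify_next(ctx))  # teacher supplies the next token
    return accepted

# Toy models over a tiny alphabet: the student mimics the teacher at
# positions 0, 1, and 3 but disagrees at position 2.
teacher = lambda ctx: ["a", "b", "c", "d"][len(ctx) % 4]
student = lambda ctx: ["a", "b", "x", "d"][len(ctx) % 4]
out = speculative_step(student, teacher, [], k=4)  # accepts "a", "b", then "c"
```

The better the student mimics the teacher, the longer the accepted prefix per round, which is precisely why a distilled draft model speeds up speculative decoding.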

NAVIGATING TRADE-OFFS AND FUTURE DIRECTIONS

Choosing the right distillation method involves considering trade-offs between computational cost, performance, and data availability. Offline methods are simpler and can be faster to set up, while online (on-policy) methods, though more compute-intensive during training, better manage train-test mismatch and are often more effective for long-horizon or agentic tasks. The core dilemma often lies between using readily available synthetic data or leveraging more complex methods like logit matching or on-policy sampling, which may offer superior performance but demand more resources or expertise. Research continues to push the boundaries, exploring how to maximize knowledge transfer and efficiency.

Distillation Best Practices

Practical takeaways from this episode

Do This

Consider distillation as a deployment technique to make models practical.
Experiment with different distillation methods (logits, synthetic data, RL-inspired) to find what works best for your task and resources.
Leverage existing RL frameworks by adapting them for distillation.
If possible, use logits for distillation as they contain more information.
For long-horizon or agentic tasks, on-policy distillation might be more optimal.
Start with simpler methods like synthetic data distillation, as they often provide significant gains.
When using synthetic data, consider selection methods like 'best-of-N' or MBR (minimum Bayes risk) decoding for further optimization.
For tasks requiring a balance of performance and diversity, experiment with mixing KL divergence objectives.
Pair every large model with a smaller distilled student for faster serving and potential speculative decoding.
Continuously question if current methods are suboptimal and explore newer, more effective techniques.

Avoid This

Don't assume the original Hinton distillation using logits is the only or best method, especially for LLMs.
Don't overlook the potential of using smaller models to generate synthetic data for distilling into larger models (compute/cost-matched settings).
Don't dismiss RL-inspired distillation methods, as they can address train-test mismatches.
Don't discard the possibility of improving a larger model by distilling data generated from a smaller, more numerous source.
Don't ignore the trade-off between performance and diversity when choosing distillation objectives.
Don't stick to old methods if newer, more effective ones are available and feasible.

Common Questions

What is LLM distillation, and why is it important?

LLM distillation is a technique to transfer knowledge from a large, expensive 'teacher' model to a smaller, more efficient 'student' model. It's important for making powerful AI models deployable and cost-effective for various applications, especially on resource-constrained devices.
