⚡ Open Model Pretraining Masterclass — Elie Bakouch, HuggingFace SmolLM 3, FineWeb, FinePDF
Key Moments
Hugging Face expert Elie Bakouch discusses open model pretraining, data quality, MoEs, and new optimizers.
Key Insights
The five pillars of model training are data quality, model architecture, information extraction efficiency, gradient quality, and training stability.
High-quality and diverse data is crucial ('garbage in, garbage out'), and datasets like FineWeb and FinePDF are improving data availability.
New optimizers like Muon and Shampoo are emerging as alternatives to Adam, aiming for better stability and learning efficiency.
Mixture of Experts (MoE) models offer performance gains for the same FLOPs but require careful attention to load balancing and specialization.
Rephrasing data and casting it into conversational/QA formats are key to improving model performance on benchmarks like MMLU.
Open-source efforts are vital for transparency, and Hugging Face aims to build models the community needs, including very small ones and MoE variants.
THE FIVE PILLARS OF MODEL TRAINING
Elie Bakouch outlines a framework for effective model pretraining centered on five core pillars: optimizing data quality and diversity, designing efficient model architectures, maximizing information extraction from data, ensuring high-quality gradients during optimization, and maintaining training stability at scale. This holistic view emphasizes that advances are needed across multiple fronts, not just within a single area, to build powerful and reliable language models.
DATA QUALITY AND CURATION: FINEWEB AND FINEPDF
Data quality is paramount: 'garbage in, garbage out.' Bakouch highlights the importance of high-quality, diverse datasets. Hugging Face's research efforts contribute significantly here, exemplified by FineWeb for web data and the more recent FinePDF, which addresses the under-exploration of PDF content. These curated datasets aim to provide better raw material for training, leading to improved model performance, as demonstrated by their competitive results against other leading models.
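Web-scale curation pipelines of this kind typically start with cheap heuristic filters before any model-based scoring. The sketch below illustrates that flavor of filtering; the specific thresholds and heuristics are illustrative inventions for this example, not the actual FineWeb or FinePDF rules.

```python
def heuristic_quality_score(text: str) -> float:
    """Score a document on crude quality heuristics (0 = worst, 1 = best).

    Illustrative only: real pipelines combine many such signals with
    deduplication and model-based classifiers.
    """
    words = text.split()
    if not words:
        return 0.0
    # Very short documents carry little training signal.
    length_ok = 1.0 if len(words) >= 50 else len(words) / 50
    lines = [ln for ln in text.splitlines() if ln.strip()]
    # A healthy fraction of lines should end in terminal punctuation.
    punct_frac = sum(ln.rstrip().endswith(('.', '!', '?')) for ln in lines) / max(len(lines), 1)
    # Penalize boilerplate-heavy pages (many repeated lines).
    unique_frac = len(set(lines)) / max(len(lines), 1)
    return length_ok * (0.5 + 0.5 * punct_frac) * unique_frac
```

A document of repeated link-bait text scores far lower than a paragraph of complete sentences, which is the basic behavior these filter stacks rely on.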
ADVANCEMENTS IN OPTIMIZERS AND TRAINING STABILITY
The landscape of optimizers is evolving beyond the standard AdamW. Bakouch discusses new alternatives like Muon and Shampoo, which aim to offer improved learning stability and efficiency by better approximating the Hessian matrix or projecting into data space. Techniques like QK clip and methods to manage gradient divergence are crucial for preventing training explosions, especially at scale. Ensuring proper hyperparameter tuning, including learning rate and weight decay, is critical for realizing the benefits of these advanced optimizers.
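For concreteness, the core operation in Muon is an approximate orthogonalization of the gradient/momentum matrix via a quintic Newton-Schulz iteration. Below is a minimal numpy sketch of just that step; the coefficients follow the public Muon reference implementation, while momentum, learning rate, and weight decay handling are omitted.

```python
import numpy as np

def newton_schulz_orthogonalize(G: np.ndarray, steps: int = 5) -> np.ndarray:
    """Approximately replace G by U V^T from its SVD (Muon's core step).

    Sketch only: the quintic iteration drives all singular values toward 1,
    so the update direction treats every direction in parameter space
    roughly equally instead of being dominated by a few large ones.
    """
    a, b, c = 3.4445, -4.7750, 2.0315  # coefficients from the Muon reference code
    X = G / (np.linalg.norm(G) + 1e-7)  # normalize so the iteration converges
    transposed = X.shape[0] > X.shape[1]
    if transposed:  # iterate on the wide orientation for the smaller Gram matrix
        X = X.T
    for _ in range(steps):
        A = X @ X.T
        X = a * X + (b * A + c * A @ A) @ X
    return X.T if transposed else X

rng = np.random.default_rng(0)
W = newton_schulz_orthogonalize(rng.normal(size=(16, 32)))
```

After five iterations the singular values of the result cluster near 1 (the tuned coefficients trade exact convergence for speed, so they land in a band around 1 rather than exactly on it).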
MIXTURE OF EXPERTS (MOE) AND SPECIALIZATION
Mixture of Experts (MoE) models represent a significant trend, offering increased performance for a given computational budget (FLOPs). However, their effective implementation hinges on sophisticated load balancing to ensure all experts are utilized. Bakouch explains how computing statistics at the micro-batch level is crucial for proper expert activation, preventing scenarios where certain experts are underutilized. This allows for domain specialization, enabling potential pruning of less relevant experts and further efficiency gains.
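A standard way to enforce the balanced expert utilization described above is a Switch-Transformer-style auxiliary loss computed per micro-batch. A hedged numpy sketch with top-1 routing follows; the exact loss and routing used in any particular model may differ.

```python
import numpy as np

def softmax(x: np.ndarray, axis: int = -1) -> np.ndarray:
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def load_balancing_loss(router_logits: np.ndarray) -> float:
    """Auxiliary load-balancing loss over one micro-batch.

    router_logits: (tokens, n_experts). Returns
    n_experts * sum_i f_i * P_i, where f_i is the fraction of tokens
    routed (top-1) to expert i and P_i is the mean router probability
    for expert i. The minimum value 1.0 is reached at uniform routing;
    collapse onto one expert pushes it toward n_experts.
    """
    n_tokens, n_experts = router_logits.shape
    probs = softmax(router_logits, axis=-1)
    top1 = probs.argmax(axis=-1)
    f = np.bincount(top1, minlength=n_experts) / n_tokens
    P = probs.mean(axis=0)
    return float(n_experts * np.dot(f, P))
```

Because `f` and `P` are computed over the micro-batch, the statistics Bakouch mentions are exactly what this loss aggregates: if one expert captures every token, both `f` and `P` concentrate on it and the loss grows.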
THE ROLE OF DATA REPHRASING AND FORMATTING
Curating data is not just about content but also format. Rephrasing and ensuring data is in conversational or question-answering (QA) formats significantly boosts performance on benchmarks like MMLU. While high-quality data is beneficial, rephrasing low-quality data can also yield substantial improvements. There's an ongoing discussion about the trade-off between perfectly clean data and the need for models to handle real-world imperfections like typos, suggesting a balance is necessary for robust, general-purpose models.
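In practice the rephrasing itself is done with an LLM (as in the rephrasing work the episode references); the tiny sketch below only illustrates the *target* QA shape that helps on MMLU-style benchmarks. The template and function name are inventions for this example.

```python
def to_qa_format(passage: str, topic: str) -> str:
    """Wrap a raw passage in a question-answer template.

    Illustrative only: real pipelines prompt a model to generate a natural
    question and a rephrased answer, rather than templating verbatim text.
    """
    question = f"What does the following say about {topic}?"
    return f"Question: {question}\nAnswer: {passage.strip()}"
```

The point of the format, rather than the template, is what matters: a model that has only ever seen raw prose can score at random on QA-formatted benchmarks until it has consumed enough QA-shaped tokens.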
OPEN SCIENCE AND COMMUNITY-DRIVEN DEVELOPMENT
Hugging Face's commitment to open science is central to their mission, aiming to provide models and tools that the community needs. They emphasize releasing checkpoints and codebases (like datatrove for data processing and LightEval for evaluation), and actively solicit feedback. This open approach fosters transparency in pretraining dynamics and allows researchers to build upon their work. Bakouch highlights the desire to explore more complex MoE architectures and distributed training, further democratizing advanced AI research.
Common Questions
Hugging Face's research team explores open-science subjects, including datasets like FineWeb and FinePDF, and model training, focusing on pretraining models such as SmolLM.
Mentioned in this video
Previously discussed with Alessio; related to the vision team behind Idefics and Obelics.
A recent optimizer technique being explored, which uses a Newton-Schulz iteration for stability and better exploration.
Associated with schedule-free optimizer research.
Mentioned in the context of training on code and math with rephrasing at mid-training stages.
Another method mentioned that aims to patch the same issue as QK clip, but QK clip is considered more elegant.
A recently released data set by Hugging Face, which found PDF data to be greatly underexplored and achieved impressive performance when mixed with other web data.
A method used in optimizers to rescale parameters to prevent exploding gradients and attention logits, particularly in early training steps.
A high-quality math data set that could potentially be repeated more times due to its information density.
A model mentioned that can repeat data up to three epochs before diminishing returns.
A company offering a superset of Python for AI/ML development, potentially interesting for model training optimization.
A recent release from NVIDIA, used as a benchmark for comparison with FinePDF.
Mentioned as an example of a trillion-scale model, like GPT-5 and Opus.
Mentioned in relation to Hugo Laurençon and the vision team.
A type of schedule that was compared against schedule-free optimizers, but the latter did not perform as well in testing.
A recent paper from Datology discussing rephrasing and data quality.
Mentioned in relation to Hugo Laurençon and the vision team.
A recent paper discussing new optimizer techniques.
A Hugging Face library for data filtering and tokenization at scale, considered underrated and used for FineWeb and FinePDF.
An MoE model that was initially slower to train than Mixtral without optimizations, but became feasible with new kernels and optimizations.
An example of a smaller model used for specific requests, contrasted with larger ones that activate more experts.
Associated with the 'beyond web' paper and LightEval.
A model mentioned as an example for its training data and MMLU performance, specifically its random performance on QA format before a certain token count.
An evaluation library from Hugging Face, considered a good tool and used by Datology in their BeyondWeb paper.
A codebase developed by Hugging Face for large-scale training.
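One technique referenced above, QK clip, rescales the query and key projections whenever the pre-softmax attention logits exceed a threshold. Below is a minimal single-head sketch of that check-and-rescale step; the threshold value, the even split of the rescale factor between the two weight matrices, and the absence of per-head bookkeeping are simplifying assumptions for illustration.

```python
import numpy as np

def qk_clip(W_q: np.ndarray, W_k: np.ndarray, X: np.ndarray, tau: float = 100.0):
    """Rescale query/key weights when the max attention logit exceeds tau.

    Sketch of the QK-clip idea: compute pre-softmax logits for a batch X,
    and if the largest |logit| exceeds tau, shrink W_q and W_k each by
    sqrt(tau / max) so their product comes back to tau. Single head,
    no masking; tau here is illustrative.
    """
    d = W_q.shape[1]
    Q, K = X @ W_q, X @ W_k
    logits = Q @ K.T / np.sqrt(d)
    s_max = float(np.abs(logits).max())
    if s_max > tau:
        gamma = np.sqrt(tau / s_max)  # split evenly between W_q and W_k
        W_q, W_k = W_q * gamma, W_k * gamma
    return W_q, W_k, min(s_max, tau)
```

Because both matrices are scaled by the square root of the ratio, the clipped logits land exactly at the threshold rather than being hard-truncated, which is part of why this is considered a gentler fix for exploding attention logits in early training.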