Key Moments
⚡ Open Model Pretraining Masterclass — Elie Bakouch, HuggingFace SmolLM 3, FineWeb, FinePDF
Hugging Face expert Elie Bakouch discusses open model pretraining, data quality, MoEs, and new optimizers.
Key Insights
The five pillars of model training are data quality, model architecture, information extraction efficiency, gradient quality, and training stability.
High-quality and diverse data is crucial ('garbage in, garbage out'), and datasets like FineWeb and FinePDF are improving data availability.
New optimizers like Muon and Shampoo are emerging as alternatives to Adam, aiming for better stability and learning efficiency.
Mixture of Experts (MoE) models offer performance gains for the same FLOPs but require careful attention to load balancing and specialization.
Data rephrasing and incorporating conversational/QA formats are key to improving model performance on benchmarks like MMLU.
Open-source efforts are vital for transparency, and Hugging Face aims to build models the community needs, including very small ones and MoE variants.
THE FIVE PILLARS OF MODEL TRAINING
Elie Bakouch outlines a framework for effective model pretraining, centered on five core pillars: optimizing data quality and diversity, designing efficient model architectures, maximizing information extraction from data, ensuring high-quality gradients during optimization, and maintaining training stability at scale. This holistic view emphasizes that advancements are needed across multiple fronts, not just within a single area, to build powerful and reliable language models.
DATA QUALITY AND CURATION: FINEWEB AND FINEPDF
Data quality is paramount, often summarized as 'garbage in, garbage out.' Bakouch highlights the importance of high-quality and diverse datasets. Hugging Face's research efforts contribute significantly here, exemplified by datasets like FineWeb for web data and the more recent FinePDF, which addresses the fact that PDF content has been greatly underexplored. These curated datasets aim to provide better raw material for training, leading to improved model performance, as demonstrated by their competitive results against other leading models.
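To make the curation idea concrete, here is a minimal sketch of two Gopher/C4-style heuristic quality rules of the kind web-curation pipelines like FineWeb build on. The rule choices and thresholds are illustrative assumptions, not FineWeb's published pipeline.

```python
def mean_word_length(text: str) -> float:
    """Average word length; extreme values often indicate non-prose content."""
    words = text.split()
    return sum(len(w) for w in words) / len(words) if words else 0.0

def terminal_punct_ratio(text: str) -> float:
    """Fraction of non-empty lines ending in terminal punctuation."""
    lines = [ln for ln in text.splitlines() if ln.strip()]
    if not lines:
        return 0.0
    return sum(ln.rstrip().endswith((".", "!", "?", '"')) for ln in lines) / len(lines)

def keep_document(text: str) -> bool:
    # Thresholds are illustrative, not any dataset's published values.
    return 3.0 <= mean_word_length(text) <= 10.0 and terminal_punct_ratio(text) >= 0.5
```

Real pipelines stack dozens of such filters with deduplication and model-based quality scoring; the point here is only the shape of a single heuristic.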
ADVANCEMENTS IN OPTIMIZERS AND TRAINING STABILITY
The landscape of optimizers is evolving beyond the standard AdamW. Bakouch discusses new alternatives like Muon and Shampoo, which aim to offer improved learning stability and efficiency by better approximating the Hessian matrix or projecting into data space. Techniques like QK clip and methods to manage gradient divergence are crucial for preventing training explosions, especially at scale. Ensuring proper hyperparameter tuning, including learning rate and weight decay, is critical for realizing the benefits of these advanced optimizers.
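As a concrete sketch of the orthogonalization idea behind Muon-style optimizers, the following applies a Newton-Schulz iteration to a gradient or momentum matrix. The quintic coefficients follow the public Muon reference code, but treat the exact values and step count as assumptions, not a definitive implementation.

```python
import numpy as np

def newton_schulz_orthogonalize(G: np.ndarray, steps: int = 5) -> np.ndarray:
    """Approximately replace G with a matrix whose singular values are near 1.

    Sketch of the Newton-Schulz iteration used by Muon-style optimizers to
    orthogonalize an update matrix without an explicit (expensive) SVD.
    """
    a, b, c = 3.4445, -4.7750, 2.0315  # assumed quintic coefficients
    X = G / (np.linalg.norm(G) + 1e-7)  # normalize by Frobenius norm
    transposed = X.shape[0] > X.shape[1]
    if transposed:
        X = X.T  # keep X @ X.T small: (rows, rows) with rows <= cols
    for _ in range(steps):
        A = X @ X.T
        X = a * X + (b * A + c * A @ A) @ X  # polynomial push of singular values toward 1
    return X.T if transposed else X
```

The iteration replaces each singular value s with a polynomial of s that converges toward 1, giving a near-orthogonal update direction using only matrix multiplies, which is why it maps well onto accelerators.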
MIXTURE OF EXPERTS (MOE) AND SPECIALIZATION
Mixture of Experts (MoE) models represent a significant trend, offering increased performance for a given computational budget (FLOPs). However, their effective implementation hinges on sophisticated load balancing to ensure all experts are utilized. Bakouch explains how computing statistics at the micro-batch level is crucial for proper expert activation, preventing scenarios where certain experts are underutilized. This allows for domain specialization, enabling potential pruning of less relevant experts and further efficiency gains.
THE ROLE OF DATA REPHRASING AND FORMATTING
Curating data is not just about content but also format. Rephrasing and ensuring data is in conversational or question-answering (QA) formats significantly boosts performance on benchmarks like MMLU. While high-quality data is beneficial, rephrasing low-quality data can also yield substantial improvements. There's an ongoing discussion about the trade-off between perfectly clean data and the need for models to handle real-world imperfections like typos, suggesting a balance is necessary for robust, general-purpose models.
OPEN SCIENCE AND COMMUNITY-DRIVEN DEVELOPMENT
Hugging Face's commitment to open science is central to their mission, aiming to provide models and tools that the community needs. They emphasize releasing checkpoints, codebases (like DataPro for data processing and LightEval for evaluation), and actively solicit feedback. This open approach fosters transparency in pre-training dynamics and allows researchers to build upon their work. Bakouch highlights the desire to explore more complex MoE architectures and distributed training, further democratizing advanced AI research.
Mentioned in This Episode
●Products
●Software & Apps
●Tools
●Companies
●Organizations
●Books
●Studies Cited
●Concepts
●People Referenced
Common Questions
Hugging Face's research team explores open science subjects, including data sets like FineWeb and FinePDF, and model training, focusing on pre-training and training models such as SmolLM.
Topics
Mentioned in this video
Another method mentioned that aims to patch the same issue as QK clip, but QK clip is considered more elegant.
A recently released data set by Hugging Face, which found PDF data to be greatly underexplored and achieved impressive performance when mixed with other web data.
A method used in optimizers to rescale parameters to prevent exploding gradients and attention logits, particularly in early training steps.
A model mentioned that can repeat data up to three epochs before diminishing returns.
A recent release from NVIDIA, used as a benchmark for comparison with FinePDF.
A Hugging Face library for data filtering and tokenization at scale, considered underrated and used for FineWeb and FinePDF.
An MoE model that was initially slower to train than Mixtral without optimizations, but became feasible with new kernel and optimizations.
An example of a smaller model used for specific requests, contrasted with larger ones that activate more experts.
A model mentioned as an example for its training data and MMLU performance, specifically its random performance on QA format before a certain token count.
A codebase developed by Hugging Face for large-scale training.
A recent optimizer technique being explored, which uses a Newton-Raphson method for stability and better exploration.
Mentioned in the context of training on code and math with rephrasing at mid-training stages.
A high-quality math data set that could potentially be repeated more times due to its information density.
An evaluation library from Hugging Face, considered a good tool and used by Datology in their BYOL paper.
More from Latent Space
View all 201 summaries
38 minThe Stove Guy: Sam D'Amico Shows New AI Cooking Features on America's Most Powerful Stove at Impulse
55 minMistral: Voxtral TTS, Forge, Leanstral, & Mistral 4 — w/ Pavan Kumar Reddy & Guillaume Lample
36 min🔬There Is No AlphaFold for Materials — AI for Materials Discovery with Heather Kulik
65 minDreamer: the Agent OS for Everyone — David Singleton
Found this useful? Build your knowledge library
Get AI-powered summaries of any YouTube video, podcast, or article in seconds. Save them to your personal pods and access them anytime.
Get Started Free