⚡ Open Model Pretraining Masterclass — Elie Bakouch, HuggingFace SmolLM 3, FineWeb, FinePDF
Key Moments
Hugging Face expert Elie Bakouch discusses open model pretraining, data quality, MoEs, and new optimizers.
Key Insights
The five pillars of model training are data quality, model architecture, information extraction efficiency, gradient quality, and training stability.
High-quality and diverse data is crucial ('garbage in, garbage out'), and datasets like FineWeb and FinePDF are improving data availability.
New optimizers like Muon and Shampoo are emerging as alternatives to Adam, aiming for better stability and learning efficiency.
Mixture of Experts (MoE) models offer performance gains for the same FLOPs but require careful attention to load balancing and specialization.
Rephrasing data and casting it into conversational/QA formats are key to improving model performance on benchmarks like MMLU.
Open-source efforts are vital for transparency, and Hugging Face aims to build models the community needs, including very small ones and MoE variants.
THE FIVE PILLARS OF MODEL TRAINING
Elie Bakouch outlines a framework for effective model pretraining centered on five core pillars: optimizing data quality and diversity, designing efficient model architectures, maximizing information extraction from data, ensuring high-quality gradients during optimization, and maintaining training stability at scale. This holistic view emphasizes that advances are needed across multiple fronts, not just within a single area, to build powerful and reliable language models.
DATA QUALITY AND CURATION: FINEWEB AND FINEPDF
Data quality is paramount: 'garbage in, garbage out.' Bakouch highlights the importance of high-quality, diverse datasets. Hugging Face's research efforts contribute significantly here, exemplified by FineWeb for web data and the more recent FinePDF, which addresses the under-exploration of PDF content. These curated datasets aim to provide better raw material for training, leading to improved model performance, as demonstrated by their competitive results against other leading models.
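Web-scale curation pipelines of this kind typically start with cheap heuristic filters before any model-based scoring. The sketch below illustrates that flavor of filtering; the specific thresholds and heuristics are illustrative inventions for this example, not the actual FineWeb or FinePDF rules.

```python
def heuristic_quality_score(text: str) -> float:
    """Score a document on crude quality heuristics (0 = worst, 1 = best).

    Illustrative only: real pipelines combine many such signals with
    deduplication and model-based classifiers.
    """
    words = text.split()
    if not words:
        return 0.0
    # Very short documents carry little training signal.
    length_ok = 1.0 if len(words) >= 50 else len(words) / 50
    lines = [ln for ln in text.splitlines() if ln.strip()]
    # A healthy fraction of lines should end in terminal punctuation.
    punct_frac = sum(ln.rstrip().endswith(('.', '!', '?')) for ln in lines) / max(len(lines), 1)
    # Penalize boilerplate-heavy pages (many repeated lines).
    unique_frac = len(set(lines)) / max(len(lines), 1)
    return length_ok * (0.5 + 0.5 * punct_frac) * unique_frac
```

A document of repeated link-bait text scores far lower than a paragraph of complete sentences, which is the basic behavior these filter stacks rely on.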
ADVANCEMENTS IN OPTIMIZERS AND TRAINING STABILITY
The landscape of optimizers is evolving beyond the standard AdamW. Bakouch discusses new alternatives like Muon and Shampoo, which aim to offer improved learning stability and efficiency by better approximating the Hessian matrix or projecting into data space. Techniques like QK clip and methods to manage gradient divergence are crucial for preventing training explosions, especially at scale. Ensuring proper hyperparameter tuning, including learning rate and weight decay, is critical for realizing the benefits of these advanced optimizers.
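For concreteness, the core operation in Muon is an approximate orthogonalization of the gradient/momentum matrix via a quintic Newton-Schulz iteration. Below is a minimal numpy sketch of just that step; the coefficients follow the public Muon reference implementation, while momentum, learning rate, and weight decay handling are omitted.

```python
import numpy as np

def newton_schulz_orthogonalize(G: np.ndarray, steps: int = 5) -> np.ndarray:
    """Approximately replace G by U V^T from its SVD (Muon's core step).

    Sketch only: the quintic iteration drives all singular values toward 1,
    so the update direction treats every direction in parameter space
    roughly equally instead of being dominated by a few large ones.
    """
    a, b, c = 3.4445, -4.7750, 2.0315  # coefficients from the Muon reference code
    X = G / (np.linalg.norm(G) + 1e-7)  # normalize so the iteration converges
    transposed = X.shape[0] > X.shape[1]
    if transposed:  # iterate on the wide orientation for the smaller Gram matrix
        X = X.T
    for _ in range(steps):
        A = X @ X.T
        X = a * X + (b * A + c * A @ A) @ X
    return X.T if transposed else X

rng = np.random.default_rng(0)
W = newton_schulz_orthogonalize(rng.normal(size=(16, 32)))
```

After five iterations the singular values of the result cluster near 1 (the tuned coefficients trade exact convergence for speed, so they land in a band around 1 rather than exactly on it).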
MIXTURE OF EXPERTS (MOE) AND SPECIALIZATION
Mixture of Experts (MoE) models represent a significant trend, offering increased performance for a given computational budget (FLOPs). However, their effective implementation hinges on sophisticated load balancing to ensure all experts are utilized. Bakouch explains how computing statistics at the micro-batch level is crucial for proper expert activation, preventing scenarios where certain experts are underutilized. This allows for domain specialization, enabling potential pruning of less relevant experts and further efficiency gains.
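A standard way to enforce the balanced expert utilization described above is a Switch-Transformer-style auxiliary loss computed per micro-batch. A hedged numpy sketch with top-1 routing follows; the exact loss and routing used in any particular model may differ.

```python
import numpy as np

def softmax(x: np.ndarray, axis: int = -1) -> np.ndarray:
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def load_balancing_loss(router_logits: np.ndarray) -> float:
    """Auxiliary load-balancing loss over one micro-batch.

    router_logits: (tokens, n_experts). Returns
    n_experts * sum_i f_i * P_i, where f_i is the fraction of tokens
    routed (top-1) to expert i and P_i is the mean router probability
    for expert i. The minimum value 1.0 is reached at uniform routing;
    collapse onto one expert pushes it toward n_experts.
    """
    n_tokens, n_experts = router_logits.shape
    probs = softmax(router_logits, axis=-1)
    top1 = probs.argmax(axis=-1)
    f = np.bincount(top1, minlength=n_experts) / n_tokens
    P = probs.mean(axis=0)
    return float(n_experts * np.dot(f, P))
```

Because `f` and `P` are computed over the micro-batch, the statistics Bakouch mentions are exactly what this loss aggregates: if one expert captures every token, both `f` and `P` concentrate on it and the loss grows.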
THE ROLE OF DATA REPHRASING AND FORMATTING
Curating data is not just about content but also format. Rephrasing and ensuring data is in conversational or question-answering (QA) formats significantly boosts performance on benchmarks like MMLU. While high-quality data is beneficial, rephrasing low-quality data can also yield substantial improvements. There's an ongoing discussion about the trade-off between perfectly clean data and the need for models to handle real-world imperfections like typos, suggesting a balance is necessary for robust, general-purpose models.
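In practice the rephrasing itself is done with an LLM (as in the rephrasing work the episode references); the tiny sketch below only illustrates the *target* QA shape that helps on MMLU-style benchmarks. The template and function name are inventions for this example.

```python
def to_qa_format(passage: str, topic: str) -> str:
    """Wrap a raw passage in a question-answer template.

    Illustrative only: real pipelines prompt a model to generate a natural
    question and a rephrased answer, rather than templating verbatim text.
    """
    question = f"What does the following say about {topic}?"
    return f"Question: {question}\nAnswer: {passage.strip()}"
```

The point of the format, rather than the template, is what matters: a model that has only ever seen raw prose can score at random on QA-formatted benchmarks until it has consumed enough QA-shaped tokens.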
OPEN SCIENCE AND COMMUNITY-DRIVEN DEVELOPMENT
Hugging Face's commitment to open science is central to their mission, aiming to provide models and tools that the community needs. They emphasize releasing checkpoints and codebases (like datatrove for data processing and LightEval for evaluation), and actively solicit feedback. This open approach fosters transparency in pretraining dynamics and allows researchers to build upon their work. Bakouch highlights the desire to explore more complex MoE architectures and distributed training, further democratizing advanced AI research.
Common Questions
Hugging Face's research team explores open-science subjects, including datasets like FineWeb and FinePDF, and model training, focusing on pretraining models such as SmolLM.
Mentioned in this video
Previously discussed with Alessio; related to the vision team behind Idefics and Obelics.
A recent optimizer technique being explored, which uses a Newton-Schulz iteration for stability and better exploration.
Associated with schedule-free optimizer research.
Mentioned in the context of training on code and math with rephrasing at mid-training stages.
Another method mentioned that aims to patch the same issue as QK clip, but QK clip is considered more elegant.
A recently released data set by Hugging Face, which found PDF data to be greatly underexplored and achieved impressive performance when mixed with other web data.
A method used in optimizers to rescale parameters to prevent exploding gradients and attention logits, particularly in early training steps.
A high-quality math data set that could potentially be repeated more times due to its information density.
A model mentioned that can repeat data up to three epochs before diminishing returns.
A company offering a superset of Python for AI/ML development, potentially interesting for model training optimization.
A recent release from NVIDIA, used as a benchmark for comparison with FinePDF.
Mentioned as an example of a trillion-scale model, like GPT-5 and Opus.
Mentioned in relation to Hugo Laurençon and the vision team.
A type of schedule that was compared against schedule-free optimizers, but the latter did not perform as well in testing.
A recent paper from Datology discussing rephrasing and data quality.
Mentioned in relation to Hugo Laurençon and the vision team.
A recent paper discussing new optimizer techniques.
A Hugging Face library for data filtering and tokenization at scale, considered underrated and used for FineWeb and FinePDF.
An MoE model that was initially slower to train than Mixtral without optimizations, but became feasible with new kernels and optimizations.
An example of a smaller model used for specific requests, contrasted with larger ones that activate more experts.
Associated with the 'beyond web' paper and LightEval.
A model mentioned as an example for its training data and MMLU performance, specifically its random performance on QA format before a certain token count.
An evaluation library from Hugging Face, considered a good tool and used by Datology in their BeyondWeb paper.
A codebase developed by Hugging Face for large-scale training.
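One technique referenced above, QK clip, rescales the query and key projections whenever the pre-softmax attention logits exceed a threshold. Below is a minimal single-head sketch of that check-and-rescale step; the threshold value, the even split of the rescale factor between the two weight matrices, and the absence of per-head bookkeeping are simplifying assumptions for illustration.

```python
import numpy as np

def qk_clip(W_q: np.ndarray, W_k: np.ndarray, X: np.ndarray, tau: float = 100.0):
    """Rescale query/key weights when the max attention logit exceeds tau.

    Sketch of the QK-clip idea: compute pre-softmax logits for a batch X,
    and if the largest |logit| exceeds tau, shrink W_q and W_k each by
    sqrt(tau / max) so their product comes back to tau. Single head,
    no masking; tau here is illustrative.
    """
    d = W_q.shape[1]
    Q, K = X @ W_q, X @ W_k
    logits = Q @ K.T / np.sqrt(d)
    s_max = float(np.abs(logits).max())
    if s_max > tau:
        gamma = np.sqrt(tau / s_max)  # split evenly between W_q and W_k
        W_q, W_k = W_q * gamma, W_k * gamma
    return W_q, W_k, min(s_max, tau)
```

Because both matrices are scaled by the square root of the ratio, the clipped logits land exactly at the threshold rather than being hard-truncated, which is part of why this is considered a gentler fix for exploding attention logits in early training.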