Best of 2024: Synthetic Data / Smol Models, Loubna Ben Allal, HuggingFace [LS Live! @ NeurIPS 2024]
Key Moments
Synthetic data is increasingly used in LLMs, with a growing focus on small, efficient models for on-device applications.
Key Insights
Synthetic data is now integrated throughout the LLM pipeline, from pre-training to post-training, offering greater control and cost-efficiency than human annotation.
Concerns about model collapse due to synthetic data are being addressed; carefully curated synthetic data, especially when distilling knowledge from large models, does not necessarily degrade performance and can create richer datasets.
The 'textbooks are all you need' paradigm highlights the effectiveness of training smaller LMs on synthetic data, demonstrating that specialized synthetic datasets can yield superior results compared to general web data.
Rephrasing existing web content with LLMs is a successful method for generating large-scale, high-quality synthetic data, improving both content quality and diversity.
Significant advancements in small language models (e.g., 3B parameters) show competitive performance on benchmarks, enabling on-device and edge AI applications with enhanced privacy.
The trend is shifting from solely scaling up models to focusing on efficiency and specialization, with increased emphasis on fine-tuning smaller, tailored models for specific tasks.
SYNTHETIC DATA INTEGRATION ACROSS THE LLM PIPELINE
Synthetic data has become a ubiquitous component in the large language model (LLM) pipeline, evolving from its initial use solely in post-training tasks. Initially, human annotators were essential for teaching models instruction following, helpfulness, and safety. As LLMs improved, synthetic data generated by these models began replacing human annotations. This year, synthetic data has advanced into the pre-training phase, offering greater control over data generation compared to filtering the web. This allows for the creation of custom datasets resembling ideal web pages, enabling entirely synthetic training pipelines from pre-training to instruction tuning and evaluation.
ADDRESSING MODEL COLLAPSE AND DATA QUALITY
Concerns surrounding 'model collapse,' where an over-reliance on synthetic data might degrade model performance, are a significant topic. Studies suggesting web pollution from synthetic data are often based on small-scale, iterative training loops. However, research indicates that when large, capable models are used to distill knowledge into smaller ones, synthetic data can actually enhance performance. Moreover, analysis of web data has shown that the increased presence of synthetic data markers has not led to a decline in model performance, suggesting the web might be enriched rather than polluted. Careful curation and recreating data properly are key to mitigating collapse risks.
THE 'TEXTBOOKS ARE ALL YOU NEED' PARADIGM AND DATA GENERATION
The concept of training models on synthetic textbooks, popularized by papers like 'Textbooks Are All You Need,' has shown that small models trained on high-quality synthetic data can outperform much larger models trained on general web data. Replicating these results, Hugging Face created the Cosmopedia dataset, emphasizing diversity through prompt seeding and incorporating web-page extracts to keep generations on topic. Different generation styles, such as college versus middle school textbooks, affect performance on specific benchmarks, highlighting the nuanced approach required for effective synthetic data generation.
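The prompt-seeding idea described above can be sketched in a few lines: cross web-extract seeds with audience styles so that no two generations share the same prompt. This is a minimal illustration of the technique, not the actual Cosmopedia prompt templates; `build_prompts`, the style list, and the seed texts are all hypothetical.

```python
import itertools

# Hypothetical sketch of Cosmopedia-style prompt seeding: each prompt pairs a
# web-extract "seed" with a target audience/style, so the generator produces
# diverse, on-topic textbook pages instead of repeating itself.

STYLES = ["college textbook", "middle school textbook", "blog post"]

def build_prompts(seed_extracts, styles=STYLES):
    """Cross seed extracts with generation styles to maximize diversity."""
    prompts = []
    for extract, style in itertools.product(seed_extracts, styles):
        prompts.append(
            f"Write a {style} section inspired by the topic of this web "
            f"extract. Stay on topic but do not copy it.\n\nExtract:\n{extract}"
        )
    return prompts

seeds = ["Photosynthesis converts light to chemical energy.",
         "HTTP is a stateless request-response protocol."]
prompts = build_prompts(seeds)
print(len(prompts))  # 2 seeds x 3 styles = 6 distinct prompts
```

Each prompt would then be sent to a strong teacher model; the seed extract keeps the generation grounded in real web topics while the style axis spreads coverage across audiences.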
REPHRASING THE WEB AND ENHANCING DATA QUALITY
Rephrasing existing web content with LLMs has emerged as a powerful technique for generating massive synthetic datasets. Approaches like NVIDIA's Nemotron-CC rewrite Common Crawl pages to improve quality or to create diverse formats like Q&A pairs and Wikipedia-style passages. This method is cost-effective because rewriting requires less capable models than generating entirely new content from scratch. Another method, ProX, uses programs to normalize and clean web pages, though it may be less scalable than direct rephrasing.
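The rewrite step described above can be sketched as a prompt builder: wrap each crawled page in a format-specific rewrite instruction before sending it to an LLM. The format names and templates below are hypothetical, not Nemotron-CC's actual prompts, and the LLM call itself is left out.

```python
# Illustrative sketch of "rephrasing the web": build a rewrite instruction
# for one crawled page. A real pipeline would stream Common Crawl pages
# through this and batch the prompts to a mid-sized rewriting model.

FORMATS = {
    "qa": "Rewrite the page below as a list of question-and-answer pairs.",
    "wiki": "Rewrite the page below as a clear Wikipedia-style passage.",
}

def rephrase_prompt(page_text: str, fmt: str) -> str:
    """Build the rewrite prompt for one crawled page."""
    if fmt not in FORMATS:
        raise ValueError(f"unknown target format: {fmt}")
    return f"{FORMATS[fmt]}\n\nPage:\n{page_text}"

prompt = rephrase_prompt("HTTP is a stateless protocol...", "qa")
print(prompt.splitlines()[0])  # the Q&A rewrite instruction
```

Because the source content already exists, the rewriting model only has to improve form and style, which is why a cheaper model suffices compared to generating textbooks from scratch.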
IMPROVING DATA FILTERING AND CLASS-SPECIFIC DATASETS
Building better classifiers for filtering web data is crucial for pre-training. Hugging Face's FineWeb-Edu dataset, for instance, used Llama 3 to rate pages for educational content, filtering 15 trillion tokens down to 1.5 trillion high-scoring tokens, and significantly outperformed other public datasets. Similar approaches, like the DCLM dataset and Nemotron-CC's ensemble of classifiers, demonstrate the effectiveness of using synthetic annotations to train classifiers that select high-quality, information-dense data for LLM training.
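The filtering loop above reduces to: annotate a sample of pages with LLM quality scores, train a small classifier on those labels, then keep only pages whose predicted score clears a threshold (FineWeb-Edu kept pages scoring 3 or higher on a 0-5 educational scale). The sketch below uses a toy keyword scorer as a stand-in for the trained classifier.

```python
# Minimal sketch of FineWeb-Edu-style corpus filtering. score_fn stands in
# for a classifier trained on LLM-generated 0-5 educational-quality labels.

def filter_corpus(docs, score_fn, threshold=3):
    """Keep documents whose predicted educational score >= threshold."""
    return [d for d in docs if score_fn(d) >= threshold]

docs = ["intro to calculus ...", "buy cheap pills now", "photosynthesis notes"]
edu_words = ("calculus", "photosynthesis")
score = lambda d: 5 if any(w in d for w in edu_words) else 0  # toy classifier
print(filter_corpus(docs, score))  # keeps only the two educational pages
```

The design point is cost: the expensive LLM annotates only a small sample, while the cheap classifier scores all 15 trillion tokens.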
ADVANCEMENTS IN POST-TRAINING DATA AND DIVERSIFICATION
Post-training with synthetic data continues to see innovation, with datasets like Microsoft's Agent Instruct targeting specific skills such as code generation and open-domain Q&A, showing improved performance over general instruct models. The Tulu 3 SFT mixture utilizes diverse personas from PersonaHub to generate varied instruction data. Hugging Face's SmolTalk dataset also aims for broad task coverage. Cohere's 'Multilingual Data Arbitrage' showcases using multiple teacher models and a router with a reward model to create diverse, high-quality multilingual instruction datasets, demonstrating that pooling completions from several strong models yields superior results.
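The arbitrage idea can be sketched as a small routing function: collect completions for a prompt from every teacher model, score them with a reward model, and keep the best one. The teacher and reward functions below are toy stand-ins for real model calls, used only to show the selection logic.

```python
# Sketch of data-arbitrage routing: pool completions from several teacher
# models and let a reward model pick the winner per prompt. teachers and
# reward_fn are stand-ins for real LLM and reward-model calls.

def arbitrage(prompt, teachers, reward_fn):
    """Return the highest-reward completion across all teacher models."""
    completions = [teacher(prompt) for teacher in teachers]
    return max(completions, key=reward_fn)

teachers = [lambda p: p + " short answer", lambda p: p + " detailed answer"]
reward = len  # toy reward model: prefer longer completions
print(arbitrage("Q:", teachers, reward))  # "Q: detailed answer"
```

The benefit described in the episode follows directly: no single teacher is best across all languages or skills, so per-prompt selection outperforms distilling from any one model.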
TOWARDS EFFICIENT AND SMALLER MODELS
The landscape of LLMs in 2024 is marked by significant progress in small, on-device models. Models like Llama 3.2 1B now rival larger previous-generation models on the LMSYS Chatbot Arena. Furthermore, 3B and 4B parameter models are achieving high scores on benchmarks like MMLU, challenging the notion that only massive models are effective. This efficiency is paving the way for applications that run directly on consumer hardware such as smartphones, enhancing privacy while reducing inference cost and latency, and moving the field away from a sole focus on scaling up model size.
ON-DEVICE AI AND SPECIALIZATION TRENDS
The capability to run sophisticated LLMs on devices like iPhones is becoming a reality, with apps like 'PocketPal' allowing users to chat with models directly. This trend enhances privacy as data remains local. The focus is shifting towards building more efficient models that can match the performance of larger counterparts, unlocking use cases for on-device AI. Training smaller models for longer durations, as seen with Llama 3's extensive token count, is proving more effective than simply increasing model size. Specializing smaller models through fine-tuning for specific tasks like text extraction is also demonstrating competitive performance against larger, generalist models.
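The "train smaller models for longer" point can be made concrete with a back-of-envelope calculation: Chinchilla-style scaling suggests roughly 20 training tokens per parameter, while a Llama-3-style run pushes far past that. The figures below are the commonly cited round numbers, used only for illustration.

```python
# Back-of-envelope arithmetic for over-training small models past the
# Chinchilla "compute-optimal" point.

def tokens_per_param(tokens, params):
    """Training tokens seen per model parameter."""
    return tokens / params

chinchilla_ratio = 20                      # ~20 tokens/param, compute-optimal
llama3_8b = tokens_per_param(15e12, 8e9)   # 15T tokens on an 8B model
print(llama3_8b)                           # 1875.0 tokens per parameter
print(llama3_8b / chinchilla_ratio)        # ~94x past compute-optimal
```

Over-training spends more compute at training time to get a smaller, cheaper model at inference time, which is exactly the trade-off on-device deployment wants.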
FUTURE HORIZONS: DOMAIN-SPECIFIC DATA AND FINE-TUNING
Looking ahead, the importance of domain-specific synthetic data, particularly for complex reasoning tasks like mathematics, will continue to grow. Similarly, specializing smaller models through fine-tuning will become increasingly vital. Companies can achieve high performance on specific tasks with smaller, more cost-effective models, avoiding the high costs associated with larger, general-purpose models. This trend extends beyond text to other modalities like vision and audio. The development of on-device frameworks and applications is expected to accelerate, making AI more accessible and private.
THE CYCLICAL NATURE OF AI DEVELOPMENT: FINE-TUNING'S RETURN
The evolution of AI development often follows cyclical patterns. After an initial phase focused on fine-tuning foundational models like BERT, the trend shifted towards prompt engineering with much larger models. However, as the cost and complexity of using massive models become apparent, there's a resurgence in fine-tuning. This approach allows for specialization and efficiency, making it more practical to use smaller, fine-tuned models for specific tasks. This return to fine-tuning suggests a more balanced and sustainable path for AI development, focusing on tailored solutions rather than just brute-force scaling.
Textbook Generation Style Performance Comparison
Data extracted from this episode
| Generation Style | Benchmark Performance |
|---|---|
| College textbooks | Strong performance on MMLU |
| Middle school textbooks | Strong performance on Open Book QA and PiQA |
Common Questions
How has the use of synthetic data in LLMs evolved?
Synthetic data was initially used in post-training to teach instruction following and safety. In 2024 it is integrated throughout the pipeline, including pre-training, where it gives teams control over data generation and can replace parts of the web corpus.
Mentioned in this video
Mentioned as a source of synthetic data, with its release correlating to an increase in proxy words indicative of AI generation.
A series of best-in-class models, including a 1.7B parameter model that outperforms Llama 1B and Qwen 2.5; trained on 11 trillion tokens.
A dataset mentioned in the context of generating domain-specific synthetic data for math to improve model reasoning.
A dataset used to train a classifier for the DCLM dataset, focusing on instruction tuning.
A recent high-quality instruction tuning dataset that uses personas from Persona Hub to ensure diversity.
A dataset released by the speaker that covers a wide range of tasks and improves performance on benchmarks like mathematics.
An iPhone app that allows users to chat with various small models from Hugging Face, demonstrating on-device LLM capabilities.
A 0.5B parameter vision-language model that shows minimal trade-off compared to its 2B counterpart.
A framework for on-device inference of small models.
A framework enabling on-device inference for small models.
A model that had lower scores on MMLU compared to newer 3B models.
Previous version of Meta's models, trained on 1 trillion tokens; contrasted with LLaMA 3's longer training.
A dataset created by filtering FineWeb for highly educational content using LLaMA 3 for annotations and a classifier.
A Microsoft dataset designed to improve specific model skills like code reasoning and open-domain QA, outperforming original instruct models.
A small model that matches the performance of Llama 2 13B on the LMSYS Chatbot Arena.
A leaderboard for evaluating models using human preference votes, where Llama 3.2 1B matches Llama 2 13B.
SmolVLM is a 2B parameter vision-language model known for its efficiency and good performance.
One of several frameworks enabling on-device inference for small models.
A JavaScript library that facilitates on-device inference for small models in the browser.
A model from the previous year whose performance is matched by Llama 3.2 1B.
A model mentioned in the context of high-performing small models, with its blog post showing 3B and 4B parameter models scoring high on MMLU.
A smaller model outperformed by SmolLM2's 1.7B version.
A dataset from which samples were rewritten in the Pratus paper to improve format and quality.
A model where Meta increased pre-training length to 15 trillion tokens, achieving better performance for the same size compared to LLaMA.
A model that was fine-tuned on the Agent Instruct dataset, showing improved performance.
A dataset containing over a million personas, used to ensure diversity in the Tulu 3 SFT mixture dataset.
The family of models is used as an example of scaling laws, showing performance improvements with increased parameter count.
Features both server-side and on-device models (3B parameters), trained using pruning and distillation, with a focus on privacy.
A model over which SmolLM2's 1.7B version shows superior performance.
A paper that suggested rewriting samples from C4 datasets into better formats like Wikipedia passages or Q&A pages using LLMs.
A paper by Meta that studies models under 1 billion parameters, finding depth is more important than width and that GQA helps.
A dataset created by training a classifier on OpenHermes data for high-quality, information-dense LLM training.
Implemented an ensemble of classifiers, including DCLM and FineWeb-Edu classifiers, to create a high-quality dataset.
A paper by Cohere that addresses multilingual dataset generation by using multiple teacher models and a reward model to select the best completions.
A startup that fine-tuned SmolLM for text extraction, achieving performance close to much larger models.
An older model mentioned as an example of early fine-tuning efforts before larger models led to a shift towards prompt engineering.
Mentioned for their 'Textbooks Are All You Need' paper and the Agent Instruct dataset.
Authored the 'Multilingual Data Arbitrage' paper.
Mentioned for their Nemotron-CC paper, which generated 1.9 trillion tokens by rephrasing web content.
Used as an example for structured generation, demonstrating how to extract key information from issues into a properly formatted ticket.