Better Data is All You Need — Ari Morcos, Datology
Key Moments
Data curation, not just models, is key to AI advancement. Better data yields faster, better, smaller models.
Key Insights
Data curation is the most impactful yet underinvested area in AI, crucial for training faster, better, and smaller models.
The prevailing focus on model architecture and compute scaling overlooks the fundamental principle that 'models are what they eat' – high-quality data is paramount.
Effective data curation involves sophisticated techniques like filtering, rebalancing, sequencing (curriculum learning), and synthetic data generation.
While industry values data work, the research community has historically treated it as a 'second-class citizen,' hindering progress.
Self-supervised learning has enabled massive scaling of data quantity, shifting the focus from data scarcity to managing massive, unlabeled datasets with issues like redundancy and low quality.
Automated data curation is essential, as humans cannot scale to process trillions of tokens, nor are they objectively good at identifying optimal data points due to their inability to consider the entire dataset context.
THE PARADOX OF DATA-DRIVEN AI
Current AI development heavily emphasizes model architecture and compute power, often at the expense of data quality. Ari Morcos argues that this focus is misguided, as "models are what they eat." High-quality, curated data is the true determinant of model performance, enabling models to be trained faster, achieve better results, and even be smaller in parameter count. Datology's mission is to democratize access to state-of-the-art data curation, making sophisticated techniques accessible without requiring specialized expertise.
FROM NEUROSCIENCE TO DATA SCIENCE
Morcos's journey into AI began with a background in neuroscience, where machine learning was initially a tool for analyzing complex neural data. This empirical, science-first approach led him to view deep learning as an empirical science, distinct from more theory-driven computer science disciplines. His early research aimed to understand *why* certain representations or model behaviors were effective, but he found a persistent disconnect between scientific understanding and practical application, a gap that ultimately led him to focus on data.
THE BITTER LESSON OF INDUCTIVE BIASES
Through several key research papers around 2020, Morcos experienced what he termed the "bitter lesson": inductive biases, once thought crucial for model performance, become largely irrelevant at scale. Experiments showed that while soft inductive biases (like initializing a Vision Transformer as a CNN) were helpful in small data regimes, they became detrimental as data size increased beyond a million points. This led to the realization that for large-scale models, the learned posterior from the data distribution is paramount, not architectural priors.
DATA AS THE UNDERINVESTED FRONTIER
Following the bitter lesson, Morcos identified two primary paths forward: improving hardware (GPU performance) or focusing on data. He advocates for data, citing it as dramatically underinvested relative to its impact. Many scaling law studies, for instance, naively assume Independent and Identically Distributed (IID) data, ignoring the well-known principle of "garbage in, garbage out." This underinvestment is partly due to cultural perceptions of data work as 'grunt work' and historical research incentives that treated datasets as fixed entities.
THE SHIFT TO SELF-SUPERVISED LEARNING AND DATA CHALLENGES
The advent of self-supervised learning in 2019 was a paradigm shift, enabling training on unlabeled data at unprecedented scales—trillions of tokens compared to datasets like ImageNet's millions. This move from data scarcity to data abundance created new challenges: redundancy, low quality, and low information gain within massive datasets. Consequently, models now often underfit the data rather than overfitting, making data curation—beyond simple filtering—critical for extracting maximum value and efficiency.
CURATION BEYOND FILTERING
Data curation encompasses more than just removing bad data points. It involves rebalancing datasets, upsampling or downsampling distributions, sequencing data through curricula, intelligent batching, and generating synthetic data. While filtering remains important, these broader curation strategies are essential for optimizing model training. Morcos emphasizes that the value of a data point is not intrinsic but depends on its relation to the entire training set, something humans are ill-equipped to assess at scale.
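The pipeline described above can be sketched in a few lines. This is a toy illustration, not Datology's system: the `quality` score, `domain` field, and `target_share` mixture weights are all hypothetical inputs, and real systems use fuzzy or semantic deduplication rather than exact matching.

```python
import random
from collections import defaultdict

def curate(examples, target_share, quality_threshold=0.5, seed=0):
    """Toy curation pipeline: filter, deduplicate, then rebalance.

    `examples` is a list of dicts with 'text', 'domain', and a
    precomputed 'quality' score in [0, 1] (hypothetical fields).
    `target_share` maps each domain to its desired fraction of the
    curated dataset.
    """
    rng = random.Random(seed)

    # 1. Filtering: drop low-quality points.
    kept = [ex for ex in examples if ex["quality"] >= quality_threshold]

    # 2. Exact deduplication on normalized text (real systems use
    #    fuzzy dedup, e.g. MinHash or embedding clustering).
    seen, unique = set(), []
    for ex in kept:
        key = " ".join(ex["text"].lower().split())
        if key not in seen:
            seen.add(key)
            unique.append(ex)

    # 3. Rebalancing: up/down-sample each domain toward target_share.
    by_domain = defaultdict(list)
    for ex in unique:
        by_domain[ex["domain"]].append(ex)
    total = len(unique)
    curated = []
    for domain, share in target_share.items():
        pool = by_domain.get(domain, [])
        if not pool:
            continue
        n = max(1, round(share * total))
        # Sample with replacement when upsampling past the pool size.
        curated.extend(rng.choices(pool, k=n) if n > len(pool)
                       else rng.sample(pool, n))
    rng.shuffle(curated)
    return curated
```

Note how step 3 captures the point that a data point's value is relational: whether a web document is kept, dropped, or repeated depends on how much web data the rest of the corpus already contains.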
AUTOMATION AND THE LIMITS OF HUMAN EXPERTISE
Studies, like one involving graduate students attempting to predict data classification outcomes, show that even human experts struggle to identify optimal data points for model training. Humans cannot mentally process vast datasets or grasp the complex interdependencies between data points. Therefore, automated data curation systems are not just a matter of scale but a necessity, as automated systems can consider these holistic relationships far more effectively than humans.
SYNTHETIC DATA: REPHRASING VS. DISTILLATION
Synthetic data generation is a key aspect of curation, with two primary approaches. Distillation generates new data largely from the generating model's own knowledge, risking model collapse and mode-centric outputs. The preferred method, 'rephrasing' or rewriting, leverages existing data: the model reformulates it into a more learnable or accessible format. Because the information originates in the source data, this approach can produce models that surpass the generator itself, especially when combined with techniques like rejection sampling and careful filtering.
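A minimal sketch of rephrasing with rejection sampling, under stated assumptions: `rephrase_fn` stands in for an LLM call that rewrites one document, and `score_fn` for a faithfulness/learnability scorer. Both are hypothetical names, not an API from the episode.

```python
def rephrase_corpus(documents, rephrase_fn, score_fn,
                    n_candidates=4, min_score=0.7):
    """Rephrasing-style synthetic data with rejection sampling.

    Unlike distillation, every output is grounded in a source
    document; the model only reformulates it. `rephrase_fn(doc)`
    returns one candidate rewrite, and `score_fn(doc, cand)` returns
    a score in [0, 1] (both hypothetical callables).
    """
    synthetic = []
    for doc in documents:
        candidates = [rephrase_fn(doc) for _ in range(n_candidates)]
        scored = [(score_fn(doc, c), c) for c in candidates]
        best_score, best = max(scored)
        # Rejection sampling: keep only rewrites that stay faithful
        # to the source; otherwise fall back to the original text.
        synthetic.append(best if best_score >= min_score else doc)
    return synthetic
```

The fallback to the original document is the key design choice: the pipeline can only reformulate information that was already present, which is what protects against the mode collapse that pure distillation risks.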
CURRICULUM LEARNING AND TRAINING STAGES
Curriculum learning, once doubted, is regaining prominence. The idea stems from the natural dependency of concepts, making ordered learning more efficient. While previously less critical due to data saturation in supervised learning, it is now vital in the era of underfitting models. Curriculum learning, including discrete curricula and continuous 'mid-training' and post-training phases, allows for more efficient use of compute and can significantly impact model performance and cost-effectiveness. Optimizing pre-training data to enhance post-training effectiveness is a key emerging area.
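The simplest form of the idea is an easy-to-hard ordering. The sketch below assumes a hypothetical `difficulty_fn`; real curricula (and the mid-training schedules mentioned above) interleave and reweight data rather than sorting it strictly.

```python
def curriculum_batches(examples, difficulty_fn, batch_size, warmup_frac=0.3):
    """Toy easy-to-hard curriculum.

    Sorts examples by `difficulty_fn` (hypothetical) and yields
    batches in order, so early training revisits the easiest
    `warmup_frac` of the data before harder points appear.
    """
    ordered = sorted(examples, key=difficulty_fn)
    cutoff = int(len(ordered) * warmup_frac)
    # Phase 1: an extra pass over the easy subset.
    # Phase 2: the full dataset, still in easy-to-hard order.
    schedule = ordered[:cutoff] + ordered
    for i in range(0, len(schedule), batch_size):
        yield schedule[i:i + batch_size]
```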
THE ECONOMICS OF BETTER DATA
Data curation acts as a compute multiplier, enhancing the value of compute resources. While 'train faster' offers immediate cost savings, the primary driver for most is 'train better'—achieving superior performance for the same compute budget. For advanced enterprises, 'train smaller' becomes paramount, drastically reducing inference costs, which dominate the total cost of ownership. Creating highly specialized, smaller models ('inch wide, mile deep') is more economical than using oversized general-purpose models.
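A back-of-envelope calculation makes the 'train smaller' argument concrete. This uses the standard approximations of ~6·N·D FLOPs for training and ~2·N FLOPs per generated token for inference; the dollar rate and token volumes are illustrative assumptions, not figures from the episode.

```python
def lifetime_cost(params_b, train_tokens_t, served_tokens_t,
                  dollars_per_exaflop=0.03):
    """Rough (training_cost, inference_cost) in dollars.

    params_b: model size in billions of parameters.
    train_tokens_t / served_tokens_t: tokens in trillions.
    dollars_per_exaflop: an illustrative, assumed price.
    """
    n = params_b * 1e9
    train_flops = 6 * n * train_tokens_t * 1e12   # ~6*N*D rule
    infer_flops = 2 * n * served_tokens_t * 1e12  # ~2*N per token
    to_dollars = lambda flops: flops / 1e18 * dollars_per_exaflop
    return to_dollars(train_flops), to_dollars(infer_flops)
```

At high serving volumes the inference term dwarfs the training term, and it scales linearly with parameter count, which is why a specialized model an order of magnitude smaller can dominate the total cost of ownership.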
DATA VS. PRUNING AND MODEL SIZE
While model pruning techniques like the lottery ticket hypothesis aimed to reduce model size, their effectiveness is often data-dependent and challenging to implement efficiently, especially with unstructured pruning. Morcos suggests that better data curation, leading to smaller, high-performing models from the outset, is more robust and complementary to other optimization methods like pruning and quantization. The trend suggests that future models will be significantly smaller, perhaps single-digit billions of parameters, optimizing inference cost and specialized task performance.
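For context, unstructured magnitude pruning is simple to state even though it is hard to exploit. This toy sketch over a plain list (not Morcos's method or any library's API) shows why: the surviving weights land at arbitrary positions, so dense hardware gains nothing without dedicated sparse kernels.

```python
def magnitude_prune(weights, sparsity=0.9):
    """Unstructured magnitude pruning on a flat list of weights.

    Zeroes the smallest-magnitude weights so that roughly `sparsity`
    of them become zero; the nonzeros remain scattered.
    """
    flat = sorted(abs(w) for w in weights)
    k = int(len(flat) * sparsity)
    threshold = flat[k] if k < len(flat) else float("inf")
    return [w if abs(w) >= threshold else 0.0 for w in weights]
```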
THE FUTURE: DATA AS THE ULTIMATE MOAT
Datology's long-term vision is to be the best in the world at valuing data relative to downstream use cases, treating data curation as the core competency. This involves automating adaptations to novel data distributions and tailoring curation to specific tasks. While open-source datasets provide foundational value, the true competitive advantage—the 'moat'—lies in proprietary data curation know-how and specialized infrastructure. The company aims to lower the barrier to high-quality data, enabling anyone to train effective models on their first try, moving beyond the limitations of current practices and raw data sources.
Common Questions
What is Datology?
Datology is a company focused on curating data for machine learning. Their mission is to make the data side of ML accessible and efficient, enabling users to train models faster, achieve better performance, and create smaller, more capable models.
Mentioned in this video
Scaling laws research that assumed IID data, which Ari Morcos critiques for overlooking data quality's impact.
A startup focused on data curation for machine learning to help train models faster, better, and smaller.
A data curation project that showed human experts could not predict the filtering criteria of their automated system, highlighting the limitations of human judgment in data curation.
An open data set that is considered similar in quality to DCLM, with more unique tokens but comparable overall data quality.
A controversial dataset that led to lawsuits against companies like Anthropic and Meta, with a court ruling it as fair use if the books were purchased.
A transformer-inspired foundation model that Datology worked on, highlighting the importance of data curation.
A guest on the podcast who previously discussed model pruning and the lottery ticket hypothesis.
Mentioned as a product stemming from Meta's Reality Labs investment, potentially benefiting from AI foundations laid within the metaverse push.
A key advancement in deep reinforcement learning that followed AlexNet, contributing to Ari Morcos's interest in machine learning.
Scaling laws research that assumed IID data, which Ari Morcos argues is problematic and highlights the need for better data handling.
A 4 billion parameter model that the RC model was compared against, showing Datology's curation can lead to faster learning.
A foundational paper for Datology that demonstrated how data quality can bend scaling laws, showing a duality between power-law scaling and marginal information gain decay.
A paper that highlighted the importance of rephrasing for synthetic data, aligning with Datology's earlier work.
An open effort by Ludwig Schmidt and students to curate Common Crawl data, serving as a benchmark for data quality.
A repository for papers, mentioned alongside code and book datasets.
A graduate student who worked with Ari Morcos on the 'Beyond Neural Scaling Laws' paper, proving the duality between power-law scaling and marginal information gain decay.
An individual from Cornell with whom Ari Morcos discussed information theoretical limits of model size.