Better Data is All You Need — Ari Morcos, Datology
Key Moments
Data curation, not just models, is key to AI advancement. Better data yields faster, better, smaller models.
Key Insights
Data curation is the most impactful yet underinvested area in AI, crucial for training faster, better, and smaller models.
The prevailing focus on model architecture and compute scaling overlooks the fundamental principle that 'models are what they eat' – high-quality data is paramount.
Effective data curation involves sophisticated techniques like filtering, rebalancing, sequencing (curriculum learning), and synthetic data generation.
While industry values data work, the research community has historically treated it as a 'second-class citizen,' hindering progress.
Self-supervised learning has enabled massive scaling of data quantity, shifting the focus from data scarcity to managing massive, unlabeled datasets with issues like redundancy and low quality.
Automated data curation is essential, as humans cannot scale to process trillions of tokens, nor are they objectively good at identifying optimal data points due to their inability to consider the entire dataset context.
THE PARADOX OF DATA-DRIVEN AI
Current AI development heavily emphasizes model architecture and compute power, often at the expense of data quality. Ari Morcos argues that this focus is misguided, as "models are what they eat." High-quality, curated data is the true determinant of model performance, enabling models to be trained faster, achieve better results, and even be smaller in parameter count. Datology's mission is to democratize access to state-of-the-art data curation, making sophisticated techniques accessible without requiring specialized expertise.
FROM NEUROSCIENCE TO DATA SCIENCE
Morcos's journey into AI began with a background in neuroscience, where machine learning was initially a tool for analyzing complex neural data. This empirical, science-first approach led him to view deep learning as an empirical science, distinct from more theory-driven computer science disciplines. His early research aimed to understand *why* certain representations or model behaviors were effective, but he found a persistent disconnect between scientific understanding and practical application, a gap that ultimately led him to focus on data.
THE BITTER LESSON OF INDUCTIVE BIASES
Through several key research papers around 2020, Morcos experienced what he termed the "bitter lesson": inductive biases, once thought crucial for model performance, become largely irrelevant at scale. Experiments showed that while soft inductive biases (like initializing a Vision Transformer as a CNN) were helpful in small data regimes, they became detrimental as data size increased beyond a million points. This led to the realization that for large-scale models, the learned posterior from the data distribution is paramount, not architectural priors.
DATA AS THE UNDERINVESTED FRONTIER
Following the bitter lesson, Morcos identified two primary paths forward: improving hardware (GPU performance) or focusing on data. He advocates for data, citing it as dramatically underinvested relative to its impact. Many scaling law studies, for instance, naively assume Independent and Identically Distributed (IID) data, ignoring the well-known principle of "garbage in, garbage out." This underinvestment is partly due to cultural perceptions of data work as 'grunt work' and historical research incentives that treated datasets as fixed entities.
THE SHIFT TO SELF-SUPERVISED LEARNING AND DATA CHALLENGES
The advent of self-supervised learning in 2019 was a paradigm shift, enabling training on unlabeled data at unprecedented scales—trillions of tokens compared to datasets like ImageNet's millions. This move from data scarcity to data abundance created new challenges: redundancy, low quality, and low information gain within massive datasets. Consequently, models now often underfit the data rather than overfitting, making data curation—beyond simple filtering—critical for extracting maximum value and efficiency.
CURATION BEYOND FILTERING
Data curation encompasses more than just removing bad data points. It involves rebalancing datasets, upsampling or downsampling distributions, sequencing data through curricula, intelligent batching, and generating synthetic data. While filtering remains important, these broader curation strategies are essential for optimizing model training. Morcos emphasizes that the value of a data point is not intrinsic but depends on its relation to the entire training set, something humans are ill-equipped to assess at scale.
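The pipeline described above can be sketched in a few lines. This is a toy illustration, not Datology's system: the `quality` score, `domain` field, and `target_share` mixture weights are all hypothetical inputs, and real systems use fuzzy or semantic deduplication rather than exact matching.

```python
import random
from collections import defaultdict

def curate(examples, target_share, quality_threshold=0.5, seed=0):
    """Toy curation pipeline: filter, deduplicate, then rebalance.

    `examples` is a list of dicts with 'text', 'domain', and a
    precomputed 'quality' score in [0, 1] (hypothetical fields).
    `target_share` maps each domain to its desired fraction of the
    curated dataset.
    """
    rng = random.Random(seed)

    # 1. Filtering: drop low-quality points.
    kept = [ex for ex in examples if ex["quality"] >= quality_threshold]

    # 2. Exact deduplication on normalized text (real systems use
    #    fuzzy dedup, e.g. MinHash or embedding clustering).
    seen, unique = set(), []
    for ex in kept:
        key = " ".join(ex["text"].lower().split())
        if key not in seen:
            seen.add(key)
            unique.append(ex)

    # 3. Rebalancing: up/down-sample each domain toward target_share.
    by_domain = defaultdict(list)
    for ex in unique:
        by_domain[ex["domain"]].append(ex)
    total = len(unique)
    curated = []
    for domain, share in target_share.items():
        pool = by_domain.get(domain, [])
        if not pool:
            continue
        n = max(1, round(share * total))
        # Sample with replacement when upsampling past the pool size.
        curated.extend(rng.choices(pool, k=n) if n > len(pool)
                       else rng.sample(pool, n))
    rng.shuffle(curated)
    return curated
```

Note how step 3 captures the point that a data point's value is relational: whether a web document is kept, dropped, or repeated depends on how much web data the rest of the corpus already contains.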
AUTOMATION AND THE LIMITS OF HUMAN EXPERTISE
Studies, like one involving graduate students attempting to predict data classification outcomes, show that even human experts struggle to identify optimal data points for model training. Humans cannot mentally process vast datasets or grasp the complex interdependencies between data points. Therefore, automated data curation systems are not just a matter of scale but a necessity, as automated systems can consider these holistic relationships far more effectively than humans.
SYNTHETIC DATA: REPHRASING VS. DISTILLATION
Synthetic data generation is a key aspect of curation, with two primary approaches. Distillation generates new data largely from the generating model's own knowledge, risking model collapse and mode-centric outputs. The preferred method, 'rephrasing' or rewriting, leverages existing data: the model reformulates it into a more learnable or accessible format. Because the information originates in the source data, this approach can produce models that surpass the generator itself, especially when combined with techniques like rejection sampling and careful filtering.
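A minimal sketch of rephrasing with rejection sampling, under stated assumptions: `rephrase_fn` stands in for an LLM call that rewrites one document, and `score_fn` for a faithfulness/learnability scorer. Both are hypothetical names, not an API from the episode.

```python
def rephrase_corpus(documents, rephrase_fn, score_fn,
                    n_candidates=4, min_score=0.7):
    """Rephrasing-style synthetic data with rejection sampling.

    Unlike distillation, every output is grounded in a source
    document; the model only reformulates it. `rephrase_fn(doc)`
    returns one candidate rewrite, and `score_fn(doc, cand)` returns
    a score in [0, 1] (both hypothetical callables).
    """
    synthetic = []
    for doc in documents:
        candidates = [rephrase_fn(doc) for _ in range(n_candidates)]
        scored = [(score_fn(doc, c), c) for c in candidates]
        best_score, best = max(scored)
        # Rejection sampling: keep only rewrites that stay faithful
        # to the source; otherwise fall back to the original text.
        synthetic.append(best if best_score >= min_score else doc)
    return synthetic
```

The fallback to the original document is the key design choice: the pipeline can only reformulate information that was already present, which is what protects against the mode collapse that pure distillation risks.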
CURRICULUM LEARNING AND TRAINING STAGES
Curriculum learning, once doubted, is regaining prominence. The idea stems from the natural dependency of concepts, making ordered learning more efficient. While previously less critical due to data saturation in supervised learning, it is now vital in the era of underfitting models. Curriculum learning, including discrete curricula and continuous 'mid-training' and post-training phases, allows for more efficient use of compute and can significantly impact model performance and cost-effectiveness. Optimizing pre-training data to enhance post-training effectiveness is a key emerging area.
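The simplest form of the idea is an easy-to-hard ordering. The sketch below assumes a hypothetical `difficulty_fn`; real curricula (and the mid-training schedules mentioned above) interleave and reweight data rather than sorting it strictly.

```python
def curriculum_batches(examples, difficulty_fn, batch_size, warmup_frac=0.3):
    """Toy easy-to-hard curriculum.

    Sorts examples by `difficulty_fn` (hypothetical) and yields
    batches in order, so early training revisits the easiest
    `warmup_frac` of the data before harder points appear.
    """
    ordered = sorted(examples, key=difficulty_fn)
    cutoff = int(len(ordered) * warmup_frac)
    # Phase 1: an extra pass over the easy subset.
    # Phase 2: the full dataset, still in easy-to-hard order.
    schedule = ordered[:cutoff] + ordered
    for i in range(0, len(schedule), batch_size):
        yield schedule[i:i + batch_size]
```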
THE ECONOMICS OF BETTER DATA
Data curation acts as a compute multiplier, enhancing the value of compute resources. While 'train faster' offers immediate cost savings, the primary driver for most is 'train better'—achieving superior performance for the same compute budget. For advanced enterprises, 'train smaller' becomes paramount, drastically reducing inference costs, which dominate the total cost of ownership. Creating highly specialized, smaller models ('inch wide, mile deep') is more economical than using oversized general-purpose models.
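A back-of-envelope calculation makes the 'train smaller' argument concrete. This uses the standard approximations of ~6·N·D FLOPs for training and ~2·N FLOPs per generated token for inference; the dollar rate and token volumes are illustrative assumptions, not figures from the episode.

```python
def lifetime_cost(params_b, train_tokens_t, served_tokens_t,
                  dollars_per_exaflop=0.03):
    """Rough (training_cost, inference_cost) in dollars.

    params_b: model size in billions of parameters.
    train_tokens_t / served_tokens_t: tokens in trillions.
    dollars_per_exaflop: an illustrative, assumed price.
    """
    n = params_b * 1e9
    train_flops = 6 * n * train_tokens_t * 1e12   # ~6*N*D rule
    infer_flops = 2 * n * served_tokens_t * 1e12  # ~2*N per token
    to_dollars = lambda flops: flops / 1e18 * dollars_per_exaflop
    return to_dollars(train_flops), to_dollars(infer_flops)
```

At high serving volumes the inference term dwarfs the training term, and it scales linearly with parameter count, which is why a specialized model an order of magnitude smaller can dominate the total cost of ownership.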
DATA VS. PRUNING AND MODEL SIZE
While model pruning techniques like the lottery ticket hypothesis aimed to reduce model size, their effectiveness is often data-dependent and challenging to implement efficiently, especially with unstructured pruning. Morcos suggests that better data curation, leading to smaller, high-performing models from the outset, is more robust and complementary to other optimization methods like pruning and quantization. The trend suggests that future models will be significantly smaller, perhaps single-digit billions of parameters, optimizing inference cost and specialized task performance.
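For context, unstructured magnitude pruning is simple to state even though it is hard to exploit. This toy sketch over a plain list (not Morcos's method or any library's API) shows why: the surviving weights land at arbitrary positions, so dense hardware gains nothing without dedicated sparse kernels.

```python
def magnitude_prune(weights, sparsity=0.9):
    """Unstructured magnitude pruning on a flat list of weights.

    Zeroes the smallest-magnitude weights so that roughly `sparsity`
    of them become zero; the nonzeros remain scattered.
    """
    flat = sorted(abs(w) for w in weights)
    k = int(len(flat) * sparsity)
    threshold = flat[k] if k < len(flat) else float("inf")
    return [w if abs(w) >= threshold else 0.0 for w in weights]
```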
THE FUTURE: DATA AS THE ULTIMATE MOAT
Datology's long-term vision is to be the best in the world at valuing data relative to downstream use cases, treating data curation as the core competency. This involves automating adaptations to novel data distributions and tailoring curation to specific tasks. While open-source datasets provide foundational value, the true competitive advantage—the 'moat'—lies in proprietary data curation know-how and specialized infrastructure. The company aims to lower the barrier to high-quality data, enabling anyone to train effective models on their first try, moving beyond the limitations of current practices and raw data sources.
Common Questions
What is Datology?
Datology is a company focused on curating data for machine learning. Their mission is to make the data side of ML accessible and efficient, enabling users to train models faster, achieve better performance, and create smaller, more capable models.
Mentioned in this video
Scaling laws research that assumed IID data, which Ari Morcos critiques for overlooking data quality's impact.
A startup focused on data curation for machine learning to help train models faster, better, and smaller.
A data curation project that showed human experts could not predict the filtering criteria of their automated system, highlighting the limitations of human judgment in data curation.
An open data set that is considered similar in quality to DCLM, with more unique tokens but comparable overall data quality.
A controversial dataset that led to lawsuits against companies like Anthropic and Meta, with a court ruling it as fair use if the books were purchased.
A transformer-inspired foundation model that Datology worked on, highlighting the importance of data curation.
A guest on the podcast who previously discussed model pruning and the lottery ticket hypothesis.
Mentioned as a product stemming from Meta's Reality Labs investment, potentially benefiting from AI foundations laid within the metaverse push.
A key advancement in deep reinforcement learning that followed AlexNet, contributing to Ari Morcos's interest in machine learning.
Scaling laws research that assumed IID data, which Ari Morcos argues is problematic and highlights the need for better data handling.
A 4 billion parameter model that the RC model was compared against, showing Datology's curation can lead to faster learning.
A foundational paper for Datology that demonstrated how data quality can bend scaling laws, showing a duality between power-law scaling and marginal information gain decay.
A paper that highlighted the importance of rephrasing for synthetic data, aligning with Datology's earlier work.
An open effort by Ludwig Schmidt and students to curate Common Crawl data, serving as a benchmark for data quality.
A repository for papers, mentioned alongside code and book datasets.
A graduate student who worked with Ari Morcos on the 'Beyond Neural Scaling Laws' paper, proving the duality between power-law scaling and marginal information gain decay.
An individual from Cornell with whom Ari Morcos discussed information theoretical limits of model size.