Key Moments
🔬 The Bitter Lesson is Coming for Proteins - Alex Rives, BioHub
Language models trained on billions of protein sequences can predict protein structure and function and design new antibodies, but doing so requires massive datasets and compute.
Key Insights
ESMC, trained on 6.8 billion non-redundant protein sequences, predicted structures for 1.1 billion of them, revealing a comprehensive picture of protein structure and function.
By adding metagenomic sequences to UniRef, ESMC escaped the diminishing returns observed in ESM2, indicating that data, not just compute, was the key limitation.
Sparse autoencoders revealed a hierarchical feature space in ESMC that mirrors traditional biological understanding, from basic biochemical properties to complex functional themes, learned without prior biological knowledge.
ESMC's world-modeling approach allows for the design of novel protein binders, including antibodies and single-chain variable fragments (scFvs), that achieve therapeutic-level affinity.
Biohub's Virtual Biology initiative aims to invest $400 million internally and $100 million externally to accelerate data generation and technological development for biological modeling.
The future of scientific discovery, particularly in biology, will integrate frontier experimental biology, advanced technology for measurement, and frontier artificial intelligence, fostering a feedback loop for accelerated learning.
The 'bitter lesson' applied to protein biology
The conversation explores the application of the 'bitter lesson' – the idea that scaling up general methods like large language models (LLMs) is more effective than hard-coding domain-specific knowledge – to protein biology. Alex Rives, Head of Science at Biohub, discusses how his team has developed protein language models, starting with ESM in 2018. The core concept is to train models on the evolutionary data of proteins and observe the emergence of biological understanding, such as structure and function, by simply predicting masked amino acids in a sequence. This approach, initially met with skepticism due to the perceived differences between natural language and protein sequences, has demonstrated remarkable success through increased scale in both data and model parameters, leading to unexpected capabilities.
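The masked-prediction setup described above can be sketched as follows. This is a minimal illustration of how training examples for such an objective might be prepared, not the ESM team's pipeline: the `#` mask symbol, the 15% mask rate (a common convention in masked language modeling), and the function names are all assumptions made for this sketch.

```python
import random

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # the 20 standard residues
MASK = "#"  # hypothetical mask symbol chosen for this sketch


def mask_sequence(seq, mask_rate=0.15, rng=None):
    """Hide a random subset of residues.

    Returns the masked string and a dict mapping each hidden
    position to the residue the model would have to predict.
    """
    rng = rng or random.Random(0)
    n_mask = max(1, round(len(seq) * mask_rate))
    positions = sorted(rng.sample(range(len(seq)), n_mask))
    masked = list(seq)
    targets = {}
    for pos in positions:
        targets[pos] = seq[pos]
        masked[pos] = MASK
    return "".join(masked), targets


def masked_accuracy(targets, predict):
    """Fraction of hidden residues that `predict(position)` recovers."""
    correct = sum(1 for pos, aa in targets.items() if predict(pos) == aa)
    return correct / len(targets)


# An arbitrary example sequence, used only for illustration.
example = "MKTAYIAKQRQISFVKSHFSRQ"
masked_example, hidden = mask_sequence(example)
```

A real protein language model would replace `predict` with a neural network that outputs a distribution over the 20 residues at each masked position; the training signal is simply how well it recovers the hidden residues.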
Evolutionary scale modeling for proteins: From ESM to ESMC
The evolution of the ESM models, culminating in ESMC, illustrates the power of scaling. ESM2, trained on UniRef, began to show diminishing returns as it was scaled up. The breakthrough with ESMC came from incorporating metagenomic data, which vastly expanded the diversity of protein sequences. This dataset, drawn from environments such as hydrothermal vents and soil, provided billions of additional sequences that are often noisy but evolutionarily diverse. The expansion eliminated the diminishing returns, showing that ESM2 had been data-limited. ESMC, trained on 6.8 billion non-redundant proteins and used to predict structures for 1.1 billion of them, has produced a comprehensive picture of protein biology, discovering linkages across evolution and enabling novel protein design.
Mechanistic interpretability and emergent features
Using techniques like sparse autoencoders on the ESMC model, researchers can probe the internal representations learned by the model. These analyses reveal a hierarchical structure of features that remarkably mirrors established biological understanding. The model independently learns to represent basic biochemical properties, structural building blocks, and even complex functional themes, correlating with concepts developed over decades of biological research. A striking example is the model's unified representation of the 'nucleophilic elbow' motif, which it identifies across evolutionarily diverse and structurally distinct proteins. This suggests that the model is learning fundamental underlying biological principles that govern protein structure and function from the data alone, providing a powerful lens for understanding biological organization.
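For readers unfamiliar with the technique, the forward pass and objective of a generic sparse autoencoder can be sketched as below. The summary does not describe the architecture or hyperparameters used on ESMC, so the layer sizes, the ReLU encoder, the L1 penalty weight, and all names here are assumptions; training (gradient descent on this loss) is omitted.

```python
import random


def matvec(matrix, vec):
    """Multiply a list-of-rows matrix by a vector."""
    return [sum(w * x for w, x in zip(row, vec)) for row in matrix]


class SparseAutoencoder:
    """Maps a d_model activation to n_features non-negative codes and back.

    n_features is usually larger than d_model, and an L1 penalty
    pushes most codes to zero so that each active feature is
    easier to interpret.
    """

    def __init__(self, d_model, n_features, seed=0):
        rng = random.Random(seed)
        self.w_enc = [[rng.gauss(0, 0.1) for _ in range(d_model)]
                      for _ in range(n_features)]
        self.b_enc = [0.0] * n_features
        self.w_dec = [[rng.gauss(0, 0.1) for _ in range(n_features)]
                      for _ in range(d_model)]
        self.b_dec = [0.0] * d_model

    def encode(self, activation):
        pre = matvec(self.w_enc, activation)
        # ReLU keeps only positively activated features.
        return [max(0.0, p + b) for p, b in zip(pre, self.b_enc)]

    def decode(self, features):
        return [r + b for r, b in zip(matvec(self.w_dec, features), self.b_dec)]

    def loss(self, activation, l1_coeff=1e-3):
        """Reconstruction error plus an L1 sparsity penalty on the codes."""
        features = self.encode(activation)
        recon = self.decode(features)
        mse = sum((a - r) ** 2 for a, r in zip(activation, recon)) / len(activation)
        sparsity = sum(abs(f) for f in features)
        return mse + l1_coeff * sparsity, features


# Toy usage: a 4-dimensional "activation" expanded into 16 candidate features.
sae = SparseAutoencoder(d_model=4, n_features=16)
total_loss, codes = sae.loss([0.5, -1.0, 0.25, 2.0])
```

After training on many model activations, individual entries of `codes` are what an interpretability analysis would inspect, for example by checking which residues or motifs make a given feature fire.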
Designing novel proteins for therapeutic applications
The world-modeling capability of these large protein models, particularly ESMC, extends beyond prediction to de novo design. By searching the model's representation space, researchers can identify or design protein molecules that satisfy specific design criteria. This has led to the successful design of numerous protein binders and, more notably, antibodies and single-chain variable fragments (scFvs). The designed antibodies have demonstrated therapeutic-level affinity, a critical benchmark for potential use in medicine. The ability to design complex binding interfaces, even for formats like scFvs, which join the variable domains of the heavy and light chains into a single chain, represents a significant advance in programmable biology and drug discovery.
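The general idea of searching for a molecule that satisfies a design criterion can be illustrated with a toy greedy search. This is not the method used with ESMC, which the summary does not detail, and it searches sequence space directly rather than a learned representation space. The scoring function below (fraction of hydrophobic residues) is a deliberately simple, hypothetical stand-in for a model-derived score, and all names are assumptions.

```python
import random

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # the 20 standard residues


def toy_score(seq):
    """Hypothetical design criterion: fraction of hydrophobic residues.

    In a real workflow this would be replaced by a score derived
    from a trained model, such as a predicted binding quality.
    """
    return sum(seq.count(aa) for aa in "AILMFVW") / len(seq)


def greedy_design(start, score, steps=200, seed=0):
    """Propose single-residue mutations and keep those that raise the score."""
    rng = random.Random(seed)
    best, best_score = start, score(start)
    for _ in range(steps):
        pos = rng.randrange(len(best))
        candidate = best[:pos] + rng.choice(AMINO_ACIDS) + best[pos + 1:]
        cand_score = score(candidate)
        if cand_score > best_score:
            best, best_score = candidate, cand_score
    return best, best_score


start_seq = "GSGSGSGSGS"  # arbitrary starting point for illustration
designed, designed_score = greedy_design(start_seq, toy_score)
```

The structure is the point rather than the scoring rule: a generator proposes candidates, a scoring function encodes the design criteria, and the loop keeps whatever improves the score.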
The 'bitter lesson' contrasts with AlphaFold's approach
Unlike models such as AlphaFold, which incorporate significant biological 'inductive biases' and rely heavily on multiple sequence alignments (MSAs), the ESM approach emphasizes learning from raw sequence data at scale. Rives suggests that MSAs were crucial to AlphaFold's success, but the ESM approach demonstrates that emergent capabilities can be achieved without these explicit priors. The team's data also indicates that for certain applications, such as antibody design, their model may even outperform approaches that depend on explicit evolutionary information from alignments. This highlights the power of large-scale, general-purpose learning for uncovering biological patterns.
Biohub's vision for accelerating biological discovery
Biohub, a philanthropic initiative, is committed to accelerating scientific discovery to cure and prevent disease. This involves building a scientific institution powered by frontier experimental biology, technology, and AI. A core component is the development of comprehensive digital representations of biological complexity, from the molecular to the physiological level. This includes investing heavily in data creation and technology development, exemplified by its $500 million Virtual Biology initiative ($400 million internal, $100 million external). The goal is to create models that can generalize, predict novel experimental outcomes, and eventually enable truly programmable biology, moving beyond existing virtual cell models that have limited predictive power in novel contexts.
The future: Data generation, feedback loops, and computational power
The future of biology hinges on overcoming bottlenecks in data generation and computational power. Biohub's initiatives focus on scaling data generation technologies, increasing the number of modalities that can be measured simultaneously, and reducing costs. They emphasize the need for speed, aiming to achieve progress in years rather than decades. The integration of experimental data with AI models through feedback loops, akin to reinforcement learning, is seen as critical. While compute is a recognized bottleneck, Rives emphasizes that data and compute must scale in tandem. He notes that while ESM-1b was trained on a billion sequences, there are potentially orders of magnitude more sequences to be discovered, and that the 'bitter lesson' of scaling data still holds significant promise.
Open source and collaboration for scientific progress
Biohub champions open science, believing that providing tools like ESMC to the scientific community will accelerate progress. The ESMC model and its associated data will be open-sourced under an MIT license, encouraging widespread use and collaboration. The team is eager to work with other scientists, understand their needs, and build upon their findings. This collaborative, open-source ethos is central to Biohub's mission of advancing science broadly and accelerating the path towards curing and preventing diseases.
Common Questions
The 'Bitter Lesson' suggests that scaling up computation and data, rather than relying on human-designed inductive biases, is the most effective path to advancing AI capabilities, even in complex domains like protein biology.
Mentioned in this video
Alex Rives: Head of Science at Biohub, a computer scientist working on AI for biology, specifically language models for protein biology. He believes in scaling laws and the Bitter Lesson.
Host of the Latent Space AI for Science podcast and CTO of Muromix.
Co-founder of Meta and a proponent of Biohub's mission. His previous appearance on the podcast was a catalyst for the science section, and he laid out an ambitious vision for Biohub.
Co-founder of Meta and a proponent of Biohub's mission. Her previous appearance on the podcast was a catalyst for the science section, and she laid out an ambitious vision for Biohub.
Claude Shannon: A pioneer in information theory, known for his concept of the ideal predictor for the next character in a sequence and his calculation of the entropy of the English language.
Biohub: A scientific institution aiming to cure or prevent disease by accelerating science through frontier experimental biology, technology for measurement, and artificial intelligence. It focuses on building foundational tools and promoting open science.
Protein Data Bank, a repository of experimentally determined protein structures. The creation of the PDB is contrasted with the data generation for ESMC, highlighting the time and effort involved.
An initiative Biohub has supported that aims to create a comprehensive reference map of all human cells, which is building on efforts to create large cell atlases.
ESM2: The previous-generation protein language model trained by Rives's team. It showed diminishing returns to scale and was data-limited.
UniRef: A gold-standard dataset for sequence biology, created by clustering sequences from various resources to reduce redundancy and provide definitive coverage of protein biology.
AlphaFold: A protein structure prediction model known for incorporating inductive bias. ESMC is contrasted with AlphaFold, as ESMC learns structure without explicit priors.
A database of single-cell transcriptomics developed by Biohub, contributing to their efforts in understanding cellular biology.
Single-chain variable fragments (scFvs), an engineered antibody format that is a critical therapeutic modality. ESMC has shown success in designing these with high affinity.
The first version of the ESM Atlas was used by Funang's group to discover a new gene editing system, demonstrating its potential for scientific discovery.