🔬 The Bitter Lesson is Coming for Proteins - Alex Rives, BioHub

Latent Space Podcast
Science & Technology | 5 min read | 71 min video
May 27, 2026 | 1,037 views

TL;DR

Language models trained on billions of protein sequences can predict protein structure and function and can be used to design new antibodies, but doing so requires massive datasets and compute.

Key Insights

1. ESMC, trained on 6.8 billion non-redundant protein sequences, predicted structures for 1.1 billion of them, revealing a comprehensive picture of protein structure and function.

2. By incorporating metagenomic sequences alongside UniRef, ESMC moved beyond the diminishing returns observed in ESM2, indicating that data, not just compute, was the key limitation.

3. Sparse autoencoders revealed a hierarchical feature space in ESMC that mirrors traditional biological understanding, from basic biochemical properties to complex functional themes, learned without prior biological knowledge.

4. ESMC's world-modeling approach allows for the design of novel protein binders, including antibodies and scFvs, achieving therapeutic-level affinity.

5. Biohub's Virtual Biology initiative aims to invest $400 million internally and $100 million externally to accelerate data generation and technological development for biological modeling.

6. The future of scientific discovery, particularly in biology, will integrate frontier experimental biology, advanced measurement technology, and frontier artificial intelligence, fostering a feedback loop for accelerated learning.

The 'bitter lesson' applied to protein biology

The conversation explores how the 'bitter lesson' applies to protein biology: the idea that scaling up general methods, such as large language models (LLMs), is more effective than hard-coding domain-specific knowledge. Alex Rives, Head of Science at Biohub, discusses how his team has developed protein language models, starting with ESM in 2018. The core concept is to train models on evolutionary protein sequence data with a simple objective, predicting masked amino acids in a sequence, and to observe biological understanding such as structure and function emerge. The approach was initially met with skepticism because of the perceived differences between natural language and protein sequences, but scaling both data and model parameters has proved remarkably successful and has led to unexpected capabilities.
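The masked-prediction objective described above can be sketched in a few lines. This is only an illustration of the data side of the task, not the ESM training code: the `<mask>` token name, the 15% mask rate (borrowed from common masked language modeling practice), the example sequence, and the helper names `mask_sequence` and `masked_accuracy` are all assumptions made for the example.

```python
import random

MASK = "<mask>"  # hypothetical mask token name, chosen for this sketch


def mask_sequence(seq, mask_rate=0.15, seed=0):
    """Hide a random subset of residues.

    Returns (tokens, targets): tokens is the sequence with some positions
    replaced by MASK, and targets maps each masked position to the residue
    that was hidden there. mask_rate=0.15 is an assumed default.
    """
    rng = random.Random(seed)
    tokens = list(seq)
    n_mask = max(1, round(len(tokens) * mask_rate))
    positions = sorted(rng.sample(range(len(tokens)), n_mask))
    targets = {i: tokens[i] for i in positions}
    for i in positions:
        tokens[i] = MASK
    return tokens, targets


def masked_accuracy(predictions, targets):
    """Fraction of masked positions where the predicted residue is correct."""
    hits = sum(predictions.get(i) == aa for i, aa in targets.items())
    return hits / len(targets)


# Illustrative sequence only; a real model would be asked to fill in the
# masked positions from the surrounding context.
sequence = "MKTAYIAKQRQISFVKSHFSRQ"
tokens, targets = mask_sequence(sequence)
```

A model trained this way never sees structure or function labels. It is scored only on how well it recovers the hidden residues, which is why the emergence of structural and functional knowledge is notable.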

Evolutionary scale modeling for proteins: From ESM to ESMC

The evolution of the ESM models, culminating in ESMC, illustrates the power of scaling. ESM2, trained on UniRef, began to show diminishing returns as it was scaled up. The breakthrough with ESMC came from incorporating metagenomic data, which vastly expanded the diversity of protein sequences. These data, sampled from environments such as hydrothermal vents and soil, provided billions of additional sequences that are often noisy but evolutionarily diverse. The expansion eliminated the diminishing returns, showing that ESM2 had been data-limited. ESMC, trained on 6.8 billion non-redundant proteins and used to predict structures for 1.1 billion of them, has produced a comprehensive picture of protein biology, uncovering linkages across evolution and enabling novel protein design.

Mechanistic interpretability and emergent features

Techniques such as sparse autoencoders let researchers probe the internal representations ESMC has learned. These analyses reveal a hierarchical structure of features that closely mirrors established biological understanding. The model independently learns to represent basic biochemical properties, structural building blocks, and even complex functional themes, corresponding to concepts developed over decades of biological research. A striking example is the model's unified representation of the 'nucleophilic elbow' motif, which it identifies across evolutionarily diverse and structurally distinct proteins. This suggests that the model is learning, from the data alone, fundamental principles that govern protein structure and function, providing a powerful lens for understanding biological organization.
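For readers unfamiliar with the technique, a sparse autoencoder can be sketched as follows. This is the generic ReLU-plus-L1 formulation, not the setup used on ESMC: the class name, dimensions, initialization, and the `l1_coeff` value are assumptions for the example, and training (gradient updates) is omitted.

```python
import random


def matvec(W, v):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(w * x for w, x in zip(row, v)) for row in W]


class SparseAutoencoder:
    """Maps a model activation to a wider, mostly-zero feature vector."""

    def __init__(self, d_model, d_features, seed=0):
        rng = random.Random(seed)
        # Small random weights; a real SAE would learn these by training.
        self.W_enc = [[rng.gauss(0, 0.1) for _ in range(d_model)]
                      for _ in range(d_features)]
        self.b_enc = [0.0] * d_features
        self.W_dec = [[rng.gauss(0, 0.1) for _ in range(d_features)]
                      for _ in range(d_model)]

    def encode(self, x):
        # ReLU keeps feature activations non-negative.
        pre = [a + b for a, b in zip(matvec(self.W_enc, x), self.b_enc)]
        return [max(0.0, p) for p in pre]

    def decode(self, f):
        return matvec(self.W_dec, f)

    def loss(self, x, l1_coeff=1e-3):
        # Reconstruction error plus an L1 penalty that pushes most
        # feature activations toward zero.
        f = self.encode(x)
        x_hat = self.decode(f)
        recon = sum((a - b) ** 2 for a, b in zip(x, x_hat)) / len(x)
        sparsity = sum(abs(a) for a in f)
        return recon + l1_coeff * sparsity


# Toy activation vector standing in for one residue's hidden state.
sae = SparseAutoencoder(d_model=4, d_features=16)
activation = [0.5, -1.0, 0.25, 2.0]
features = sae.encode(activation)
```

After training, each feature dimension tends to fire for a narrow pattern in the input. Interpretability work then inspects which sequences or residues activate a given feature, which is how motif-level features like the one described above would be identified.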

Designing novel proteins for therapeutic applications

The world-modeling capability of these large protein models, particularly ESMC, extends beyond prediction to de novo design. By searching the model's representation space, researchers can identify or design protein molecules that satisfy specific design criteria. This has led to the successful design of numerous protein binders and, more excitingly, antibodies and single-chain variable fragments (scFvs). These designed antibodies have demonstrated therapeutic-level affinity, a critical benchmark for their potential use in medicine. The ability to design complex binding interfaces, even for modalities like scFvs, which join heavy- and light-chain variable domains in a single chain, represents a significant advance in programmable biology and drug discovery.

The 'bitter lesson' contrasts with AlphaFold's approach

Unlike models such as AlphaFold, which incorporate significant biological 'inductive biases' and rely heavily on multiple sequence alignments (MSAs), the ESM approach emphasizes learning from raw sequence data at scale. Rives suggests that AlphaFold's reliance on MSAs was crucial for its success, but the ESM approach demonstrates that emergent capabilities can be achieved without these explicit priors. In fact, the team's data indicates that for certain applications, such as antibody design, their model may even outperform approaches that rely on that kind of explicit evolutionary information. This highlights the power of large-scale, general-purpose learning for uncovering biological patterns.

Biohub's vision for accelerating biological discovery

Biohub, a philanthropic initiative, is committed to accelerating scientific discovery to cure and prevent disease. This involves building a scientific institution powered by frontier experimental biology, technology, and AI. A core component is the development of comprehensive digital representations of biological complexity, from molecular to physiological levels. This includes investing heavily in data creation and technology development, exemplified by their $500 million Virtual Biology initiative. The goal is to create models that can generalize, predict novel experimental outcomes, and eventually enable truly programmable biology, moving beyond existing virtual cell models that have limited predictive power in novel contexts.

The future: Data generation, feedback loops, and computational power

The future of biology hinges on overcoming bottlenecks in data generation and computational power. Biohub's initiatives focus on scaling data generation technologies, increasing the number of modalities that can be measured simultaneously, and reducing costs. They emphasize the need for speed, aiming to achieve progress in years rather than decades. Integrating experimental data with AI models through feedback loops, akin to reinforcement learning, is seen as critical. While compute is a recognized bottleneck, Rives emphasizes that data and compute must scale in tandem. He notes that ESM-1b was trained on a billion sequences, that there are potentially orders of magnitude more sequences still to be discovered, and that the 'bitter lesson' of scaling data still holds significant promise.

Open source and collaboration for scientific progress

Biohub champions open science, believing that providing tools like ESMC to the scientific community will accelerate progress. The ESMC model and its associated data will be open-sourced under an MIT license, encouraging widespread use and collaboration. The team is eager to work with other scientists, understand their needs, and build upon their findings. This collaborative, open-source ethos is central to Biohub's mission of advancing science broadly and accelerating the path towards curing and preventing diseases.

Common Questions

The 'Bitter Lesson' suggests that scaling up computation and data, rather than relying on human-designed inductive biases, is the most effective path to advancing AI capabilities, even in complex domains like protein biology.
