How does ESMC differ from previous protein language models like ESM2?

ESMC incorporates a vast amount of metagenomic data alongside traditional protein sequence data. This addition of diverse evolutionary contexts removed diminishing returns to scale, making it more powerful than ESM2, which was data-limited.

What is the significance of metagenomic sequencing in ESMC's training data?

Metagenomic sequencing samples diverse environments (like hydrothermal vents or the human gut) and sequences all genetic material present. This provides a broad spectrum of protein sequences, revealing novel biological information and constraints that help the model learn more comprehensively.

How does ESMC's approach contrast with AlphaFold?

While AlphaFold incorporates significant inductive biases to predict protein structure, ESMC primarily uses a scaled-up transformer language model trained on vast protein sequence data. The goal is to learn the underlying structure and patterns directly from the data without pre-defined priors.

What are SCFVs and why are they important for protein design?

SCFVs (single-chain variable fragments) are engineered antibody fragments. Antibodies are a critical modality for medicine, accounting for about a quarter of new drugs. ESMC can design SCFVs with therapeutic levels of affinity, and the designs can even be reformatted into full antibodies.

What unexpected insights has mechanistic interpretability provided from ESMC?

Mechanistic interpretability has revealed a hierarchical feature space in ESMC that corresponds to known biological concepts. It also identified clusters of evolutionarily distant gene editing systems, potentially pointing to novel systems yet to be discovered.

What is Biohub's overarching mission?

Biohub's mission is to cure or prevent disease by accelerating scientific understanding. They aim to build foundational tools and technologies, powered by experimental biology, advanced measurement technology, and frontier AI, to tackle biological complexity from the molecular level to physiology.

What are the key principles for the next era of biology, according to Alex Rives?

The key principles are data generation, computational predictive digital representations of biology (like ESMC), and feedback loops (integrating AI with experimental data and reasoning). This holistic approach aims to model complex biological systems from molecules to physiology.

How is Biohub's Virtual Biology Initiative addressing data gaps?

The initiative totals $500 million. About $400 million funds internal data creation and technology development to scale data generation across multiple modalities, and $100 million is committed to catalyzing external efforts, in the hope of spurring a broad, collaborative approach to data collection.

What bottlenecks limit AI progress in biology?

While compute is a significant bottleneck for AI in general, in biology, the primary limitation is access to sufficient, high-quality data and the speed of experimental validation. Scaling both data generation and computational resources is crucial.

How much more protein sequence data is out there beyond what ESMC used?

While ESMC trained on about a billion sequences, it's estimated there are potentially 100 billion sequences. Even after clustering, there's still a vast amount of data, especially encompassing smaller variations that are crucial for understanding protein function.

What is the call to action for listeners interested in ESMC?

The ESMC model and world model for protein biology are being released as open source under an MIT license. Biohub encourages researchers to use them, collaborate, and provide feedback to help accelerate scientific discovery.

Key Moments

🔬 The Bitter Lesson is Coming for Proteins - Alex Rives, BioHub

Latent Space Podcast

Science & Technology | 5 min read | 71 min video

May 27, 2026 | 4,876 views

TL;DR

Language models trained on billions of protein sequences can predict protein structure and function and can design new antibodies, but doing so requires massive datasets and compute.

Key Insights

ESMC, trained on 6.8 billion non-redundant protein sequences, predicted structures for 1.1 billion of them, revealing a comprehensive picture of protein structure and function.

By incorporating metagenomic sequences alongside UNIREF, ESMC moved beyond the diminishing returns observed in ESM2, indicating that data, not just compute, was the key limitation.

Sparse autoencoders revealed a hierarchical feature space in ESMC that mirrors traditional biological understanding, from basic biochemical properties to complex functional themes, learned without prior biological knowledge.

ESMC's world modeling approach allows for the design of novel protein binders, including antibodies and SCFVs, achieving therapeutic-level affinity.

Biohub's Virtual Biology initiative aims to invest $400 million internally and $100 million externally to accelerate data generation and technological development for biological modeling.

The future of scientific discovery, particularly in biology, will integrate frontier experimental biology, advanced technology for measurement, and frontier artificial intelligence, fostering a feedback loop for accelerated learning.

The 'bitter lesson' applied to protein biology

The conversation explores the application of the 'bitter lesson' – the idea that scaling up general methods like large language models (LLMs) is more effective than hard-coding domain-specific knowledge – to protein biology. Alex Rives, Head of Science at Biohub, discusses how his team has developed protein language models, starting with ESM in 2018. The core concept is to train models on the evolutionary data of proteins and observe the emergence of biological understanding, such as structure and function, by simply predicting masked amino acids in a sequence. This approach, initially met with skepticism due to the perceived differences between natural language and protein sequences, has demonstrated remarkable success through increased scale in both data and model parameters, leading to unexpected capabilities.
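The masked-prediction objective described above can be sketched in a few lines. This is a minimal illustration, not the ESM training code: the mask symbol, the 15% mask rate, the example string, and the `uniform_predictor` stand-in are all assumptions made for the example.

```python
import math
import random

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # the 20 standard residues
MASK = "#"  # placeholder symbol for a hidden residue (illustrative choice)


def mask_sequence(seq, mask_rate=0.15, rng=None):
    """Hide a random subset of residues; return the masked string and the hidden targets."""
    rng = rng or random.Random(0)
    n_mask = max(1, round(len(seq) * mask_rate))
    positions = sorted(rng.sample(range(len(seq)), n_mask))
    masked = list(seq)
    targets = {}
    for pos in positions:
        targets[pos] = seq[pos]
        masked[pos] = MASK
    return "".join(masked), targets


def uniform_predictor(masked_seq, position):
    """Hypothetical stand-in for the model: equal probability for every residue."""
    return {aa: 1.0 / len(AMINO_ACIDS) for aa in AMINO_ACIDS}


def masked_lm_loss(masked_seq, targets, predictor):
    """Average negative log-likelihood of the true residues at the masked positions."""
    total = 0.0
    for pos, true_aa in targets.items():
        probs = predictor(masked_seq, pos)
        total += -math.log(probs[true_aa])
    return total / len(targets)


sequence = "MKTAYIAKQRQISFVKSHFSRQ"  # arbitrary example string, not a real protein record
masked, targets = mask_sequence(sequence)
loss = masked_lm_loss(masked, targets, uniform_predictor)
```

A predictor that knows nothing scores ln(20), roughly 3.0 nats per masked residue. Training pushes the loss below that baseline, and the only way to do so is to pick up regularities in the sequences themselves.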

Evolutionary scale modeling for proteins: From ESM to ESMC

The evolution of the ESM models, culminating in ESMC, illustrates the power of scaling. ESM2, trained on UNIREF, began to show diminishing returns in scaling. The breakthrough with ESMC came from incorporating metagenomic data, vastly expanding the diversity of protein sequences. This dataset, derived from various environments like hydrothermal vents and soil, provided billions of additional, often noisy, but evolutionarily diverse sequences. This expansion eliminated the diminishing returns, showing that ESM2 was data-limited. ESMC, trained on 6.8 billion non-redundant proteins and predicting structures for 1.1 billion, has produced a comprehensive picture of protein biology, discovering linkages across evolution and enabling novel protein design.
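The "non-redundant" counts above depend on clustering similar sequences and keeping one representative per cluster. The following is a toy sketch of greedy identity clustering, assuming a 90% threshold and an alignment-free identity measure. Production pipelines use dedicated alignment-based tools, and the episode does not specify which method was used for ESMC.

```python
def identity(a, b):
    """Fraction of matching positions, without alignment (a deliberate simplification)."""
    matches = sum(x == y for x, y in zip(a, b))
    return matches / max(len(a), len(b))


def greedy_cluster(sequences, threshold=0.9):
    """Assign each sequence to the first representative it matches, else start a new cluster."""
    clusters = {}  # representative -> list of members
    for seq in sorted(sequences, key=len, reverse=True):  # longest first
        for rep in clusters:
            if identity(seq, rep) >= threshold:
                clusters[rep].append(seq)
                break
        else:
            clusters[seq] = [seq]
    return clusters


toy = [
    "MKTAYIAKQR",
    "MKTAYIAKQK",  # one substitution from the first sequence
    "MKTAYIAKQR",  # exact duplicate of the first sequence
    "GSHMLEDPVA",  # unrelated made-up sequence
]
clusters = greedy_cluster(toy)
representatives = list(clusters)
```

Here the four toy sequences collapse into two clusters. The near-duplicate that gets folded into a representative is exactly the kind of small variation that deduplication hides from training.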

Mechanistic interpretability and emergent features

Using techniques like sparse autoencoders on the ESMC model, researchers can probe the internal representations learned by the model. These analyses reveal a hierarchical structure of features that remarkably mirrors established biological understanding. The model independently learns to represent basic biochemical properties, structural building blocks, and even complex functional themes, correlating with concepts developed over decades of biological research. A striking example is the model's unified representation of the 'nucleophilic elbow' motif, which it identifies across evolutionarily diverse and structurally distinct proteins. This suggests that the model is learning fundamental underlying biological principles that govern protein structure and function from the data alone, providing a powerful lens for understanding biological organization.
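For readers unfamiliar with sparse autoencoders, the sketch below shows one common formulation: a ReLU encoder into a wider feature space, a linear decoder, and a loss that adds an L1 sparsity penalty to the reconstruction error. The sizes, random weights, and penalty coefficient are illustrative assumptions, and the exact variant applied to ESMC is not described in the episode.

```python
import random


def relu(v):
    return [max(0.0, x) for x in v]


def matvec(matrix, v):
    return [sum(w * x for w, x in zip(row, v)) for row in matrix]


def sae_forward(x, w_enc, b_enc, w_dec, b_dec, l1_coeff=1e-3):
    """Encode into a wider feature vector, decode back, and return (features, recon, loss)."""
    features = relu([h + b for h, b in zip(matvec(w_enc, x), b_enc)])
    recon = [r + b for r, b in zip(matvec(w_dec, features), b_dec)]
    mse = sum((r - xi) ** 2 for r, xi in zip(recon, x)) / len(x)
    sparsity = sum(features)  # L1 norm; features are non-negative after ReLU
    return features, recon, mse + l1_coeff * sparsity


rng = random.Random(0)
d_model, d_features = 4, 16  # toy sizes; real model activations are far wider
w_enc = [[rng.gauss(0, 0.5) for _ in range(d_model)] for _ in range(d_features)]
w_dec = [[rng.gauss(0, 0.5) for _ in range(d_features)] for _ in range(d_model)]
b_enc = [0.0] * d_features
b_dec = [0.0] * d_model

activation = [rng.gauss(0, 1) for _ in range(d_model)]  # stand-in for one residue's hidden state
features, recon, loss = sae_forward(activation, w_enc, b_enc, w_dec, b_dec)
active = [i for i, f in enumerate(features) if f > 0]  # indices of features that fired
```

After training, the penalty leaves only a few features active per input, which is what makes it possible to inspect individual features and match them to concepts such as a structural motif.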

Designing novel proteins for therapeutic applications

The world-modeling capability of these large protein models, particularly ESMC, extends beyond prediction to de novo design. By searching the model's representation space, researchers can identify or design protein molecules that satisfy specific design criteria. This has led to the successful design of numerous protein binders and, more excitingly, antibodies and single-chain variable fragments (SCFVs). These designed antibodies have demonstrated therapeutic-level affinity, a critical benchmark for their potential use in medicine. The ability to design complex binding interfaces, even for modalities like SCFVs, which join the variable regions of the heavy and light chains into a single chain, represents a significant advancement in programmable biology and drug discovery.
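The idea of design as search can be illustrated with a deliberately simple loop. This sketch swaps in plain hill climbing over sequence space with a made-up scoring function. It is not the method used with ESMC, which searches the model's learned representation, and `toy_score` and its target string are hypothetical.

```python
import random

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def toy_score(seq, target="WYRDE"):
    """Hypothetical objective: count positions matching a made-up target string.
    A real pipeline would score candidates with a model or an assay instead."""
    return sum(a == b for a, b in zip(seq, target))


def hill_climb(start, score_fn, steps=2000, seed=0):
    """Propose single-residue mutations and keep those that do not lower the score."""
    rng = random.Random(seed)
    best, best_score = start, score_fn(start)
    for _ in range(steps):
        pos = rng.randrange(len(best))
        candidate = best[:pos] + rng.choice(AMINO_ACIDS) + best[pos + 1:]
        cand_score = score_fn(candidate)
        if cand_score >= best_score:
            best, best_score = candidate, cand_score
    return best, best_score


start = "AAAAA"
designed, final_score = hill_climb(start, toy_score)
```

The loop only needs a way to propose candidates and a way to rank them. The quality of the result depends on how well the scoring function reflects the real design criterion, such as binding affinity.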

The 'bitter lesson' contrasts with AlphaFold's approach

Unlike models such as AlphaFold, which incorporate significant biological 'inductive biases' and rely heavily on multiple sequence alignments (MSAs), the ESM approach emphasizes learning from raw sequence data at scale. Rives suggests that AlphaFold's reliance on MSAs was crucial for its success, but the ESM approach demonstrates that emergent capabilities can be achieved without these explicit priors. In fact, the team's data indicates that for certain applications, like antibody design, their model may even outperform approaches that depend on explicit evolutionary information of this kind. This highlights the power of large-scale, general-purpose learning for uncovering biological patterns.

Biohub's vision for accelerating biological discovery

Biohub, a philanthropic initiative, is committed to accelerating scientific discovery to cure and prevent disease. This involves building a scientific institution powered by frontier experimental biology, technology, and AI. A core component is the development of comprehensive digital representations of biological complexity, from molecular to physiological levels. This includes investing heavily in data creation and technology development, exemplified by their $500 million Virtual Biology initiative. The goal is to create models that can generalize, predict novel experimental outcomes, and eventually enable truly programmable biology, moving beyond existing virtual cell models that have limited predictive power in novel contexts.

The future: Data generation, feedback loops, and computational power

The future of biology hinges on overcoming bottlenecks in data generation and computational power. Biohub's initiatives focus on scaling data generation technologies, increasing the number of measurable modalities simultaneously, and reducing costs. They emphasize the need for speed, aiming to achieve progress in years rather than decades. The integration of experimental data with AI models through feedback loops, akin to reinforcement learning, is seen as critical. While compute is a recognized bottleneck, Rives emphasizes that both data and compute must scale in tandem. He notes that while ESM-1B was trained on a billion sequences, there are potentially orders of magnitude more sequences to be discovered, and that the 'bitter lesson' of scaling data still holds significant promise.

Open source and collaboration for scientific progress

Biohub champions open science, believing that providing tools like ESMC to the scientific community will accelerate progress. The ESMC model and its associated data will be open-sourced under an MIT license, encouraging widespread use and collaboration. The team is eager to work with other scientists, understand their needs, and build upon their findings. This collaborative, open-source ethos is central to Biohub's mission of advancing science broadly and accelerating the path towards curing and preventing diseases.

Common Questions

The 'Bitter Lesson' suggests that scaling up computation and data, rather than relying on human-designed inductive biases, is the most effective path to advancing AI capabilities, even in complex domains like protein biology.

Topics

Health & Longevity AI & Machine Learning Technology & Innovation Science & Mathematics Deep Learning Protein Design Computational Biology AI In Drug Discovery Protein Language Models Machine Learning For Science Biohub Initiatives Generative AI For Biology

Mentioned in this video

People

Alex Rives

Head of Science at Biohub, a computer scientist working on AI for biology, specifically language models for protein biology. He believes in scaling laws and the Bitter Lesson theory.

R.J. Haneki

Host of the Latent Space AI for Science podcast and CTO of Muromix.

Mark Zuckerberg

Co-founder of Meta and a proponent of Biohub's mission. His previous appearance on the podcast was a catalyst for the science section, and he laid out an ambitious vision for Biohub.

Priscilla Chan

Co-founder of the Chan Zuckerberg Initiative and a proponent of Biohub's mission. Her previous appearance on the podcast was a catalyst for the science section, and she laid out an ambitious vision for Biohub.

Claude Shannon

A pioneer in information theory, known for his concept of the ideal predictor for the next character in a sequence and his calculation of the entropy of the English language.

Organizations

Biohub

A scientific institution aiming to cure or prevent disease by accelerating science through frontier experimental biology, technology for measurement, and artificial intelligence. They focus on building foundational tools and promoting open science.

PDB

Protein Data Bank, a repository of experimentally determined protein structures. The creation of the PDB is contrasted with the data generation for ESMC, highlighting the time and effort involved.

Human Cell Atlas

An initiative supported by Biohub that aims to create a comprehensive reference map of all human cells, building on earlier efforts to assemble large cell atlases.

Companies

Muromix

The company where R.J. Haneki, the host of the Latent Space AI for Science podcast, is the CTO.

Software & Apps

ESM2

The previous generation protein language model trained by Rives's team. It showed diminishing returns to scale and was data-limited.

UNIREF

A gold standard dataset for sequence biology, created by clustering sequences from various resources to reduce redundancy and provide definitive coverage of protein biology.

AlphaFold

A protein structure prediction model known for incorporating inductive bias. ESMC is contrasted with AlphaFold, as ESMC learns structure without explicit priors.

Cell by Gene

A database of single-cell transcriptomics developed by Biohub, contributing to their efforts in understanding cellular biology.

Products

SCFVs

Single-chain variable fragments, a type of antibody that is a critical therapeutic modality. ESMC has shown success in designing these with high affinity.

Atlas

The first version of the ESM Atlas was used by Funang's group to discover a new gene editing system, demonstrating its potential for scientific discovery.
