Training Transformers to Solve the 95% Failure Rate of Cancer Trials – Ron Alfa & Daniel Bear, Noetik
Key Moments
95% of cancer drugs fail clinical trials, but Noetik argues this is a patient-selection problem, not a drug-discovery one, and is leveraging AI to identify optimal patient cohorts for existing treatments.
Key Insights
The vast majority of cancer drugs, around 95%, fail clinical trials, primarily because of inadequate patient selection.
Noetik's approach focuses on understanding patient biology from multimodal data in order to position molecules in the right patient populations, potentially improving the success rates of existing treatments.
The company generates its own high-quality, intentionally designed multimodal data, arguing that biological data is not yet at the scale seen in AI fields like NLP, so intentional design is crucial.
Noetik's models are trained on a massive dataset of over 100 million spatially resolved cells with paired H&E, protein, and spatial transcriptomics data, which the company claims is an order of magnitude larger than comparable datasets.
A $50 million deal with GSK was announced, involving the licensing of Noetik's virtual cell foundation model (OctoVC) and highlighting a shift toward AI model licensing in biotech.
Noetik takes a 'world model' approach, aiming to simulate biological processes and predict the outcomes of actions such as gene knockdowns, validated through in-vivo mouse perturbation experiments and in-silico humanization.
The staggering failure rate of cancer drugs and Noetik's contrarian thesis
A staggering 95% of cancer drugs fail in clinical trials, a statistic that alarms the pharmaceutical industry. Noetik, a biotech company, posits that this failure rate is not due to poor pharmacology or target selection, but rather to the inability to accurately identify which patients will respond to specific treatments. Its core thesis is that models capable of deeply understanding patient biology from the outset can fundamentally improve patient selection, thereby increasing the success rates of existing and new therapies. This approach enables the discovery of new targets from patient data, the precise positioning of molecules within the right patient populations, and even the rescue of trials with poor initial outcomes by identifying responsive subgroups.
The limitations of traditional preclinical models
Traditionally, drug development relies heavily on experiments in cell lines and animal models, and the conversation highlights significant limitations of both. Cancer cell lines are often immortalized, with abnormal genomes and gene expression patterns that do not accurately reflect human tumors. Animal models, while useful, often fail to translate to human biology. Pharmaceutical companies may test drugs against hundreds of cell lines, but mapping that data back to human patients is challenging. As a result, by the time a drug reaches clinical trials, the trial design is often based on limited insight into which patients might benefit, leading to broad, unfocused enrollment and subsequent failure. Noetik aims to bridge this gap by leveraging patient data directly.
Generating high-quality, multimodal data for AI training
Noetik emphasizes the critical importance of intentional data generation. Rather than simply collecting existing datasets, the company meticulously sources and processes human tumor samples in its own lab to generate multimodal data. This matters because, as the founders argue, the scale of high-quality biological data is not yet on par with fields like natural language processing. They draw a parallel to ImageNet, the curated, large-scale dataset that catalyzed deep learning in computer vision. Noetik's data generation strategy follows specific design principles, such as leveraging images (H&E stains and multiplex fluorescence) for their scalability and information richness. The company also controls for variables like batch effects by ensuring each patient's sample is represented across multiple slides and experimental batches (see the sketch below), enabling more robust downstream analysis and model generalization.
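To make the batch-control idea concrete, here is a minimal sketch in Python of how replicate tissue sections might be spread across experimental batches so that no patient is confounded with a single batch. The function name and numbers are hypothetical illustrations, not Noetik's actual pipeline.

```python
import random

def assign_sections_to_batches(patient_ids, n_batches, sections_per_patient=3, seed=0):
    """Spread each patient's replicate tissue sections across distinct
    experimental batches, so patient identity is never confounded with
    a single batch. Hypothetical illustration of the principle above."""
    rng = random.Random(seed)
    assignments = {}  # patient_id -> list of batch indices
    for pid in patient_ids:
        # Pick distinct batches for this patient's replicate sections.
        assignments[pid] = rng.sample(range(n_batches), k=sections_per_patient)
    return assignments

# Example: 4 patients, 6 experimental batches, 3 sections per patient.
print(assign_sections_to_batches(["P1", "P2", "P3", "P4"], n_batches=6))
```

With a layout like this, a downstream model can be checked for batch invariance: representations of the same patient drawn from different batches should agree.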
The rich tapestry of Noetik's multimodal data
The data Noetik generates integrates information from several biological layers to provide a comprehensive view of a patient's tumor. It includes standard H&E (hematoxylin and eosin) staining, which captures the tissue-level structure visible to pathologists. To identify cell types, multiplex fluorescence imaging with antibodies distinguishes immune cells and other cell markers. Crucially, spatial transcriptomics provides spatially resolved RNA expression, revealing the activity of thousands of genes within specific cells at specific locations. This is complemented by DNA genotyping to capture genomic alterations. The combination of tissue structure, cellular composition, and molecular activity yields an information-dense dataset, conceptualized as a stack of image-like layers, each representing a different biological modality: from RGB images to multi-channel fluorescence to high-dimensional transcriptomic data.
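One way to picture this "stack of image-like layers" is as spatially aligned tensors concatenated along the channel axis. The sketch below uses illustrative shapes and channel counts; they are assumptions for the example, not Noetik's actual specification.

```python
import numpy as np

# A toy multimodal "stack" for one tissue region. Shapes and channel
# counts are illustrative only.
H, W = 256, 256

he_rgb = np.zeros((3, H, W), dtype=np.float32)         # H&E stain, 3 channels
protein = np.zeros((16, H, W), dtype=np.float32)       # multiplex fluorescence, ~16 markers
transcripts = np.zeros((500, H, W), dtype=np.float32)  # spatial gene expression, 500 genes

# Concatenating along the channel axis yields one spatially aligned tensor
# where every pixel carries structural, cell-type, and molecular information.
sample = np.concatenate([he_rgb, protein, transcripts], axis=0)
print(sample.shape)  # (519, 256, 256)
```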
Bridging the 'virtual cell' gap with practical, patient-centric models
The concept of a 'virtual cell' can be approached in two ways: simulating all biochemical processes exhaustively, or building functional heuristics for drug development. Noetik leans toward the latter, viewing a virtual cell as a tool for understanding cell biology in a way that is useful for making drugs. Its models simulate cell biology within specific contexts, predicting outputs like transcriptomes or protein levels. This enables experimental design and simulation, answering questions such as how a cell's transcriptomic state changes in response to a drug or genetic perturbation. While some virtual cell models focus on single-cell gene expression in vitro, Noetik's approach integrates data directly from patients, on the belief that this is more likely to translate to clinical outcomes than purely in vitro systems. The company prioritizes learning basic human biology from patient data rather than being biased by external clinical notes, aiming to discover underlying biological truths.
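As a rough illustration of such a counterfactual query, the sketch below zeroes out one gene in a cell's input expression and compares the model's predicted transcriptome before and after. The `simulate_knockdown` helper and the stand-in model are hypothetical; any trained virtual-cell model exposing a predict-expression interface could slot in.

```python
import numpy as np

def simulate_knockdown(model, expression, gene_index):
    """Counterfactual query against a trained virtual-cell model: zero
    out one gene's input expression and compare predictions before and
    after. Hypothetical sketch of the idea described above."""
    baseline = model(expression)
    perturbed_input = expression.copy()
    perturbed_input[gene_index] = 0.0  # in-silico knockdown
    perturbed = model(perturbed_input)
    return perturbed - baseline  # predicted shift in transcriptomic state

# Stand-in "model": any callable mapping expression -> predicted expression.
toy_model = lambda x: np.tanh(x)
delta = simulate_knockdown(toy_model, np.random.rand(500), gene_index=42)
print(delta.shape)  # (500,)
```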
From data clusters to clinical decisions: Noetic's model deployment
Once trained on rich multimodal data, Noetik's models produce 'latent spaces' that cluster patients by their biological profiles, and these clusters can guide drug development and clinical trial design. For instance, if a pharmaceutical company has a molecule targeting a specific pathway, Noetik can simulate its effect on different patient cohorts within the models to predict which groups are most likely to respond. This goes beyond simple mutation status, identifying patient subtypes driven by complex biology, whether genetic, immune-related, or otherwise. The models can predict which sets of patients would benefit from a target, or how immune cells like T cells would behave in a specific patient's tumor microenvironment. The simplest use case is analyzing data from past trials in which responders and non-responders have already been identified: the model maps these groups to distinct clusters and generates hypotheses about why they responded differently, often with surprising interpretability down to the level of gene expression.
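A minimal sketch of that retrospective-trial use case, assuming patient embeddings from a trained model are already available: cluster the latent space, then ask whether responders concentrate in particular clusters. All data and names here are stand-ins.

```python
import numpy as np
from sklearn.cluster import KMeans

# Stand-in patient embeddings and trial outcomes; in practice the
# embeddings would come from a trained multimodal model.
rng = np.random.default_rng(0)
latents = rng.normal(size=(200, 64))   # 200 patients, 64-dim latent space
responded = rng.random(200) < 0.3      # stand-in responder labels

# Cluster the latent space, then compare response rates per cluster.
clusters = KMeans(n_clusters=4, n_init=10, random_state=0).fit_predict(latents)
for c in range(4):
    mask = clusters == c
    print(f"cluster {c}: {mask.sum():3d} patients, "
          f"response rate {responded[mask].mean():.2f}")
```

A cluster with a markedly higher response rate becomes a hypothesis about a biologically distinct responder subgroup, which can then be inspected down to gene-expression level.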
The power of H&E and self-supervised learning for generalization
A key advantage of Noetik's approach is that inference relies only on H&E-stained images, a ubiquitous and standard pathology stain. Models trained on rich multimodal data can thus make predictions from just a digital H&E image, which is readily available from clinical trials and hospital archives. This flexibility is powerful because H&E is the common language of pathology. From these images, the models can predict gene expression patterns at specific locations within a tumor and classify patients into clusters representing responders and non-responders. Interpretability is crucial here: if predicted responders also express the drug's target protein, that corroborates the model's findings. It also shows why simple, single-biomarker approaches fall short, since the complex biological variation predictive of therapeutic response is only captured by the multimodal data the models analyze.
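Conceptually, inference then only needs tiled H&E input. The sketch below slides a window over a digital H&E image and asks a hypothetical model for per-tile gene expression; the tile size and the 500-gene output are assumptions made for the example.

```python
import numpy as np

def predict_expression_from_he(he_model, he_image, tile=64):
    """Slide a window over a digital H&E image (channels-first) and
    predict per-tile gene expression, so only the ubiquitous H&E stain
    is needed at inference time even though training used richer
    modalities. Hypothetical sketch."""
    _, H, W = he_image.shape
    preds = {}
    for y in range(0, H - tile + 1, tile):
        for x in range(0, W - tile + 1, tile):
            patch = he_image[:, y:y + tile, x:x + tile]
            preds[(y, x)] = he_model(patch)  # e.g. a 500-gene expression vector
    return preds

# Stand-in predictor: returns a flat 500-gene vector per patch.
toy_he_model = lambda patch: np.full(500, patch.mean())
preds = predict_expression_from_he(toy_he_model, np.zeros((3, 256, 256)))
print(len(preds))  # 16 tiles
```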
Building a data moat and advanced transformer architectures
Noetik differentiates itself through a significant 'data moat': the sheer scale and quality of its training data. The company has generated over 100 million spatially resolved cells with paired H&E, protein, and spatial transcriptomics, an order of magnitude beyond existing datasets. This scale is critical: they have observed that reducing training data by even 10-40% significantly degrades model performance and generalization. Complementing the data advantage are custom transformer architectures. Their newer model, Tario, advances beyond the earlier masked-autoencoding approach (used in OctoVC) by adopting an autoregressive, next-token prediction objective similar to LLMs. This architectural choice, combined with longer context lengths (seeing more tissue at once), enables better scaling and performance, particularly in capturing complex, non-linear patterns in spatial transcriptomics and inferring biological state from larger tissue regions. It also lets them simulate counterfactual perturbations without running in vitro experiments.
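The difference between the two training objectives can be shown on a toy sequence of "cell tokens". This is purely illustrative: the real models are transformers over spatial multimodal tokens, and the token construction here is an assumption made for the example.

```python
import numpy as np

tokens = np.arange(10)  # stand-in token ids for cells in a tissue region
rng = np.random.default_rng(0)

# Masked autoencoding (the OctoVC-style objective): hide a random subset
# of tokens and predict them from the visible remainder, using
# bidirectional context.
masked = rng.choice(10, size=3, replace=False)
mae_inputs = [t if i not in masked else "[MASK]" for i, t in enumerate(tokens)]
mae_targets = tokens[masked]

# Autoregressive next-token prediction (the Tario-style objective, as in
# LLMs): at each position, predict token i+1 from tokens 0..i only.
ar_inputs = tokens[:-1]
ar_targets = tokens[1:]

print(mae_inputs, mae_targets, ar_inputs, ar_targets, sep="\n")
```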
In-vivo perturbations and 'in-silico humanization' for cross-species validation
To validate its models beyond human patient data, Noetik runs sophisticated in-vivo perturbation experiments using a platform called Perturb-Map. Barcoded gene knockouts are created in cancer cells, which are then injected into mice; each mouse can host hundreds of tumors carrying different genetic perturbations. By spatially resolving the biology of these barcoded tumors, the team can map human tumor biology onto mouse models and validate its predictive models. A key innovation is the 'in-silico humanization' technique, which translates mouse transcriptomic readouts into their human gene equivalents. This allows human biology to be inferred directly from mouse experiments, addressing the differing genetic landscapes of the two species and enabling more direct connectivity between mouse systems and human biology, ultimately supporting drug development with biologically relevant targets and insights transferable to humans.
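At its simplest, the humanization step can be thought of as re-keying a mouse expression profile through an ortholog table. The three-gene mapping below is a tiny hypothetical excerpt; real pipelines use full ortholog databases and must handle genes without clean one-to-one matches.

```python
# Hypothetical excerpt of a mouse-to-human ortholog table.
MOUSE_TO_HUMAN = {"Trp53": "TP53", "Cd8a": "CD8A", "Pdcd1": "PDCD1"}

def humanize(mouse_expression):
    """Re-key a {mouse_gene: value} expression profile to human gene
    symbols, dropping genes without a one-to-one ortholog. Minimal
    sketch of the 'in-silico humanization' idea described above."""
    return {
        MOUSE_TO_HUMAN[g]: v
        for g, v in mouse_expression.items()
        if g in MOUSE_TO_HUMAN
    }

print(humanize({"Trp53": 2.1, "Cd8a": 0.7, "Xist": 5.0}))
# -> {'TP53': 2.1, 'CD8A': 0.7}  (Xist has no one-to-one entry here)
```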
The GSK deal and the future of AI model licensing in biotech
Noetik recently announced a significant $50 million deal with GSK involving the licensing of its OctoVC virtual cell foundation model. The agreement, which includes an upfront payment, milestones, and annual licensing fees, highlights a growing trend of AI model licensing in the biopharma sector. GSK gains access to Noetik's models pre-trained on lung and colon cancer, which it can use for internal research and fine-tune on its vast internal translational datasets. This marks a shift from traditional molecule-based collaborations to model-centric business development, recognizing the value of foundational AI models in accelerating drug discovery and development across pipelines. The appetite for such deals is driven by pharma's recognition of AI's capabilities, the increasing availability of data, and the potential for broad licensing across multiple therapeutic programs.
The crucial role of data scale and conviction in AI for biology
The conversation repeatedly returns to the necessity of generating substantial, high-quality data for AI in biology. Noetik started lab operations and data generation from scratch, and it took about 18 months before the company had enough data to train its first models effectively. The process is data-intensive and expensive, requiring significant upfront investment and conviction. Unlike fields with massive public datasets, biotech often necessitates in-house data generation, and the speakers emphasize that there is a critical threshold below which AI models will not yield meaningful signal. Companies must have the conviction to collect data at scale, anticipating future algorithmic advances. This mirrors historical scientific progress, like Tycho Brahe's extensive astronomical observations later enabling Kepler and Newton to formulate their laws. The open question is who generates this data and who captures its value, with Noetik betting on disciplined, high-quality data generation as its core differentiator.
A new era for machine learning in biological sciences
The speakers express optimism and excitement about the current state and future of machine learning in biology, drawing analogies to a coming 'ChatGPT moment' and suggesting we are at the very beginning of a revolution. While acknowledging progress in areas like protein structure prediction, they stress that solving individual AI problems will not by itself produce better therapeutics. Noetik's focus on patient-level foundation models for precise treatment selection represents one specific, crucial slice of that larger endeavor. They encourage broader engagement with ML in the biological sciences: significant, humanity-impacting problems remain that require innovative ML solutions and considerable data generation, and the field is ripe for contributions from those interested in tackling complex, frontier ML challenges with the potential for profound impact.
Common Questions
What problem is Noetik trying to solve?
Noetik aims to solve the high failure rate (90-95%) of cancer drug trials, which it believes is primarily due to poor patient selection rather than issues with drug discovery or target identification.
Mentioned in this video
Noetik: The company founded with a contrarian thesis to address the high failure rate of cancer drugs by improving patient selection through AI models.
FDA: The Food and Drug Administration, which historically required animal data for new drug mechanisms, posing a challenge for companies with strong human-derived data.
H&E staining: A standard pathology stain used to visualize tissue structures, which pathologists use to classify tumors and which Noetik uses as a primary input for its models.
Spatial transcriptomics: A technique that measures RNA expression within its spatial context in a tissue sample, providing molecular information at a cellular or sub-cellular level.
OctoVC: Noetik's first virtual cell foundation model, trained using masked autoencoding and licensed to GSK.
Tario: A newer transformer architecture developed by Noetik, using an autoregressive training objective similar to LLMs and showing improved scaling behavior.
ImageNet: A large, curated image dataset that was crucial for the advancement of deep learning in computer vision.
LLMs: Large language models, whose success with next-token prediction training is a point of comparison for scaling AI models in other modalities like biology.
BERT: A language model trained with masked autoencoding, used as a comparison point for Noetik's earlier masked-autoencoder training objective.
Ron Alfa: Co-founder and CEO of Noetik, a physician-scientist by training.
swyx (Shawn Wang): Co-host of the Latent Space podcast and interviewer.
Alessio Fanelli: Co-host of the Latent Space podcast and interviewer.
Daniel Bear: VP of AI at Noetik, with a background in biology, neuroscience, computer vision, and self-supervised learning.