Training Transformers to Solve the 95% Failure Rate of Cancer Trials – Ron Alfa & Daniel Bear, Noetik
Key Moments
95% of cancer drugs fail clinical trials, but Noetik argues this is a patient-selection problem, not a drug-discovery one, and is leveraging AI to identify optimal patient cohorts for existing treatments.
Key Insights
The vast majority of cancer drugs, around 95%, fail clinical trials, primarily because of inadequate patient selection.
Noetik's approach focuses on understanding patient biology from multimodal data in order to position molecules in the right patient populations, potentially improving the success rates of existing treatments.
The company generates its own high-quality, intentionally designed multimodal data, arguing that biological data is not yet at the scale seen in AI fields like NLP, so intentional design is crucial.
Noetik's models are trained on a massive dataset of over 100 million spatially resolved cells with paired H&E, protein, and spatial transcriptomics data, which the company claims is an order of magnitude larger than comparable datasets.
A $50 million deal with GSK was announced, involving the licensing of Noetik's virtual cell foundation model (OctoVC) and highlighting a shift toward AI model licensing in biotech.
Noetik takes a 'world model' approach, aiming to simulate biological processes and predict the outcomes of actions such as gene knockdowns, validated through in-vivo mouse perturbation experiments and in-silico humanization.
The staggering failure rate of cancer drugs and Noetik's contrarian thesis
A staggering 95% of cancer drugs fail in clinical trials, a statistic that alarms the pharmaceutical industry. Noetik, a biotech company, posits that this failure rate is not due to poor pharmacology or target selection, but rather to the inability to accurately identify which patients will respond to specific treatments. Its core thesis is that models capable of deeply understanding patient biology from the outset can fundamentally improve patient selection, thereby increasing the success rates of existing and new therapies. This approach enables the discovery of new targets from patient data, the precise positioning of molecules within the right patient populations, and even the rescue of trials with poor initial outcomes by identifying responsive subgroups.
The limitations of traditional preclinical models
Traditionally, drug development relies heavily on experiments in cell lines and animal models, and the conversation highlights significant limitations of both. Cancer cell lines are often immortalized, with abnormal genomes and gene expression patterns that do not accurately reflect human tumors. Animal models, while useful, often fail to translate to human biology. Pharmaceutical companies may test drugs against hundreds of cell lines, but mapping that data back to human patients is challenging. As a result, by the time a drug reaches clinical trials, the trial design is often based on limited insight into which patients might benefit, leading to broad, unfocused enrollment and subsequent failure. Noetik aims to bridge this gap by leveraging patient data directly.
Generating high-quality, multimodal data for AI training
Noetik emphasizes the critical importance of intentional data generation. Rather than simply collecting existing datasets, the company meticulously sources and processes human tumor samples in its own lab to generate multimodal data. This matters because, as the founders argue, the scale of high-quality biological data is not yet on par with fields like natural language processing. They draw a parallel to ImageNet, the curated, large-scale dataset that catalyzed deep learning in computer vision. Noetik's data generation strategy follows specific design principles, such as leveraging images (H&E stains and multiplex fluorescence) for their scalability and information richness. The company also controls for variables like batch effects by ensuring each patient's sample is represented across multiple slides and experimental batches (see the sketch below), enabling more robust downstream analysis and model generalization.
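To make the batch-control idea concrete, here is a minimal sketch in Python of how replicate tissue sections might be spread across experimental batches so that no patient is confounded with a single batch. The function name and numbers are hypothetical illustrations, not Noetik's actual pipeline.

```python
import random

def assign_sections_to_batches(patient_ids, n_batches, sections_per_patient=3, seed=0):
    """Spread each patient's replicate tissue sections across distinct
    experimental batches, so patient identity is never confounded with
    a single batch. Hypothetical illustration of the principle above."""
    rng = random.Random(seed)
    assignments = {}  # patient_id -> list of batch indices
    for pid in patient_ids:
        # Pick distinct batches for this patient's replicate sections.
        assignments[pid] = rng.sample(range(n_batches), k=sections_per_patient)
    return assignments

# Example: 4 patients, 6 experimental batches, 3 sections per patient.
print(assign_sections_to_batches(["P1", "P2", "P3", "P4"], n_batches=6))
```

With a layout like this, a downstream model can be checked for batch invariance: representations of the same patient drawn from different batches should agree.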
The rich tapestry of Noetik's multimodal data
The data Noetik generates integrates information from several biological layers to provide a comprehensive view of a patient's tumor. It includes standard H&E (hematoxylin and eosin) staining, which captures the tissue-level structure visible to pathologists. To identify cell types, multiplex fluorescence imaging with antibodies distinguishes immune cells and other cell markers. Crucially, spatial transcriptomics provides spatially resolved RNA expression, revealing the activity of thousands of genes within specific cells at specific locations. This is complemented by DNA genotyping to capture genomic alterations. The combination of tissue structure, cellular composition, and molecular activity yields an information-dense dataset, conceptualized as a stack of image-like layers, each representing a different biological modality: from RGB images to multi-channel fluorescence to high-dimensional transcriptomic data.
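One way to picture this "stack of image-like layers" is as spatially aligned tensors concatenated along the channel axis. The sketch below uses illustrative shapes and channel counts; they are assumptions for the example, not Noetik's actual specification.

```python
import numpy as np

# A toy multimodal "stack" for one tissue region. Shapes and channel
# counts are illustrative only.
H, W = 256, 256

he_rgb = np.zeros((3, H, W), dtype=np.float32)         # H&E stain, 3 channels
protein = np.zeros((16, H, W), dtype=np.float32)       # multiplex fluorescence, ~16 markers
transcripts = np.zeros((500, H, W), dtype=np.float32)  # spatial gene expression, 500 genes

# Concatenating along the channel axis yields one spatially aligned tensor
# where every pixel carries structural, cell-type, and molecular information.
sample = np.concatenate([he_rgb, protein, transcripts], axis=0)
print(sample.shape)  # (519, 256, 256)
```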
Bridging the 'virtual cell' gap with practical, patient-centric models
The concept of a 'virtual cell' can be approached in two ways: simulating all biochemical processes exhaustively, or building functional heuristics for drug development. Noetik leans toward the latter, viewing a virtual cell as a tool for understanding cell biology in a way that is useful for making drugs. Its models simulate cell biology within specific contexts, predicting outputs like transcriptomes or protein levels. This enables experimental design and simulation, answering questions such as how a cell's transcriptomic state changes in response to a drug or genetic perturbation. While some virtual cell models focus on single-cell gene expression in vitro, Noetik's approach integrates data directly from patients, on the belief that this is more likely to translate to clinical outcomes than purely in vitro systems. The company prioritizes learning basic human biology from patient data rather than being biased by external clinical notes, aiming to discover underlying biological truths.
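As a rough illustration of such a counterfactual query, the sketch below zeroes out one gene in a cell's input expression and compares the model's predicted transcriptome before and after. The `simulate_knockdown` helper and the stand-in model are hypothetical; any trained virtual-cell model exposing a predict-expression interface could slot in.

```python
import numpy as np

def simulate_knockdown(model, expression, gene_index):
    """Counterfactual query against a trained virtual-cell model: zero
    out one gene's input expression and compare predictions before and
    after. Hypothetical sketch of the idea described above."""
    baseline = model(expression)
    perturbed_input = expression.copy()
    perturbed_input[gene_index] = 0.0  # in-silico knockdown
    perturbed = model(perturbed_input)
    return perturbed - baseline  # predicted shift in transcriptomic state

# Stand-in "model": any callable mapping expression -> predicted expression.
toy_model = lambda x: np.tanh(x)
delta = simulate_knockdown(toy_model, np.random.rand(500), gene_index=42)
print(delta.shape)  # (500,)
```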
From data clusters to clinical decisions: Noetic's model deployment
Once trained on rich multimodal data, Noetik's models produce 'latent spaces' that cluster patients by their biological profiles, and these clusters can guide drug development and clinical trial design. For instance, if a pharmaceutical company has a molecule targeting a specific pathway, Noetik can simulate its effect on different patient cohorts within the models to predict which groups are most likely to respond. This goes beyond simple mutation status, identifying patient subtypes driven by complex biology, whether genetic, immune-related, or otherwise. The models can predict which sets of patients would benefit from a target, or how immune cells like T cells would behave in a specific patient's tumor microenvironment. The simplest use case is analyzing data from past trials in which responders and non-responders have already been identified: the model maps these groups to distinct clusters and generates hypotheses about why they responded differently, often with surprising interpretability down to the level of gene expression.
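A minimal sketch of that retrospective-trial use case, assuming patient embeddings from a trained model are already available: cluster the latent space, then ask whether responders concentrate in particular clusters. All data and names here are stand-ins.

```python
import numpy as np
from sklearn.cluster import KMeans

# Stand-in patient embeddings and trial outcomes; in practice the
# embeddings would come from a trained multimodal model.
rng = np.random.default_rng(0)
latents = rng.normal(size=(200, 64))   # 200 patients, 64-dim latent space
responded = rng.random(200) < 0.3      # stand-in responder labels

# Cluster the latent space, then compare response rates per cluster.
clusters = KMeans(n_clusters=4, n_init=10, random_state=0).fit_predict(latents)
for c in range(4):
    mask = clusters == c
    print(f"cluster {c}: {mask.sum():3d} patients, "
          f"response rate {responded[mask].mean():.2f}")
```

A cluster with a markedly higher response rate becomes a hypothesis about a biologically distinct responder subgroup, which can then be inspected down to gene-expression level.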
The power of H&E and self-supervised learning for generalization
A key advantage of Noetik's approach is that inference relies only on H&E-stained images, a ubiquitous and standard pathology stain. Models trained on rich multimodal data can thus make predictions from just a digital H&E image, which is readily available from clinical trials and hospital archives. This flexibility is powerful because H&E is the common language of pathology. From these images, the models can predict gene expression patterns at specific locations within a tumor and classify patients into clusters representing responders and non-responders. Interpretability is crucial here: if predicted responders also express the drug's target protein, that corroborates the model's findings. It also shows why simple, single-biomarker approaches fall short, since the complex biological variation predictive of therapeutic response is only captured by the multimodal data the models analyze.
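Conceptually, inference then only needs tiled H&E input. The sketch below slides a window over a digital H&E image and asks a hypothetical model for per-tile gene expression; the tile size and the 500-gene output are assumptions made for the example.

```python
import numpy as np

def predict_expression_from_he(he_model, he_image, tile=64):
    """Slide a window over a digital H&E image (channels-first) and
    predict per-tile gene expression, so only the ubiquitous H&E stain
    is needed at inference time even though training used richer
    modalities. Hypothetical sketch."""
    _, H, W = he_image.shape
    preds = {}
    for y in range(0, H - tile + 1, tile):
        for x in range(0, W - tile + 1, tile):
            patch = he_image[:, y:y + tile, x:x + tile]
            preds[(y, x)] = he_model(patch)  # e.g. a 500-gene expression vector
    return preds

# Stand-in predictor: returns a flat 500-gene vector per patch.
toy_he_model = lambda patch: np.full(500, patch.mean())
preds = predict_expression_from_he(toy_he_model, np.zeros((3, 256, 256)))
print(len(preds))  # 16 tiles
```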
Building a data moat and advanced transformer architectures
Noetik differentiates itself through a significant 'data moat': the sheer scale and quality of its training data. The company has generated over 100 million spatially resolved cells with paired H&E, protein, and spatial transcriptomics, an order of magnitude beyond existing datasets. This scale is critical: they have observed that reducing training data by even 10-40% significantly degrades model performance and generalization. Complementing the data advantage are custom transformer architectures. Their newer model, Tario, advances beyond the earlier masked-autoencoding approach (used in OctoVC) by adopting an autoregressive, next-token prediction objective similar to LLMs. This architectural choice, combined with longer context lengths (seeing more tissue at once), enables better scaling and performance, particularly in capturing complex, non-linear patterns in spatial transcriptomics and inferring biological state from larger tissue regions. It also lets them simulate counterfactual perturbations without running in vitro experiments.
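The difference between the two training objectives can be shown on a toy sequence of "cell tokens". This is purely illustrative: the real models are transformers over spatial multimodal tokens, and the token construction here is an assumption made for the example.

```python
import numpy as np

tokens = np.arange(10)  # stand-in token ids for cells in a tissue region
rng = np.random.default_rng(0)

# Masked autoencoding (the OctoVC-style objective): hide a random subset
# of tokens and predict them from the visible remainder, using
# bidirectional context.
masked = rng.choice(10, size=3, replace=False)
mae_inputs = [t if i not in masked else "[MASK]" for i, t in enumerate(tokens)]
mae_targets = tokens[masked]

# Autoregressive next-token prediction (the Tario-style objective, as in
# LLMs): at each position, predict token i+1 from tokens 0..i only.
ar_inputs = tokens[:-1]
ar_targets = tokens[1:]

print(mae_inputs, mae_targets, ar_inputs, ar_targets, sep="\n")
```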
In-vivo perturbations and 'in-silico humanization' for cross-species validation
To validate its models beyond human patient data, Noetik runs sophisticated in-vivo perturbation experiments using a platform called Perturb-Map. Barcoded gene knockouts are created in cancer cells, which are then injected into mice; each mouse can host hundreds of tumors carrying different genetic perturbations. By spatially resolving the biology of these barcoded tumors, the team can map human tumor biology onto mouse models and validate its predictive models. A key innovation is the 'in-silico humanization' technique, which translates mouse transcriptomic readouts into their human gene equivalents. This allows human biology to be inferred directly from mouse experiments, addressing the differing genetic landscapes of the two species and enabling more direct connectivity between mouse systems and human biology, ultimately supporting drug development with biologically relevant targets and insights transferable to humans.
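At its simplest, the humanization step can be thought of as re-keying a mouse expression profile through an ortholog table. The three-gene mapping below is a tiny hypothetical excerpt; real pipelines use full ortholog databases and must handle genes without clean one-to-one matches.

```python
# Hypothetical excerpt of a mouse-to-human ortholog table.
MOUSE_TO_HUMAN = {"Trp53": "TP53", "Cd8a": "CD8A", "Pdcd1": "PDCD1"}

def humanize(mouse_expression):
    """Re-key a {mouse_gene: value} expression profile to human gene
    symbols, dropping genes without a one-to-one ortholog. Minimal
    sketch of the 'in-silico humanization' idea described above."""
    return {
        MOUSE_TO_HUMAN[g]: v
        for g, v in mouse_expression.items()
        if g in MOUSE_TO_HUMAN
    }

print(humanize({"Trp53": 2.1, "Cd8a": 0.7, "Xist": 5.0}))
# -> {'TP53': 2.1, 'CD8A': 0.7}  (Xist has no one-to-one entry here)
```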
The GSK deal and the future of AI model licensing in biotech
Noetik recently announced a significant $50 million deal with GSK involving the licensing of its OctoVC virtual cell foundation model. The agreement, which includes an upfront payment, milestones, and annual licensing fees, highlights a growing trend of AI model licensing in the biopharma sector. GSK gains access to Noetik's models pre-trained on lung and colon cancer, which it can use for internal research and fine-tune on its vast internal translational datasets. This marks a shift from traditional molecule-based collaborations to model-centric business development, recognizing the value of foundational AI models in accelerating drug discovery and development across pipelines. The appetite for such deals is driven by pharma's recognition of AI's capabilities, the increasing availability of data, and the potential for broad licensing across multiple therapeutic programs.
The crucial role of data scale and conviction in AI for biology
The conversation repeatedly returns to the necessity of generating substantial, high-quality data for AI in biology. Noetik started lab operations and data generation from scratch, and it took about 18 months before the company had enough data to train its first models effectively. The process is data-intensive and expensive, requiring significant upfront investment and conviction. Unlike fields with massive public datasets, biotech often necessitates in-house data generation, and the speakers emphasize that there is a critical threshold below which AI models will not yield meaningful signal. Companies must have the conviction to collect data at scale, anticipating future algorithmic advances. This mirrors historical scientific progress, like Tycho Brahe's extensive astronomical observations later enabling Kepler and Newton to formulate their laws. The open question is who generates this data and who captures its value, with Noetik betting on disciplined, high-quality data generation as its core differentiator.
A new era for machine learning in biological sciences
The speakers express optimism and excitement about the current state and future of machine learning in biology, drawing analogies to a coming 'ChatGPT moment' and suggesting we are at the very beginning of a revolution. While acknowledging progress in areas like protein structure prediction, they stress that solving individual AI problems will not by itself produce better therapeutics. Noetik's focus on patient-level foundation models for precise treatment selection represents one specific, crucial slice of that larger endeavor. They encourage broader engagement with ML in the biological sciences: significant, humanity-impacting problems remain that require innovative ML solutions and considerable data generation, and the field is ripe for contributions from those interested in tackling complex, frontier ML challenges with the potential for profound impact.
Common Questions
What problem is Noetik trying to solve?
Noetik aims to solve the high failure rate (90-95%) of cancer drug trials, which it believes is primarily due to poor patient selection rather than issues with drug discovery or target identification.
Mentioned in this video
Noetik: The company founded with a contrarian thesis to address the high failure rate of cancer drugs by improving patient selection through AI models.
FDA: The Food and Drug Administration, which historically required animal data for new drug mechanisms, posing a challenge for companies with strong human-derived data.
H&E staining: A standard pathology stain used to visualize tissue structures, which pathologists use to classify tumors and which Noetik uses as a primary input for its models.
Spatial transcriptomics: A technique that measures RNA expression within its spatial context in a tissue sample, providing molecular information at a cellular or sub-cellular level.
OctoVC: Noetik's first virtual cell foundation model, trained using masked autoencoding and licensed to GSK.
Tario: A newer transformer architecture developed by Noetik, using an autoregressive training objective similar to LLMs and showing improved scaling behavior.
ImageNet: A large, curated image dataset that was crucial for the advancement of deep learning in computer vision.
LLMs: Large language models, whose success with next-token prediction training is a point of comparison for scaling AI models in other modalities like biology.
BERT: A language model trained with masked autoencoding, used as a comparison point for Noetik's earlier masked-autoencoder training objective.
Ron Alfa: Co-founder and CEO of Noetik, a physician-scientist by training.
swyx (Shawn Wang): Co-host of the Latent Space podcast and interviewer.
Alessio Fanelli: Co-host of the Latent Space podcast and interviewer.
Daniel Bear: VP of AI at Noetik, with a background in biology, neuroscience, computer vision, and self-supervised learning.