5 Papers That Show Where AI Research Is Heading Right Now
Key Moments
New AI models for protein biology show scaling laws hold similarly to language models, but require vast amounts of data and can perform comparably to hand-engineered systems like AlphaFold.
Key Insights
Scaling laws in protein biology, similar to those for language models, show improvements with increased compute and data, with ESM Cambrian models improving steadily rather than plateauing.
The ESM Cambrian model achieved near-parity with AlphaFold 3 in protein structure prediction without using multiple sequence alignments (MSAs), and even surpassed it in antibody design tasks.
Self-play for LLMs, while promising for unbounded learning, currently plateaus due to the conjecturer generating overly complex and unhelpful tasks, requiring a 'guide' component to ensure relatedness and relevance.
Streaming RAG reduces latency in voice AI by analyzing spoken words in chunks and running retrieval while the query is still being formed, decreasing latency by up to 1.5 seconds without sacrificing accuracy.
Lean, a formal verification system, is enabling a new era of 'verified intelligence' by allowing AI to generate and verify complex mathematical proofs and potentially ensure code correctness and scientific reproducibility.
Agentic programming in software engineering is compared to real-time strategy games, emphasizing maximal parallelization, high visibility, continuous feedback, and satisficing over perfection, with a reported output increase of up to 3.5x per engineer per month.
Scaling laws in biology mirror language models, but data is key
The research presented suggests that scaling laws, a fundamental driver of progress in language models, also apply to protein biology. Models like ESM Cambrian show that with increased parameters and, crucially, massive datasets (e.g., 2.8 billion sequences compared to 50 million in prior ESM2 models), performance continues to improve without plateauing. This reflects the "bitter lesson" described by Richard Sutton: general methods that leverage scale often outperform hand-engineered domain knowledge. In protein biology, this translates to training models on vast evolutionary sequence data to predict protein structure and function. The paper highlights that biology's "training data" (evolutionary sequences) is orders of magnitude larger than human-generated text, suggesting immense potential for continued scaling. While LLM scaling laws are well understood, their application to biology required validation. The presented work tests this by training models on protein sequences, treating amino acids as tokens. A key metric is the prediction of 'long-distance contacts' in proteins, a proxy for how well the model captures protein structure. The ESM Cambrian model, trained on extensive metagenomic data, demonstrated a clear log-linear improvement curve with compute, extrapolating cleanly from smaller training runs. This suggests that the principles of scaling compute and data are transferable, and that biology, like language, benefits from these general AI principles. The implication is that continued investment in data collection and model scale will yield predictable improvements in biological AI tasks.
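To make "log-linear improvement with compute, extrapolating from smaller runs" concrete, here is a minimal sketch of fitting such a curve and extrapolating one step further. The data points, the function names `fit_log_linear` and `predict`, and the score scale are all hypothetical illustrations, not values or code from the paper.

```python
import math

def fit_log_linear(compute, score):
    """Ordinary least-squares fit of score = slope * log10(compute) + intercept."""
    xs = [math.log10(c) for c in compute]
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(score) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, score))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept

def predict(compute, slope, intercept):
    """Extrapolate the fitted line to a new compute budget."""
    return slope * math.log10(compute) + intercept

# Hypothetical small-run measurements (training FLOPs, contact-prediction score).
compute_flops = [1e18, 1e19, 1e20, 1e21]
contact_score = [0.40, 0.50, 0.60, 0.70]

slope, intercept = fit_log_linear(compute_flops, contact_score)
predicted = predict(1e22, slope, intercept)
print(f"gain per 10x compute: {slope:.3f}, predicted score at 1e22 FLOPs: {predicted:.3f}")
```

A fit like this is only as trustworthy as the assumption that the trend continues; comparing the extrapolated value against an actual larger run is what would validate it.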
AI models match and surpass specialized protein folding systems
A significant advancement discussed is the ability of general protein language models, trained solely on sequence data, to perform comparably to, or even outperform, highly specialized systems like AlphaFold 3, which rely on hand-engineered features such as Multiple Sequence Alignments (MSAs). The ESMFold 2 model, using only per-residue embeddings from the language model as input to a structure predictor, achieved near-parity in general protein complex prediction. More strikingly, it outperformed AlphaFold 3 on antibody design tasks, an area critical for drug development. This success underscores the 'bitter lesson' by showing that a generalist model can rival or beat specialist systems when trained at scale. The general approach is also faster and applies to areas where MSAs are scarce, such as novel antibody design. The paper also noted that the general model performs well even when MSAs are available, and that performance can be further improved by scaling inference-time compute (e.g., using looped refinement networks). This suggests a paradigm shift in which general protein language models serve as powerful foundation models, reducing reliance on lengthy, specialized feature engineering for many biological tasks. Interpretable features found in the model's latent space, corresponding to biological motifs and functions, further support this claim, indicating that the model learns biological principles without explicit supervision.
Self-play for LLMs struggles with task generation quality
Self-play, inspired by systems like AlphaZero, offers a path to unbounded learning for LLMs by having the model generate and solve its own tasks, moving beyond human-generated data. However, a paper on 'Scaling Selfplay with Selfguidance' revealed a critical flaw: the 'conjecturer' model, tasked with generating challenging problems, tends to produce overly complex, artificial, and unhelpful tasks. Because it is rewarded simply for difficulty, it produces contrived problems that do not improve the 'solver' model's capabilities on truly useful tasks. For instance, in formal mathematics, the conjecturer generated extremely convoluted problem statements that were effectively noise from the standpoint of genuine problem-solving. As a result, self-play performed no better than standard Reinforcement Learning (RL) baselines and failed to progress beyond an asymptote, such as solving only 60% of formal math problems. To address this, the researchers introduced 'Self-Guided Selfplay' (SGS). SGS incorporates a 'guide' component that acts as a judge, evaluating whether generated synthetic tasks are genuinely related to a set of target problems (initially unsolved problems) and are not artificially complex. The conjecturer is then updated with a dual reward: one for task difficulty and another for the guide's score. This approach grounds the synthetic data generation in meaningful problem distributions and penalizes the creation of "junk" tasks. While SGS showed improvement, letting a smaller model match the performance of a much larger one, it did not fully solve the problem, indicating that refining task generation remains a significant challenge for self-play in LLMs.
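The dual reward can be sketched roughly as follows. The summary does not say how the two terms are combined, so the weighted sum, the `guide_weight` parameter, the use of `1 - solve_rate` as the difficulty signal, and all names here are assumptions made purely for illustration.

```python
from dataclasses import dataclass

@dataclass
class TaskFeedback:
    solve_rate: float   # fraction of solver attempts that succeeded, in [0, 1]
    guide_score: float  # guide's judgement of relevance and naturalness, in [0, 1]

def conjecturer_reward(fb: TaskFeedback, guide_weight: float = 0.5) -> float:
    """Assumed combination: weighted sum of difficulty and guide score."""
    difficulty = 1.0 - fb.solve_rate  # harder tasks are solved less often
    return (1.0 - guide_weight) * difficulty + guide_weight * fb.guide_score

# Hypothetical tasks: one contrived and unsolvable, one hard but on-target.
contrived = TaskFeedback(solve_rate=0.0, guide_score=0.1)
relevant = TaskFeedback(solve_rate=0.3, guide_score=0.9)

print(conjecturer_reward(contrived), conjecturer_reward(relevant))
```

With `guide_weight=0` the reward collapses to difficulty alone and the contrived task wins, which is the failure mode described above; adding the guide term reverses that ordering.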
Streaming RAG enhances voice AI responsiveness
Latency is a major hurdle for natural conversational AI, especially in voice applications where rapid responses are expected. Traditional Retrieval-Augmented Generation (RAG) systems, while reducing hallucinations, add significant delay. A paper on 'Streaming RAG' proposes a solution by analyzing spoken words in real-time and initiating the RAG pipeline *while* the user is still speaking their query. This approach aims to reduce the overall interaction time by overlapping speech recognition, retrieval, LLM processing, and response generation. The core idea is to avoid waiting for the complete utterance before starting the retrieval process. The paper explores two methods: fixed-interval streaming RAG, which runs RAG on sequential audio chunks, and a more sophisticated approach that fine-tunes a model to dynamically decide when to trigger the RAG system based on the relevance and novelty of the incoming speech. This decision-making process can be based on factors like the quality of retrieval from partial queries or the semantic content of the partial utterance. Results showed latency reductions of up to 1.5 seconds on human datasets with comparable accuracy to standard RAG, making conversations feel more natural and responsive. This research highlights the importance of addressing these practical engineering challenges to unlock the full potential of conversational AI.
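The fixed-interval variant can be illustrated with a toy simulation: retrieval runs every few words on the partial transcript instead of waiting for the full utterance. The keyword-overlap `retrieve` function, the three-document corpus, and the interval of three words are stand-ins chosen for this sketch, not details from the paper; a real system would stream from a speech recognizer into a proper retriever.

```python
def retrieve(query, corpus, top_k=1):
    """Toy keyword-overlap retriever standing in for a real search backend."""
    q = set(query.lower().split())
    ranked = sorted(corpus, key=lambda doc: len(q & set(doc.lower().split())), reverse=True)
    return ranked[:top_k]

def fixed_interval_streaming_rag(words, corpus, interval=3):
    """Run retrieval every `interval` words, and once more on the final word."""
    partial, snapshots = [], []
    for i, word in enumerate(words, start=1):
        partial.append(word)
        if i % interval == 0 or i == len(words):
            query = " ".join(partial)
            snapshots.append((query, retrieve(query, corpus)))
    return snapshots

corpus = [
    "store opening hours and holidays",
    "refund policy for returned items",
    "shipping times for international orders",
]
words = "what is your refund policy for returned shoes".split()

snapshots = fixed_interval_streaming_rag(words, corpus)
for query, docs in snapshots:
    print(f"{query!r} -> {docs}")
```

In this example the retrieval triggered after six words already returns the same document as the final, full-utterance retrieval, which is where the latency saving would come from. The first trigger has too little signal to be useful, which hints at why a learned trigger policy may beat a fixed interval.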
Lean enables verified intelligence and rigorous AI for science
The increasing success of AI in solving complex mathematical problems, such as gold-medal performance at the IMO and the proof of an 80-year-old conjecture, highlights a growing need for formal verification. Lean, a theorem prover and functional programming language, is at the forefront of this movement, enabling what's termed 'verified intelligence.' Unlike informal mathematics, which is flexible but prone to errors, Lean requires explicit, rigorous proofs that are checked mechanically and cannot be fooled by plausible-sounding arguments. This rigor is crucial not only for validating AI-generated proofs but also for ensuring the correctness and reproducibility of AI in science and software development. Lean's power lies in its expressivity, its support for general programming (including I/O and meta-programming), and its extensive formalized math library. Tools like 'TorchLean' allow neural networks to be formalized directly within Lean, enabling the verification of properties like certified robustness and even the correctness of highly optimized operations like FlashAttention. This capability extends to building verifiable code, ensuring that generated code meets rigorous specifications, a significant advancement over current LLMs that primarily focus on code generation without guaranteed correctness. The vision is a future where scientific discoveries and software are built upon a foundation of formal guarantees, increasing trust and reliability across AI applications.
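As a small illustration of what "explicit, rigorous proofs" means in practice, here is a minimal Lean 4 sketch. The theorem name `add_comm_example` is made up for this example; `Nat.add_comm` is a standard library lemma. The point is only that Lean accepts a statement when, and only when, the supplied proof type-checks.

```lean
-- Accepted only because `Nat.add_comm a b` is a valid proof of the stated equality.
theorem add_comm_example (a b : Nat) : a + b = b + a :=
  Nat.add_comm a b

-- A concrete claim checked by evaluation; a false claim such as 2 + 2 = 5
-- would be rejected rather than silently accepted.
example : 2 + 2 = 4 := by decide
```

The same checking applies whether a proof is written by a person or generated by a model, which is why it works as a verification layer for AI output.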
Agentic programming mirrors RTS games for maximum productivity
Agentic programming, which uses AI agents for software development, is likened to playing real-time strategy (RTS) games, demanding a shift in traditional programming assumptions. Instead of linear, thoughtful design, the focus is on hyper-parallelization, continuous feedback, and 'satisficing': achieving 'good enough' rather than perfect results. This approach, exemplified by Channel AI's workflow, involves numerous autonomous agents working in parallel, managed by an orchestrator. Key practices derived from RTS include running all work on cloud instances for portability, documenting extensively to aid future agents, and prioritizing high agent "actions per minute" (tool calls per minute) over human intervention speed. This methodology encourages a mindset akin to managing an RTS army: deploying many agents simultaneously, providing minimal but timely course correction, and using audio-visual cues for high-level monitoring rather than deep dives into each agent's progress. Errors are expected and factored into the workflow, with corrections made early to save overall time. The goal is to maximize parallel execution and continuously push work forward, akin to constantly producing units and micro-managing tasks across the map. Adopting these principles reportedly led to significant increases in output, such as a 60% growth in PRs per engineer per month, suggesting that optimizing for high throughput and iterative progress is more effective than striving for initial perfection in agent-assisted development.
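The orchestrator pattern described here (many agents in parallel, failures expected and surfaced early) can be sketched with Python's standard thread pool. The `run_agent` function is a hypothetical placeholder that fails on purpose for one task; a real agent would call a model and tools. None of this reflects the actual tooling used in the episode.

```python
from concurrent.futures import ThreadPoolExecutor, as_completed

def run_agent(task):
    """Hypothetical agent; raises on some tasks to simulate an expected failure."""
    if "flaky" in task:
        raise RuntimeError(f"agent failed on {task!r}")
    return f"draft PR for {task}"

def orchestrate(tasks, max_parallel=4):
    """Run all tasks concurrently and collect results as each agent finishes."""
    done, failed = {}, {}
    with ThreadPoolExecutor(max_workers=max_parallel) as pool:
        futures = {pool.submit(run_agent, task): task for task in tasks}
        for future in as_completed(futures):
            task = futures[future]
            try:
                done[task] = future.result()
            except RuntimeError as err:
                # Record the failure for quick human course correction
                # instead of stopping the other agents.
                failed[task] = str(err)
    return done, failed

done, failed = orchestrate(["add login page", "fix flaky test", "update docs"])
print("done:", done)
print("needs attention:", failed)
```

Collecting failures separately rather than aborting mirrors the satisficing idea: the other agents keep producing work while a human looks only at what needs attention.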
Common Questions
The 'Bitter Lesson' refers to the observation stated by Richard Sutton that AI research progresses most effectively by scaling compute and data, rather than relying on hand-engineered human knowledge.
Mentioned in this video
AI token maxer, presenter discussing his work.
Presenter discussing Lean for science, a PhD student at Caltech.
Mentioned in the context of continuous learning and monotonic improvement.
Co-advisor of the speaker at Stanford, known for work in bioengineering and former director of Biohub.
Author of the famous 'Bitter Lesson' article in AI.
Alumnus from the speaker's lab whose work on looped models is built upon in the ESM projection networks.
Company that released the Composer 2 technical report.
Company where speaker Arnob is a researcher, working on Streaming RAG.
Low-Rank Adaptation, discussed as an alternative for fine-tuning, showing impressive performance at lower sample sizes.
Company that released work on solving mathematical problems and uses formal verification.
Company whose AI model solved all problems in the Putnam competition.
Company behind the 'Fields Medal work' related to AI and math.
The startup founded by Luke Worthwine, focused on consumer entertainment AI and automating development.
Company that claimed to solve an 80-year-old mathematical problem using AI.
Mentioned as a predecessor to AlphaZero and a benchmark in AI development.
Previous generation of protein language models from the same group, which showed diminishing returns with parameter scaling.
A landmark AI model in biology for protein structure prediction, which relies on multiple sequence alignments (MSAs).
Technical report from Cohere illustrating the benefits of scaling RL tasks.
A theorem prover and programming language used for formal mathematics and verification.
A framework using Lean as a functional programming language to help LLMs prove code.
A deep learning framework whose style is used for the TorchLean system.
Mentioned as an orchestrator agent for software development and as a source for generating presentations and code.
Mentioned as an alternative orchestrator agent for software development.
Version control system, with its work trees discussed as a useful tool for parallel development.
A project started by Clark Barrett's group at Stanford for contributing to CS concepts.
A unified framework for writing and verifying neural networks in Lean.