The 10,000x Yolo Researcher Metagame — with Yi Tay of Reka
Yi Tay of Reka discusses LLM research, Reka's models, Google Brain experience, and AI trends.
Key Insights
Reka Core's success demonstrates that smaller, well-funded labs can compete with giants in LLM development.
The transition in AI research has shifted from task-specific fine-tuning to large, general-purpose foundation models.
Yi Tay's career path highlights prioritizing impactful, promising research over rigid planning or trend-chasing.
Compute reliability and efficient orchestration are significant operational challenges in LLM training.
Mixture-of-Experts (MoE) architectures are a promising direction for balancing performance and computational cost.
The debate between open-source and closed-source LLMs is complex, with incentives playing a major role.
JOURNEY FROM GOOGLE BRAIN TO REKA
Yi Tay built his reputation at Google Brain, where he co-led PaLM 2, invented UL2, and contributed significantly to efforts like Flan and the Bard core team. His move to Reka in March 2023, followed by a $58 million Series A announced in June 2023, signals a strategic bet on building universal intelligence through multimodal and multilingual agents. Reka's rapid model releases, including Flash, Core, and Edge, highlight its ambitious goals around self-improving AI and model efficiency, underscoring a sharp focus on impactful research.
EVOLUTION OF LLM RESEARCH PARADIGMS
Tay observes a significant shift in machine learning research, moving from task-specific fine-tuning of models like T5 and BERT to the current paradigm of large, general-purpose foundation models. He notes that while the underlying principles of Transformer architecture and research haven't fundamentally changed, the scale of compute and data has dramatically increased. This evolution, accelerated by events like the ChatGPT launch, has redefined AI research goals towards universal intelligence rather than domain-specific optimizations.
STRATEGIC CAREER GROWTH AND RESEARCH PHILOSOPHY
Reflecting on his career, Tay emphasizes a philosophy of optimizing for impact and promising research rather than proactively chasing trends. His involvement in PaLM 2's development stemmed organically from the success of his personal project, UL2. This approach, combined with strong collaborations and an open-minded adaptability to the rapidly shifting field, allowed him to navigate complex research landscapes and contribute to significant breakthroughs.
CHALLENGES IN LLM INFRASTRUCTURE AND TRAINING
Building state-of-the-art LLMs at Reka surfaced significant operational challenges, particularly around compute reliability. Tay describes the frustrating experience of GPU delivery delays and unreliable hardware, which significantly impacted training runs. The decision to use GPUs over TPUs came down to team familiarity and the infrastructure available outside Google. He argues that compute providers should offer better risk-sharing models: brittle infrastructure can devastate a startup by wasting precious training time and resources, and the constant anxiety takes a toll on work-life balance.
ARCHITECTURAL INNOVATIONS AND MODEL DESIGN
Reka's models, such as Reka Core and Flash, incorporate advanced architectural choices, including aspects of the 'Noam architecture': gated linear units (GLU variants like SwiGLU), grouped-query attention (GQA), and RoPE position embeddings. Tay considers GQA a no-brainer for its inference benefits and appreciates RoPE for its extrapolation properties. He also discusses the nuanced benefits of encoder-decoder architectures, noting that their intrinsic sparsity allows greater parameter efficiency than decoder-only models at the same compute budget, especially when dealing with multimodal inputs.
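For readers who want to see these components concretely, the following is a minimal, illustrative PyTorch sketch, not Reka's actual implementation: a SwiGLU feed-forward layer, a rotary position embedding helper, and grouped-query attention in which several query heads share each key/value head to shrink the KV cache at inference time. All dimensions and hyperparameters are arbitrary placeholders.

```python
# Illustrative sketch only: SwiGLU FFN, RoPE, and grouped-query attention.
import torch
import torch.nn as nn
import torch.nn.functional as F

class SwiGLU(nn.Module):
    """Gated linear unit with SiLU ('Swish') gating, as in GLU-variant FFNs."""
    def __init__(self, d_model: int, d_ff: int):
        super().__init__()
        self.w_gate = nn.Linear(d_model, d_ff, bias=False)
        self.w_up = nn.Linear(d_model, d_ff, bias=False)
        self.w_down = nn.Linear(d_ff, d_model, bias=False)

    def forward(self, x):
        return self.w_down(F.silu(self.w_gate(x)) * self.w_up(x))

def rope(x, base: float = 10000.0):
    """Apply rotary position embeddings to a (batch, heads, seq, head_dim) tensor."""
    *_, t, d = x.shape
    half = d // 2
    freqs = base ** (-torch.arange(half, dtype=x.dtype) / half)   # (half,)
    angles = torch.arange(t, dtype=x.dtype)[:, None] * freqs      # (seq, half)
    cos, sin = angles.cos(), angles.sin()
    x1, x2 = x[..., :half], x[..., half:]
    return torch.cat([x1 * cos - x2 * sin, x1 * sin + x2 * cos], dim=-1)

class GQAttention(nn.Module):
    """Grouped-query attention: query heads share a smaller set of K/V heads,
    cutting the KV-cache size and speeding up decoding."""
    def __init__(self, d_model: int = 512, n_q_heads: int = 8, n_kv_heads: int = 2):
        super().__init__()
        self.n_q, self.n_kv = n_q_heads, n_kv_heads
        self.d_head = d_model // n_q_heads
        self.wq = nn.Linear(d_model, n_q_heads * self.d_head, bias=False)
        self.wk = nn.Linear(d_model, n_kv_heads * self.d_head, bias=False)
        self.wv = nn.Linear(d_model, n_kv_heads * self.d_head, bias=False)
        self.wo = nn.Linear(n_q_heads * self.d_head, d_model, bias=False)

    def forward(self, x):
        b, t, _ = x.shape
        q = self.wq(x).view(b, t, self.n_q, self.d_head).transpose(1, 2)
        k = self.wk(x).view(b, t, self.n_kv, self.d_head).transpose(1, 2)
        v = self.wv(x).view(b, t, self.n_kv, self.d_head).transpose(1, 2)
        q, k = rope(q), rope(k)
        # Repeat each K/V head so that a group of query heads shares it.
        k = k.repeat_interleave(self.n_q // self.n_kv, dim=1)
        v = v.repeat_interleave(self.n_q // self.n_kv, dim=1)
        out = F.scaled_dot_product_attention(q, k, v, is_causal=True)
        return self.wo(out.transpose(1, 2).reshape(b, t, -1))

x = torch.randn(2, 16, 512)
print(GQAttention()(x).shape)        # torch.Size([2, 16, 512])
print(SwiGLU(512, 2048)(x).shape)    # torch.Size([2, 16, 512])
```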
THE MULTIMODAL FRONTIER AND EVALUATION CHALLENGES
The AI field is increasingly trending towards early fusion in multimodal models, integrating different modalities from the outset for deeper understanding, a direction Reka and players like OpenAI (GPT-4o) are pursuing. Tay believes that while late fusion will persist due to practical constraints, early fusion represents the more robust long-term approach. He also notes the critical need for better evaluation benchmarks, especially for long-context models and multimodal capabilities, to guide progress effectively and avoid the contamination plaguing current benchmarks.
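As a rough illustration of that distinction (a toy sketch, not modeled on any particular lab's system), early fusion projects image patches into the same embedding space as text tokens and runs one shared Transformer over the combined sequence, whereas late fusion encodes each modality separately and only merges pooled representations at the end:

```python
# Toy contrast between early and late fusion for a text + image input.
import torch
import torch.nn as nn

d = 256
text_tokens = torch.randint(0, 1000, (1, 12))        # toy token ids
image_patches = torch.randn(1, 49, 768)              # toy ViT-style patch features

text_embed = nn.Embedding(1000, d)
patch_proj = nn.Linear(768, d)                       # project patches into token space

# --- Early fusion: one shared sequence, one shared model ---
fused_seq = torch.cat([patch_proj(image_patches), text_embed(text_tokens)], dim=1)
shared_layer = nn.TransformerEncoderLayer(d_model=d, nhead=4, batch_first=True)
early_out = shared_layer(fused_seq)                  # every layer sees both modalities

# --- Late fusion: separate encoders, merged only at the end ---
text_enc = nn.TransformerEncoderLayer(d_model=d, nhead=4, batch_first=True)
img_enc = nn.TransformerEncoderLayer(d_model=d, nhead=4, batch_first=True)
text_vec = text_enc(text_embed(text_tokens)).mean(dim=1)
img_vec = img_enc(patch_proj(image_patches)).mean(dim=1)
late_out = torch.cat([text_vec, img_vec], dim=-1)    # modalities interact only here

print(early_out.shape, late_out.shape)               # (1, 61, 256) (1, 512)
```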
SCALING LAWS, EFFICIENCY, AND FUTURE TRAJECTORIES
Tay views the Chinchilla scaling laws as a widely misunderstood guideline rather than a hard limit, noting that models like LLaMA 3 are trained far beyond the Chinchilla-optimal token-to-parameter ratio. He advocates for a holistic view of efficiency, considering not just active parameters or theoretical FLOPs but also practical throughput, inference speed, and serving costs. He is bullish on Mixture-of-Experts (MoE) architectures for their favorable compute-to-parameter ratio, believing they are a key enabler for continued scaling, though the nuances of their impact on capabilities beyond performance metrics remain an active research question.
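A quick back-of-the-envelope calculation makes both points concrete. The constants below are approximate, publicly reported figures (Chinchilla's roughly 20-tokens-per-parameter rule of thumb, the roughly 15T training tokens reported for Llama 3, and Mixtral-style total versus active parameter counts), not numbers taken from the episode:

```python
# Back-of-the-envelope arithmetic for the scaling-law and MoE claims above.

def chinchilla_optimal_tokens(params: float, tokens_per_param: float = 20.0) -> float:
    """Chinchilla's rule of thumb: roughly 20 training tokens per parameter."""
    return params * tokens_per_param

llama3_8b_params = 8e9
llama3_8b_tokens = 15e12          # ~15T tokens reported for Llama 3
optimal = chinchilla_optimal_tokens(llama3_8b_params)
print(f"Chinchilla-optimal budget for 8B params: {optimal / 1e12:.2f}T tokens")
print(f"Llama 3 8B trained on roughly {llama3_8b_tokens / optimal:.0f}x that budget")

# MoE 'cheats' the parameter count: only a few experts are active per token,
# so FLOPs scale with active parameters while capacity scales with total ones.
total_params = 47e9               # roughly a Mixtral-8x7B-sized total
active_params = 13e9              # roughly what each token actually touches
print(f"MoE parameters per unit of compute: {total_params / active_params:.1f}x dense")
```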
THE OPEN VERSUS CLOSED-SOURCE DEBATE
Tay draws a distinction between open-weight models released by large labs (like Meta's LLaMA) and grassroots, bottom-up open-source efforts. While not inherently against open source, he observes that many community-driven initiatives often rely on rebranding existing models or quick wins through fine-tuning, which may not lead to fundamental advancements. He suggests that a lack of substantial reward signals for purely derivative open-source work could limit its long-term impact compared to the sustained, resource-intensive development seen in closed-source models.
Common Questions
How did Yi Tay move from task-specific NLP research to large foundation models?
Yi Tay's transition happened organically as the field evolved. He optimized for impactful and promising areas, collaborating with influential people like Jason Wei. The widespread adoption of GPT-3 and ChatGPT significantly shifted the research meta from task-specific fine-tuning to universal foundation models.
Mentioned in this video
Yi Tay was the architecture co-lead on PaLM 2 at Google Brain, a significant company-wide effort in large language model development.
Yi Tay was a core contributor to Flan at Google Brain, a project mainly led by Hieu Pham and Sharan Narang focusing on instruction tuning.
Mentioned as an example of models that researchers were fine-tuning in late 2019, before the focus shifted entirely to foundational models.
Models released by Reka AI, which achieved state-of-the-art results even against larger models from bigger labs, showcasing the team's ability to achieve high performance with efficient cycles.
Yi Tay contributed to the original PaLM and was a co-lead for the modeling workstream of PaLM 2.
Hypothetical example of a Transformer alternative that may show good performance at small scales but whose implications for larger models are unknown.
Yi Tay is the inventor of UL2, which he started as a personal project during a break; it became the largest encoder-decoder model Google had released at the time.
The launch of ChatGPT in November 2022 drastically changed the AI research landscape, making previous task-specific work largely obsolete and accelerating the focus on general-purpose models.
Used by Reka AI for some orchestration tasks, but Yi Tay notes that generalized orchestration tools for ML experimentation are still lacking in open source.
GPT-3's release marked the emergence of few-shot and in-context learning, changing the paradigm of large language model research.
A series of small language models from Microsoft Research that aims to achieve strong performance with significantly less data, though Yi Tay questions whether they can truly 'cheat' scaling laws.
A model whose architecture is speculated to have pluggable experts, particularly for vision, suggesting a modular approach to multimodal capabilities.
A pre-print server where Yi Tay used to frequently browse new papers at 9:30 AM Singapore time to stay updated on research.
An early general-purpose model from Google Brain, mentioned by Yi Tay as an example of the shift towards universal foundation models, even before the general public caught up.
A coding benchmark noted as being saturated and contaminated, similar to GSM8K, making it less effective for truly evaluating new model capabilities.
Discussed as a platform for researchers to market their work and a place where misunderstandings or undue scrutiny can arise from tweets.
A company mentioned for its contributions to Mixture-of-Experts (MoE) models.
Yi Tay is the Chief Scientist at Reka, an AI company that announced a $58 million Series A funding round in June 2023, with stated goals including universal intelligence, multimodal, and multilingual agents, self-improving AI, and model efficiency.
Several of Reka AI's co-founders collaborated at DeepMind before starting Reka AI, while Yi Tay joined from Google Brain.
Meta has developed a strong stack for training large language models, evident in the success of the LLaMA series, particularly LLaMA 3.
The Hugging Face leaderboard is cited as contributing to the proliferation of lightweight, derivative open-source models that often struggle to achieve significant advancements.
A company mentioned for releasing GPT-1, which significantly influenced the direction of AI research towards general-purpose models.
A position embedding method used in Transformer architectures, appreciated for its extrapolation properties in long context models.
A feed-forward activation function from a single-author paper by Noam Shazeer that was initially obscure but gained popularity after its use in T5 1.1, and which Yi Tay considers very effective.
An architectural approach that Yi Tay is bullish on, believing it to be a promising direction in terms of FLOP-to-parameter ratio, offering a way to 'cheat' traditional scaling laws by having more parameters at a low FLOP cost. He questions its tradeoffs in capabilities.
Yi Tay's early research at Google Brain focused on efficient Transformers, exploring alternatives to attention mechanisms and their broader applications.
A benchmark for mathematical reasoning that, along with others like HumanEval, is considered 'contaminated' or saturated, meaning high scores are no longer truly indicative of breakthrough performance.
A multimodal benchmark recently released, seen as a next step beyond MMLU but likely to have a limited lifespan before saturation.
A normalization technique used in Transformer architectures, noted as being non-controversial and a default choice in many models.
A paper on which Yi Tay advised, focusing on vision with some T5 experiments, exploring the early stages of sparse upcycling which later became relevant for MoE models.
An influential paper co-authored by Yi Tay and Jason Wei, whose central idea was primarily Jason Wei's thesis. It explored the unexpected capabilities that arise in large language models as they scale.
A paper that contested the claims of 'Emergent Abilities of Large Language Models,' arguing that some supposed emergent properties were artifacts of evaluation metrics. Yi Tay believes in emergence despite this critique.
A science fiction novel mentioned as an inspiration for describing the 'chaotic and stable phases' of compute providers during Reka AI's early training runs.
Yi Tay's alma mater, where he pursued his PhD, noting that the research community there is more focused on publishing papers than real-world impact.
Yi Tay previously worked at Google Brain for three years, where he was involved in significant AI research projects like PaLM 2, UL2, and Flan.
A close collaborator and friend of Yi Tay at Google, co-author on the 'Emergent Abilities' paper, known for his marketing prowess and influential ideas in AI research.
The first author of the Flan paper, described by Yi Tay as a great engineer, and someone from whom Yi Tay learned about systematic thinking and life perspectives.
Credited as the single author of the SwiGLU paper and a co-author on a paper exploring Transformer modifications, influencing Transformer architecture choices.
Yi Tay's former manager at Google, described as a research-oriented person with good intuition and taste in identifying impactful research, and more of a friend figure than a corporate manager.
An academic mentioned by Yi Tay as someone who has discussed the shift in academic research from application-specific tasks to universal models.
Co-founder and CEO of Adept, whose discussion on screen modality and its value highly impressed the host.
The H100 GPU series experienced major delays and supply chain issues, which significantly impacted Reka AI's initial training runs due to unreliable hardware.
Google's proprietary AI accelerator. Yi Tay mentions the significant difference in TPU experience inside versus outside Google due to infrastructure and deprecated codebases.