The 10,000x Yolo Researcher Metagame — with Yi Tay of Reka
Yi Tay of Reka discusses LLM research, Reka's models, Google Brain experience, and AI trends.
Key Insights
Reka Core's success demonstrates that smaller, well-funded labs can compete with giants in LLM development.
The transition in AI research has shifted from task-specific fine-tuning to large, general-purpose foundation models.
Yi Tay's career path highlights prioritizing impactful, promising research over rigid planning or trend-chasing.
Compute reliability and efficient orchestration are significant operational challenges in LLM training.
Mixture-of-Experts (MoE) architectures are a promising direction for balancing performance and computational cost.
The debate between open-source and closed-source LLMs is complex, with incentives playing a major role.
JOURNEY FROM GOOGLE BRAIN TO REKA
Yi Tay built his reputation at Google Brain, where he co-led PaLM 2, invented UL2, and contributed significantly to efforts like Flan and the Bard core team. His move to Reka in March 2023, followed by a $58 million Series A announced in June 2023, signals a strategic bet on building universal intelligence through multimodal and multilingual agents. Reka's rapid model releases, including Flash, Core, and Edge, highlight its ambitious goals around self-improving AI and model efficiency, underscoring a sharp focus on impactful research.
EVOLUTION OF LLM RESEARCH PARADIGMS
Tay observes a significant shift in machine learning research, moving from task-specific fine-tuning of models like T5 and BERT to the current paradigm of large, general-purpose foundation models. He notes that while the underlying principles of Transformer architecture and research haven't fundamentally changed, the scale of compute and data has dramatically increased. This evolution, accelerated by events like the ChatGPT launch, has redefined AI research goals towards universal intelligence rather than domain-specific optimizations.
STRATEGIC CAREER GROWTH AND RESEARCH PHILOSOPHY
Reflecting on his career, Tay emphasizes a philosophy of optimizing for impact and promising research rather than proactively chasing trends. His involvement in PaLM 2's development stemmed organically from the success of his personal project, UL2. This approach, combined with strong collaborations and an open-minded adaptability to the rapidly shifting field, allowed him to navigate complex research landscapes and contribute to significant breakthroughs.
CHALLENGES IN LLM INFRASTRUCTURE AND TRAINING
Building state-of-the-art LLMs at Reka surfaced significant operational challenges, particularly around compute reliability. Tay describes the frustrating experience of GPU delivery delays and unreliable hardware, which significantly impacted training runs. The decision to use GPUs over TPUs came down to team familiarity and the infrastructure available outside Google. He argues that compute providers should offer better risk-sharing models: brittle infrastructure can devastate a startup by wasting precious training time and resources, and the constant anxiety takes a toll on work-life balance.
ARCHITECTURAL INNOVATIONS AND MODEL DESIGN
Reka's models, such as Reka Core and Flash, incorporate advanced architectural choices, including aspects of the 'Noam architecture': gated linear units (GLU variants like SwiGLU), grouped-query attention (GQA), and RoPE position embeddings. Tay considers GQA a no-brainer for its inference benefits and appreciates RoPE for its extrapolation properties. He also discusses the nuanced benefits of encoder-decoder architectures, noting that their intrinsic sparsity allows greater parameter efficiency than decoder-only models at the same compute budget, especially when dealing with multimodal inputs.
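For readers who want to see these components concretely, the following is a minimal, illustrative PyTorch sketch, not Reka's actual implementation: a SwiGLU feed-forward layer, a rotary position embedding helper, and grouped-query attention in which several query heads share each key/value head to shrink the KV cache at inference time. All dimensions and hyperparameters are arbitrary placeholders.

```python
# Illustrative sketch only: SwiGLU FFN, RoPE, and grouped-query attention.
import torch
import torch.nn as nn
import torch.nn.functional as F

class SwiGLU(nn.Module):
    """Gated linear unit with SiLU ('Swish') gating, as in GLU-variant FFNs."""
    def __init__(self, d_model: int, d_ff: int):
        super().__init__()
        self.w_gate = nn.Linear(d_model, d_ff, bias=False)
        self.w_up = nn.Linear(d_model, d_ff, bias=False)
        self.w_down = nn.Linear(d_ff, d_model, bias=False)

    def forward(self, x):
        return self.w_down(F.silu(self.w_gate(x)) * self.w_up(x))

def rope(x, base: float = 10000.0):
    """Apply rotary position embeddings to a (batch, heads, seq, head_dim) tensor."""
    *_, t, d = x.shape
    half = d // 2
    freqs = base ** (-torch.arange(half, dtype=x.dtype) / half)   # (half,)
    angles = torch.arange(t, dtype=x.dtype)[:, None] * freqs      # (seq, half)
    cos, sin = angles.cos(), angles.sin()
    x1, x2 = x[..., :half], x[..., half:]
    return torch.cat([x1 * cos - x2 * sin, x1 * sin + x2 * cos], dim=-1)

class GQAttention(nn.Module):
    """Grouped-query attention: query heads share a smaller set of K/V heads,
    cutting the KV-cache size and speeding up decoding."""
    def __init__(self, d_model: int = 512, n_q_heads: int = 8, n_kv_heads: int = 2):
        super().__init__()
        self.n_q, self.n_kv = n_q_heads, n_kv_heads
        self.d_head = d_model // n_q_heads
        self.wq = nn.Linear(d_model, n_q_heads * self.d_head, bias=False)
        self.wk = nn.Linear(d_model, n_kv_heads * self.d_head, bias=False)
        self.wv = nn.Linear(d_model, n_kv_heads * self.d_head, bias=False)
        self.wo = nn.Linear(n_q_heads * self.d_head, d_model, bias=False)

    def forward(self, x):
        b, t, _ = x.shape
        q = self.wq(x).view(b, t, self.n_q, self.d_head).transpose(1, 2)
        k = self.wk(x).view(b, t, self.n_kv, self.d_head).transpose(1, 2)
        v = self.wv(x).view(b, t, self.n_kv, self.d_head).transpose(1, 2)
        q, k = rope(q), rope(k)
        # Repeat each K/V head so that a group of query heads shares it.
        k = k.repeat_interleave(self.n_q // self.n_kv, dim=1)
        v = v.repeat_interleave(self.n_q // self.n_kv, dim=1)
        out = F.scaled_dot_product_attention(q, k, v, is_causal=True)
        return self.wo(out.transpose(1, 2).reshape(b, t, -1))

x = torch.randn(2, 16, 512)
print(GQAttention()(x).shape)        # torch.Size([2, 16, 512])
print(SwiGLU(512, 2048)(x).shape)    # torch.Size([2, 16, 512])
```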
THE MULTIMODAL FRONTIER AND EVALUATION CHALLENGES
The AI field is increasingly trending towards early fusion in multimodal models, integrating different modalities from the outset for deeper understanding, a direction Reka and players like OpenAI (GPT-4o) are pursuing. Tay believes that while late fusion will persist due to practical constraints, early fusion represents the more robust long-term approach. He also notes the critical need for better evaluation benchmarks, especially for long-context models and multimodal capabilities, to guide progress effectively and avoid the contamination plaguing current benchmarks.
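As a rough illustration of that distinction (a toy sketch, not modeled on any particular lab's system), early fusion projects image patches into the same embedding space as text tokens and runs one shared Transformer over the combined sequence, whereas late fusion encodes each modality separately and only merges pooled representations at the end:

```python
# Toy contrast between early and late fusion for a text + image input.
import torch
import torch.nn as nn

d = 256
text_tokens = torch.randint(0, 1000, (1, 12))        # toy token ids
image_patches = torch.randn(1, 49, 768)              # toy ViT-style patch features

text_embed = nn.Embedding(1000, d)
patch_proj = nn.Linear(768, d)                       # project patches into token space

# --- Early fusion: one shared sequence, one shared model ---
fused_seq = torch.cat([patch_proj(image_patches), text_embed(text_tokens)], dim=1)
shared_layer = nn.TransformerEncoderLayer(d_model=d, nhead=4, batch_first=True)
early_out = shared_layer(fused_seq)                  # every layer sees both modalities

# --- Late fusion: separate encoders, merged only at the end ---
text_enc = nn.TransformerEncoderLayer(d_model=d, nhead=4, batch_first=True)
img_enc = nn.TransformerEncoderLayer(d_model=d, nhead=4, batch_first=True)
text_vec = text_enc(text_embed(text_tokens)).mean(dim=1)
img_vec = img_enc(patch_proj(image_patches)).mean(dim=1)
late_out = torch.cat([text_vec, img_vec], dim=-1)    # modalities interact only here

print(early_out.shape, late_out.shape)               # (1, 61, 256) (1, 512)
```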
SCALING LAWS, EFFICIENCY, AND FUTURE TRAJECTORIES
Tay views the Chinchilla scaling laws as a widely misunderstood guideline rather than a hard limit, noting that models like LLaMA 3 are trained far beyond the Chinchilla-optimal token-to-parameter ratio. He advocates for a holistic view of efficiency, considering not just active parameters or theoretical FLOPs but also practical throughput, inference speed, and serving costs. He is bullish on Mixture-of-Experts (MoE) architectures for their favorable compute-to-parameter ratio, believing they are a key enabler for continued scaling, though the nuances of their impact on capabilities beyond performance metrics remain an active research question.
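A quick back-of-the-envelope calculation makes both points concrete. The constants below are approximate, publicly reported figures (Chinchilla's roughly 20-tokens-per-parameter rule of thumb, the roughly 15T training tokens reported for Llama 3, and Mixtral-style total versus active parameter counts), not numbers taken from the episode:

```python
# Back-of-the-envelope arithmetic for the scaling-law and MoE claims above.

def chinchilla_optimal_tokens(params: float, tokens_per_param: float = 20.0) -> float:
    """Chinchilla's rule of thumb: roughly 20 training tokens per parameter."""
    return params * tokens_per_param

llama3_8b_params = 8e9
llama3_8b_tokens = 15e12          # ~15T tokens reported for Llama 3
optimal = chinchilla_optimal_tokens(llama3_8b_params)
print(f"Chinchilla-optimal budget for 8B params: {optimal / 1e12:.2f}T tokens")
print(f"Llama 3 8B trained on roughly {llama3_8b_tokens / optimal:.0f}x that budget")

# MoE 'cheats' the parameter count: only a few experts are active per token,
# so FLOPs scale with active parameters while capacity scales with total ones.
total_params = 47e9               # roughly a Mixtral-8x7B-sized total
active_params = 13e9              # roughly what each token actually touches
print(f"MoE parameters per unit of compute: {total_params / active_params:.1f}x dense")
```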
THE OPEN VERSUS CLOSED-SOURCE DEBATE
Tay draws a distinction between open-weight models released by large labs (like Meta's LLaMA) and grassroots, bottom-up open-source efforts. While not inherently against open source, he observes that many community-driven initiatives often rely on rebranding existing models or quick wins through fine-tuning, which may not lead to fundamental advancements. He suggests that a lack of substantial reward signals for purely derivative open-source work could limit its long-term impact compared to the sustained, resource-intensive development seen in closed-source models.
Common Questions
How did Yi Tay move from task-specific NLP research to large foundation models?
Yi Tay's transition happened organically as the field evolved. He optimized for impactful and promising areas, collaborating with influential people like Jason Wei. The widespread adoption of GPT-3 and ChatGPT significantly shifted the research meta from task-specific fine-tuning to universal foundation models.
Mentioned in this video
Yi Tay was the architecture co-lead on PaLM 2 at Google Brain, a significant company-wide effort in large language model development.
Yi Tay was a core contributor to Flan at Google Brain, a project mainly led by Hieu Pham and Sharan Narang focusing on instruction tuning.
Mentioned as an example of models that researchers were fine-tuning in late 2019, before the focus shifted entirely to foundational models.
Models released by Reka AI, which achieved state-of-the-art results even against larger models from bigger labs, showcasing the team's ability to achieve high performance with efficient cycles.
Yi Tay contributed to the original PaLM and was a co-lead for the modeling workstream of PaLM 2.
Hypothetical example of a Transformer alternative that may show good performance at small scales but whose implications for larger models are unknown.
Yi Tay is the inventor of UL2, which he started as a personal project during a break; it became the largest encoder-decoder model Google had released at the time.
The launch of ChatGPT in November 2022 drastically changed the AI research landscape, making previous task-specific work largely obsolete and accelerating the focus on general-purpose models.
Used by Reka AI for some orchestration tasks, but Yi Tay notes that generalized orchestration tools for ML experimentation are still lacking in open source.
GPT-3's release marked the emergence of few-shot and in-context learning, changing the paradigm of large language model research.
A series of small language models from Microsoft Research that aims to achieve strong performance with significantly less data, though Yi Tay questions whether they can truly 'cheat' scaling laws.
A model whose architecture is speculated to have pluggable experts, particularly for vision, suggesting a modular approach to multimodal capabilities.
A pre-print server where Yi Tay used to frequently browse new papers at 9:30 AM Singapore time to stay updated on research.
An early general-purpose model from Google Brain, mentioned by Yi Tay as an example of the shift towards universal foundation models, even before the general public caught up.
A coding benchmark noted as being saturated and contaminated, similar to GSM8K, making it less effective for truly evaluating new model capabilities.
Discussed as a platform for researchers to market their work and a place where misunderstandings or undue scrutiny can arise from tweets.
A company mentioned for its contributions to Mixture-of-Experts (MoE) models.
Yi Tay is the Chief Scientist at Reka, an AI company that announced a $58 million Series A funding round in June 2023, with stated goals including universal intelligence, multimodal, and multilingual agents, self-improving AI, and model efficiency.
Several of Reka AI's co-founders collaborated at DeepMind before starting Reka AI, while Yi Tay joined from Google Brain.
Meta has developed a strong stack for training large language models, evident in the success of the LLaMA series, particularly LLaMA 3.
The Hugging Face leaderboard is cited as contributing to the proliferation of lightweight, derivative open-source models that often struggle to achieve significant advancements.
A company mentioned for releasing GPT-1, which significantly influenced the direction of AI research towards general-purpose models.
A position embedding method used in Transformer architectures, appreciated for its extrapolation properties in long context models.
A feed-forward activation function from a single-author paper by Noam Shazeer that was initially obscure but gained popularity after its use in T5 1.1, and which Yi Tay considers very effective.
An architectural approach that Yi Tay is bullish on, believing it to be a promising direction in terms of FLOP-to-parameter ratio, offering a way to 'cheat' traditional scaling laws by having more parameters at a low FLOP cost. He questions its tradeoffs in capabilities.
Yi Tay's early research at Google Brain focused on efficient Transformers, exploring alternatives to attention mechanisms and their broader applications.
A benchmark for mathematical reasoning that, along with others like HumanEval, is considered 'contaminated' or saturated, meaning high scores are no longer truly indicative of breakthrough performance.
A multimodal benchmark recently released, seen as a next step beyond MMLU but likely to have a limited lifespan before saturation.
A normalization technique used in Transformer architectures, noted as being non-controversial and a default choice in many models.
A paper on which Yi Tay advised, focusing on vision with some T5 experiments, exploring the early stages of sparse upcycling which later became relevant for MoE models.
An influential paper co-authored by Yi Tay and Jason Wei, whose central idea was primarily Jason Wei's thesis. It explored the unexpected capabilities that arise in large language models as they scale.
A paper that contested the claims of 'Emergent Abilities of Large Language Models,' arguing that some supposed emergent properties were artifacts of evaluation metrics. Yi Tay believes in emergence despite this critique.
A science fiction novel mentioned as an inspiration for describing the 'chaotic and stable phases' of compute providers during Reka AI's early training runs.
Yi Tay's alma mater, where he pursued his PhD, noting that the research community there is more focused on publishing papers than real-world impact.
Yi Tay previously worked at Google Brain for three years, where he was involved in significant AI research projects like PaLM 2, UL2, and Flan.
A close collaborator and friend of Yi Tay at Google, co-author on the 'Emergent Abilities' paper, known for his marketing prowess and influential ideas in AI research.
The first author of the Flan paper, described by Yi Tay as a great engineer, and someone from whom Yi Tay learned about systematic thinking and life perspectives.
Credited as the single author of the SwiGLU paper and a co-author on a paper exploring Transformer modifications, influencing Transformer architecture choices.
Yi Tay's former manager at Google, described as a research-oriented person with good intuition and taste in identifying impactful research, and more of a friend figure than a corporate manager.
An academic mentioned by Yi Tay as someone who has discussed the shift in academic research from application-specific tasks to universal models.
Co-founder and CEO of Adept, whose discussion on screen modality and its value highly impressed the host.
The H100 GPU series experienced major delays and supply chain issues, which significantly impacted Reka AI's initial training runs due to unreliable hardware.
Google's proprietary AI accelerator. Yi Tay mentions the significant difference in TPU experience inside versus outside Google due to infrastructure and deprecated codebases.