What was the core of the 2024 AI scaling debate?

The central debate revolved around whether pre-training scaling of large language models has hit a 'wall.' Prominent figures like Ilya Sutskever and Juergen Schmidhuber suggested that while compute continues to scale, the availability of high-quality data at the same rate is becoming a limiting factor, necessitating new approaches like inference-time compute.

How has the market share for frontier AI models shifted in 2024?

OpenAI initially dominated with 95% market share for GPT-3.5 and GPT-4 in late 2023. However, with the launch of Claude 3 and Gemini Flash offering aggressive pricing and competitive performance, OpenAI's market share has significantly dropped to an estimated 50-75% by late 2024, indicating a very competitive three-horse race.

What differentiates 'small models' today compared to previously?

Initially, 'small' meant 'cheap'. Now, it's a more nuanced discussion, encompassing models from 0.5B to 5B parameters. Large labs like Google (Gemma) and Apple (Foundation Models) are also producing efficient small models, challenging the open-source community's traditional focus on this segment, especially for on-device and cost-effective inference.

Why is 'inference time compute' gaining importance over 'pre-training compute'?

The emphasis is shifting from pre-training compute to inference compute optimality because models are hitting scaling walls in pre-training data and cost. Focusing on inference allows labs to convert fixed pre-training costs into variable customer costs, directly linking margin to usage and enabling more flexible business models.

What were the 'Four Wars of AI' in 2024?

The 'Four Wars' included: 1) Data Quality War (publishers vs. synthetic data), 2) GPU War (GPU-poor vs. GPU-rich), 3) Multimodality War (specialized vs. god models), and 4) LLMOps/RAGOps/Agent Tools War (tools for web browsing, code interpreting, memory, planning).

What is the importance of 'memory' for AI agents and its current state?

Memory is considered crucial for agents to remember past interactions and preferences over time, distinct from general knowledge retrieval. Current memory products are seen as immature, mostly limited to explicit summarization rather than implicit preference extraction or long-lived, portable memory across products.

How have AI benchmarks evolved in 2024?

Benchmarks have become more specialized and challenging. Last year's common benchmarks like MMLU and GSM8K are now saturated. New frontiers like SWE-bench, LiveBench, MMMU-Pro, AIME, and FrontierMath are emerging, focusing on more complex reasoning, multimodal, and robust evaluation tasks, indicating rapid progress in AI capabilities.

What key AI capabilities migrated from emerging to mature in 2024?

In 2024, general knowledge (MMLU), long context (128k+ tokens), basic RAG, batch transcription (Whisper), and code generation moved into the 'mature' category, meaning they are now widely reliable for production use. Vision language models and structured output are rapidly emerging.

What were some of the biggest AI product launches and news in early 2024 (Jan-May)?

Key events included Perplexity AI's funding round backed by Jeff Bezos, ChatGPT releasing memory, Claude 3's launch significantly shifting market share, Devin's high-profile but controversial agent launch, AI music generation tools like Suno and Udio, Llama 3's open-source release, and GPT-4o's multimodal 'omni-model' unveiling that included vision and voice.

What were major AI advancements in the second half of 2024 (June-Dec)?

Highlights include Apple Intelligence rolling out on iPhones, NVIDIA's continued growth, SSI's launch with Daniel Gross as CEO, OpenAI's o1 model release (previously codenamed Strawberry/Q*) with structured output, the general availability of ChatGPT Voice Mode, and the launch of ChatGPT Canvas, which marked OpenAI's foray into document editing and its fierce competition with Google.

What is the prediction for AI in 2025 regarding job skill floors?

The prediction is that 2025 will be the first year where AI sets the skill floor for many jobs. As AI agents become more capable (e.g., customer support, junior software engineering), humans in these roles will need to be better than the AI to remain economically viable.

Key Moments

2024 Year in Review: The Big Scaling Debate, the Four Wars of AI, Top Themes and the Rise of Agents

Latent Space Podcast

Science & Technology | 4 min read | 112 min video

Jan 1, 2025|3,851 views|88|16

TL;DR

2024 AI Recap: Scaling debate, AI wars (data, multi-modality), agents, and new benchmarks defined.

Key Insights

The AI engineering field has rapidly matured, with a growing demand for skilled engineers to productionize research.

A significant debate emerged regarding the limits of scaling large language models, with consensus shifting towards the need for new approaches beyond just larger pre-training.

The "Four Wars of AI" (data quality, GPU-rich vs. GPU-poor, multimodality, and RAG/ops and agent tooling) characterized the competitive landscape, highlighting shifts in market share and emerging capabilities.

The rise of AI agents and their integration into workflows is a major theme, with significant progress and anticipation for their widespread adoption in 2025.

The cost of AI intelligence has dramatically decreased, with significant improvements in efficiency and pricing, especially for smaller models and optimized inference.

New benchmarks and evaluation metrics are constantly being developed to keep pace with the rapid advancements in AI capabilities, particularly in areas like reasoning and multimodal understanding.

THE EVOLUTION OF AI ENGINEERING AND THE SCALING DEBATE

The podcast celebrates its 100th episode, reflecting on the explosive growth of AI engineering as a field. Initially a niche concept, AI engineering has become a recognized discipline, evidenced by its placement on Gartner's hype cycle and its prominence in industry discussions. A major theme of 2024 was the "scaling debate," questioning whether simply increasing model size and compute is still the most effective path forward. Several prominent researchers, including Ilya Sutskever, suggested that pre-training and data scaling might be hitting a wall, signaling a potential shift towards more efficient training methods and inference optimization. This marked a departure from the previous year's focus on raw scaling.

THE FOUR WARS OF AI: DATA, MULTIMODALITY, AND MARKET SHIFTS

The year was defined by several key competitive fronts, dubbed the 'Four Wars of AI.' The "data war" saw debates around data quality, synthetic data generation, and the ethical implications of using copyrighted material for training. The "multimodality war" heated up with significant advancements in video generation (Sora, Gen-2), image editing, and the integration of various modalities within single models (like Gemini 2.0). Market share also saw a notable shift, with OpenAI's dominance being challenged by Anthropic and Google's Gemini, particularly in the lower-cost inference tiers. The rise of smaller, more efficient models from large labs also became a significant trend, countering expectations of open-source dominance in this area.

THE ASCENSION OF AI AGENTS AND THEIR INTEGRATION

AI agents emerged as a central focus for 2025, building on the discussions from 2024. While some predicted 2024 to be the year of agents, the consensus is that their widespread productionization and adoption will truly ramp up in the coming year. Key challenges and research areas for agents include learning from the environment, extracting implicit business processes, and developing better instruction-following capabilities. The development of robust agent tooling, such as specialized SDKs, memory systems, and code interpretation capabilities, is seen as crucial for their success. Companies like Stripe and DeepMind are actively developing agentic systems, signaling strong industry belief in their potential.

INFERENCE OPTIMIZATION AND THE "GPU RICH" ECONOMY

The economics of AI shifted significantly, with a dramatic decrease in the cost of inference. While startups that relied on massive GPU clusters (GPU Rich) faced funding challenges, the "GPU Ultra Rich" labs continued massive investments. The efficiency gains have made AI capabilities more accessible, driven by optimized models and hardware. This trend also impacts the GPU market, with prices stabilizing as demand shifts towards more efficient solutions. The debate also touched on how consumers and businesses will access and utilize these capabilities, with a growing emphasis on on-device models and cost-effective cloud solutions.

ADVANCEMENTS IN BENCHMARKING AND EVALUATION

As AI capabilities rapidly evolve, so too do the methods for evaluating them. The year saw a shift away from older benchmarks like MMLU towards more specialized and rigorous evaluations in areas like reasoning, coding, and multimodal understanding. New benchmarks like SWE-bench, LiveBench, and AIME emerged, reflecting the cutting edge of AI research. The discussion highlighted the saturation of some benchmarks and the need for continuous innovation in evaluation methodologies to accurately capture the progress of frontier models and the narrowing gap between open-source and closed-source AI.

EMERGING CAPABILITIES AND THE FUTURE FRONTIER

The discussion explored the landscape of emerging AI capabilities, categorizing them into 'mature,' 'emerging,' and 'frontier.' Mature capabilities include general knowledge, improved long-context windows, and robust code generation. Emerging areas like vision-language models, real-time transcription, and sophisticated thinking processes are becoming more integrated into products. Frontier capabilities, such as advanced real-time voice interaction, on-device models, and more complex multimodal integration (video-to-audio sync), are on the horizon. The year also saw continued progress in areas like synthetic data, state-space models, and the development of robust tooling for agents, setting the stage for further breakthroughs.

LLM Performance and Pricing Evolution (2023-2024)

Data extracted from this episode

Model	Launch Period	Elo Score (Approx.)	Price per Million Tokens (Approx.)	Order of Magnitude Improvement (from Jan 2024)
GPT-4 2023	Early 2023	1175	$40-$50	Baseline
Claude 3 Haiku	March 2024	1175	$0.50	2 orders
Gemini 1.5 Pro	July/August 2024 (price cut)	1250+	$5	1 order
Amazon Nova (Pro, Lite, Micro)	Recent	1200-1300	$0.075	3 orders
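
The "orders of magnitude" column can be sanity-checked with a little arithmetic: an order-of-magnitude improvement is the base-10 logarithm of the price ratio between the baseline and the newer model. The sketch below is illustrative only. It assumes the midpoint of GPT-4's $40-$50 range ($45) as the baseline, which the episode does not specify, and the function names are hypothetical.

```python
import math

# Approximate prices per million tokens, copied from the table above.
# ASSUMPTION: GPT-4's $40-$50 range is represented by its midpoint.
BASELINE_PRICE = 45.0
PRICES = {
    "Claude 3 Haiku": 0.50,
    "Gemini 1.5 Pro": 5.0,
    "Amazon Nova": 0.075,
}


def orders_of_magnitude(baseline: float, price: float) -> float:
    """Return log10(baseline / price), the price drop in orders of magnitude."""
    return math.log10(baseline / price)


def rounded_orders(baseline: float, price: float) -> int:
    """Round the price drop to the nearest whole order of magnitude."""
    return round(orders_of_magnitude(baseline, price))


if __name__ == "__main__":
    for model, price in PRICES.items():
        exact = orders_of_magnitude(BASELINE_PRICE, price)
        print(f"{model}: {exact:.2f} orders (about {rounded_orders(BASELINE_PRICE, price)})")
```

Under that baseline assumption, the rounded values agree with the table's 2, 1, and 3 orders. The comparison covers price only; the Elo scores differ between rows, so the models are not strictly like-for-like.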

Common Questions

The 'AI engineer' role has gained significant traction, moving from a niche concept to a recognized field now topping Gartner's hype cycle. It's defined by applied AI, using models in production without necessarily needing a PhD, focusing on integrating research findings into practical applications.

Topics

AI Agents, AI & Machine Learning, Technology & Innovation, Business & Entrepreneurship, AI Scaling, Large Language Models, AI Engineering, Multimodal AI, AI Industry Trends, AI Benchmarks, GPU Investment

Mentioned in this video

Products

Bolt

A company that announced $20 million ARR, another example of a GPU-poor startup achieving significant growth, focusing on web container technology.

NVIDIA Blackwell

NVIDIA's next-generation GPU series slated for release, expected to accelerate AI development and maintain NVIDIA's market dominance.

An open-source inference model from Fireworks, along with QwQ from the Qwen team, noted as leading contenders in the inference space.

People

Jeff Bezos

Backed Perplexity AI, an important endorsement from a tech luminary who historically backed Google, suggesting Perplexity's significant potential.

Jensen Huang

NVIDIA's CEO, who praised xAI for efficiently spinning up a large GPU cluster, showcasing the trend of GPU-rich investments.

Andy Konwinski

Launched the K Prize with a metric similar to SWE-bench but arguably more useful, showing attempts to create better evaluations for AGI.

Jürgen Schmidhuber

OG in AI and creator of the LSTM, cited as another prominent figure stating that pre-training scaling has hit a wall or run into a different kind of wall.

Daniel Gross

Seems to have become the full-time CEO of SSI, focusing on shipping a single path to superintelligence, contrasting with OpenAI's approach of intermediate products.

Ilya Sutskever

OpenAI's former chief scientist, who publicly stated that pre-training and data scaling have hit a wall, supporting the 'no scaling' argument in a debate.

Leopold Aschenbrenner

Author of an essay on the trillion-dollar cluster, contributing to the discussion on the massive GPU investments by major AI labs.

Karina Nguyen

The person reportedly responsible for Artifacts and ChatGPT Canvas, who moved from Anthropic to OpenAI, a rare 'reverse move' in the industry.

Sam Altman

CEO of OpenAI, mentioned discussing GPT-5 in January, but ultimately OpenAI shipped other models like GPT-4o instead that year.

Scarlett Johansson

Her voice was controversially mimicked by OpenAI's 'Sky Voice' in GPT-4o demos, leading to public backlash and the feature's removal.

Companies

Morph Labs

Launched a 'time travel VM,' addressing the need for statefulness in AI agents, allowing unwinding or forking to explore different execution paths.

Ramp

A financial software company whose data is used to attribute market share shifts among large language models, indicating OpenAI's initial dominance and subsequent decline in market share.

CodeSandbox

A platform for front-end development, acquired by Together AI, and now offering its capabilities as an API, further enabling code interpreting for AI.

Anthropic

A frontier AI lab, part of the three-horse race (with Gemini and OpenAI), which has aggressively gained market share, particularly with the launch of Claude 3 and 3.5 Sonnet.

Cerebras

A company happy to serve large models like LLaMA 405B on their super large chips, though custom use is constrained.

DeepMind

Highlighted for its extensive background work in video modeling, including Genie, Gen 2, and VideoPoet, giving it an advantage in world modeling compared to other labs.

Suno

A GPU-poor startup rated as one of the fastest-growing companies, going from 0 to $20 million ARR while training on Modal, showcasing success without owning massive GPU infrastructure.

MosaicML

Acquired by Databricks for stock, initially valued around $2 billion; its valuation is now estimated to be significantly higher.

Modal

The platform Suno uses for its training, demonstrating how startups leverage GPU clouds to become successful without direct GPU ownership.

Databricks

A company that raised the largest venture round in history, roughly $10 billion, and had earlier acquired MosaicML in a deal the episode puts at over $2 billion, demonstrating significant investment and consolidation in the AI/data space.

Software & Apps

Llama.cpp

An open-source project by Georgi Gerganov that created bottom-up standardization, demonstrating community-driven protocol development in AI.

Claude 3 Haiku

An Anthropic model that achieved the same Elo as GPT-4 2023 but at a drastically reduced price of 50 cents per million tokens, representing a two-order-of-magnitude improvement.

Suno

An AI music generation tool that was highly praised and used for podcast intro songs, highlighting the creative applications of AI.

Hacker News

Mentioned as a platform where the concept of AI engineering was initially met with skepticism, serving as an indicator of its growing acceptance elsewhere.

Sora

A text-to-video model that was recently released by OpenAI, generating significant excitement but also highlighting the challenges in access and stability.

LangChain

A framework for developing applications powered by language models, noted for its continued growth in downloads and usage, distinguishing itself from projects with high stars but low practical adoption.

AutoGPT

A rapidly growing GitHub project known for over-promising generality (e.g., 'make me money'), leading to broad interest but low usage due to lack of focus and execution challenges.

ChatGPT Canvas

A document editing environment by OpenAI, released as part of their 12 Days of Shipmas, which supersedes Code Interpreter by allowing code writing and execution with better AI integration.

Gemma

A small foundation model from Google's Gemini, part of the 1 to 5 billion parameter size focus of small models.

Recraft V3

An image model that unexpectedly rose to be a top performer in the image arena by Artificial Analysis, surpassing established models like Flux 1.1.

CrewAI

An agent framework mentioned as having growing stars (GitHub likes) but flat usage, illustrating the gap between hype and practical adoption in some AI projects.

E2B

A company whose fundraising was soft-announced, operating in the code interpreting space by providing sandboxed environments for LLMs to run code.

Perplexity AI

An AI-powered answer engine, noted as a customer of E2B and for its maturing products that do complex tasks like producing financial charts.

Code Interpreter

An older OpenAI tool that was popular last year but has now been largely superseded by ChatGPT Canvas, which offers more advanced capabilities.

MMMU-Pro

A benchmark for multimodal AI models, highlighted as critical for evaluating frontier capabilities.

Gemini Nano

Google's on-device AI model, coming to Chrome with feature flags, indicating a push for GPU-poor friendly, local AI capabilities.

Magic.dev

Made waves teasing a 100 million-token model, contributing to the trend of expanding context windows in LLMs.

Stripe Agent Toolkit

An SDK wrapper on the Stripe API, intended to support agents, demonstrating that even non-AI companies are building tools for agent integration, indicating demand and belief in agents.

ChatGPT

A popular conversational AI model, increasingly viewed as a robust platform for AI agent development due to new features like Canvas and voice mode.

OpenRouter

A platform mentioned as an indicator of Gemini Flash's market share, with 50% of its requests going to Gemini Flash due to its aggressive pricing.

Apple Foundation Models

Foundation models from Apple, around 3 billion parameters, categorized as small models.

Grok's Aurora

Grok's own image generation model, launched after an initial partnership with Black Forest Labs; marks their foray into proprietary image generation.

LlamaIndex

A data framework for LLM applications, observed to be consistently growing in downloads, indicating real usage and commercial product stickiness.

Devin

An AI coding agent launched in March with a highly effective PR campaign but faced backlash over video realism and took 9 months to reach general availability, raising questions about its deliverability.

Jules

DeepMind's code agent, complementing their browser agent and other AI initiatives, signaling their focus on sophisticated agent development.

Anthropic's Model Context Protocol

A memory implementation released by Anthropic with only 300 lines of code, simple but effective, suggesting that core memory components might be handled by large labs.

ComfyUI

Another example of a community-driven standardization project from 'Comfy Anonymous,' highlighting alternative paths to standardization beyond large labs.

Apple Intelligence

Apple's AI rollout for iPhones, a 3B parameter foundation model running locally with hot-swappable LoRAs, demonstrating a focus on on-device AI.

RWKV

A type of state-space model that is being rolled out in Windows, contributing to the trend of on-device AI and alternative model architectures.

Claude 3

Anthropic's model release in March, which significantly shifted market share from OpenAI and established Claude as a strong frontier model family.

ChatGPT memory

Released in February, allowing ChatGPT to remember user conversations, indicating early steps toward better context retention.

Dropbox Dash

A search tool from Dropbox with Google Drive integration, highlighting how traditional cloud storage providers are evolving into AI-powered search solutions.

Amazon Nova

Amazon's series of LLMs (Pro, Lite, Micro) that sit on the efficient frontier for intelligence levels between 1200 and 1300 Elo, significantly driving down the cost of intelligence.

Artifacts App

Mentioned as a code-oriented equivalent to ChatGPT Canvas, designed for code interpretation and execution.

Udio

An AI music generation tool, similar to Suno, used to create songs and celebrated for its creative capabilities, demonstrating AI's impact on music production.

Llama 3

An open-source frontier model released in April, delivering on expectations with its 8B and 70B variants, making high-quality models accessible.

ChatGPT Voice Mode

A generally available feature in ChatGPT that allows users to converse with the AI using voice, highlighting the maturity of voice interaction in AI.

GPT-4o

Released in May, an 'omni-model' with native vision and voice capabilities, highly impactful for its efficiency and multimodal demos, despite the 'Sky Voice' controversy.

Project Mariner

DeepMind's browser agent, a 'computer use type thing,' demonstrating their active development in agentic capabilities.

NotebookLM

A product by Google. The podcast's episode on it is cited as one of its most popular, praised for its timeliness and excellent guests, and it achieved widespread social media attention.

Books

Chinchilla paper

A paper on compute-optimal training, but it's noted that it specifically refers to pre-training compute optimal training, highlighting a shift towards inference compute optimality.

Studies & Research

SWE-bench

A leading benchmark for AI models, with specific focus on 'SWE-bench Verified' and 'SWE-bench Multimodal,' indicating evolving metrics for frontier models.

Concepts

Apple Private Cloud Compute

Apple's announced secure cloud computing for its AI models, highlighting growing interest in secure and private AI inference for state-level interests.

LM Elo

A leaderboard or metric for language model performance, mentioned as significantly increasing across all models in 2023, indicating rapid innovation and competition.

Google Brain Compute Marketplace

A concept discussed for concentrating compute resources, emphasizing that one big training run is more valuable than many small ones.

LiveBench

A new benchmark mentioned alongside SWE-bench and MMMU-Pro as the latest metrics for evaluating frontier AI models.

AIME benchmark

A math-competition benchmark for AI models, used especially for reasoning tasks, indicating the shift in focus for evaluating advanced AI capabilities.

Organizations

Qwen Team

Affiliated with Alibaba, their QwQ model is noted as a leading contender in the open-source inference space, demonstrating contributions to new model architectures.
