How does reliability impact infrastructure costs?

Achieving very high reliability (e.g., five nines) often requires significant over-provisioning, so half the power capacity might sit unused at any given time. Developers training frontier models are increasingly willing to give up some of that reliability in exchange for roughly double the usable capacity.

Why is system balance crucial in AI infrastructure?

System balance ensures that compute (flops), memory bandwidth (HBM), and network bandwidth are appropriately matched. Without this balance, expensive compute resources can be starved for data, leading to wasted investment and low utilization.

What is Amdahl's Law and how does it apply to modern systems?

The principle attributed to Amdahl here dates to the 1960s and holds that every unit of compute needs a proportional amount of I/O. (This is his balanced-system rule of thumb, not the better-known Amdahl's Law on parallel speedup.) In modern systems, it extends to network bandwidth, HBM capacity, and other components, emphasizing that compute without data is useless.

How does Google ensure network reliability in large-scale clusters?

Google uses optical circuit switches to create programmable network topologies, allowing for rapid reconfiguration or replacement of faulty racks. This significantly improves availability compared to traditional packet-switched networks, especially for synchronous computations like ML training.

What is the typical hardware depreciation period for compute hardware at Google?

Google depreciates its compute hardware over a period of 6 years, which is considered standard across the industry. Older generation chips often continue to be used effectively beyond this period due to high demand.

What are the main challenges in scaling robotics capabilities?

While scaling laws might apply, robotics applications demand extreme safety and reliability, often requiring locality and low latency. This means they may not be able to rely on distant infrastructure or handle variability, limiting the scale of compute they can effectively use.

Why is energy a primary bottleneck for AI development?

Scaling energy production and distribution to meet the immense demands of AI development globally is a massive challenge. Many current solutions are expensive, brute-force, and require significant time for implementation, making energy abundance and affordability a critical bottleneck.

What are promising directions for addressing the energy bottleneck?

Exploring wind, solar, and battery technologies is crucial, alongside more forward-looking options like space-based solar power. A portfolio approach combining proven terrestrial methods with advanced concepts is recommended, focusing on manufacturing and scaling.

Will hardware ever stop being a bottleneck for AI innovation?

Based on current trends and historical AI development (like 'The Bitter Lesson'), it's unlikely hardware will stop being a bottleneck in the foreseeable future. Even with algorithmic breakthroughs, increased compute capacity tends to be fully utilized.

How can data centers be a positive asset for local communities?

Data centers can provide uplift by creating jobs, offering access to technology, and supporting the local grid through smart energy management such as demand response. Critical considerations include minimizing noise and water usage and putting community well-being ahead of power efficiency alone.

What advice is given to students about choosing technical problems to solve?

Students should pick problem domains they are intrinsically excited about, as passion is key to sustained effort. The field is vast, with important work across algorithms, hardware, operating systems, and more. Because predicting the future is inherently difficult, students are better off choosing what motivates them.

Key Moments

Stanford CS153 Frontier Systems | The Discipline of Delivering Value per Gigawatt

Stanford Online

Education | 6 min read | 65 min video

May 27, 2026|18,830 views|419|17

TL;DR

Building AI infrastructure costs billions per gigawatt, but simply having power isn't enough; the real challenge is efficiently delivering value per unit of energy with balanced, reliable systems.

Key Insights

Building 1 gigawatt of compute infrastructure can cost approximately $40 billion, making efficient utilization crucial.

At Google, node allocation below 96% is treated as a major outage, underscoring the emphasis on high utilization.

User-centric metrics such as daily active users (e.g., for Gemini) are replacing raw capacity metrics such as gigawatts.

Amdahl's Law, dating back to the 1960s, still dictates system balance, requiring sufficient IO (network bandwidth today) for every unit of compute.

Newer model architectures like Mixture of Experts (MoE) require higher memory bandwidth relative to computation, impacting current hardware balance.

The lead time for new gigawatt-scale data center capacity is now 2-3 years, necessitating long-term, accurate capacity planning.

The staggering cost and critical need for efficiency in AI infrastructure

Amin Vahdat, who leads Google's compute infrastructure and TPU program, emphasizes that the cost of AI infrastructure is astronomical, with 1 gigawatt of build-out estimated at around $40 billion. Given this investment, high utilization and effective delivery of value are paramount. Google's internal metrics reflect this: node allocation below 96% is considered a major outage. The focus has shifted from simply accumulating capacity (gigawatts) to maximizing the capability and value delivered to users for every unit of energy consumed. A gigawatt isn't just a gigawatt; its reliability and its goodput (the useful computation that actually gets done) matter just as much. If individual components fail, the entire computation can halt, and the capacity delivers nothing while it is stalled. The true measure of success is therefore 'value per dollar,' not just 'dollars per gigawatt,' which implies that delivering more value with less capacity is the ultimate goal.
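To make the "a gigawatt isn't just a gigawatt" point concrete, here is a minimal sketch of how allocation and goodput inflate the effective cost of capacity. The $40 billion per gigawatt figure comes from the talk; the function name, the multiplicative model, and the goodput values are illustrative assumptions, not Google's accounting.

```python
def cost_per_useful_gw(capex_per_gw_usd: float, allocation: float, goodput: float) -> float:
    """Capital cost per gigawatt that actually does useful work.

    Assumes (hypothetically) that useful capacity is simply
    provisioned capacity * allocation * goodput.
    """
    if not (0.0 < allocation <= 1.0 and 0.0 < goodput <= 1.0):
        raise ValueError("allocation and goodput must be in (0, 1]")
    return capex_per_gw_usd / (allocation * goodput)


CAPEX_PER_GW_USD = 40e9  # approximate build-out cost quoted in the talk

# Goodput values below are made up purely for illustration.
healthy = cost_per_useful_gw(CAPEX_PER_GW_USD, allocation=0.96, goodput=0.90)
degraded = cost_per_useful_gw(CAPEX_PER_GW_USD, allocation=0.96, goodput=0.50)

print(f"healthy fleet:  ${healthy / 1e9:.1f}B per useful GW")
print(f"degraded fleet: ${degraded / 1e9:.1f}B per useful GW")
```

Under these assumed numbers, halving goodput nearly doubles the price of every useful gigawatt, which is why reliability shows up directly in the value-per-dollar metric.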

Shifting metrics from capacity to user value

The conversation highlights a fundamental shift in how infrastructure success is measured. Instead of focusing on the sheer number of gigawatts or TPUs deployed, the emphasis is now on tangible business outcomes and user impact. For services like Gemini, the key metric becomes daily active users (DAUs) and user satisfaction, rather than the underlying power consumption. This user-centric approach is seen as the ultimate evaluation because, in a competitive market, users will naturally gravitate towards services that provide them with value. The challenge for infrastructure designers is to build general primitives that enable these outcomes without being over-specified for any single application, aligning infrastructure capabilities with emergent user needs and market demands.

System balance and reliability are the new frontiers

Achieving high utilization and value requires more than abundant compute; it demands a deep understanding of system balance and reliability. Vahdat likens the infrastructure challenge to building a coordinated supercomputer, where having enough compute (flops) is only one piece of the puzzle. High Bandwidth Memory (HBM), SRAM, and network bandwidth must all be in proportion. He draws a parallel to Amdahl's Law from the 1960s, which he describes as stating that every million instructions per second needed a certain amount of I/O (megabytes per second). Today, this translates to provisioning adequate network bandwidth for every unit of compute. When the balance is off, vast amounts of expensive compute sit idle, starved for data. This is particularly relevant for newer architectures such as Mixture of Experts (MoE) and sparse computation, which often demand more memory bandwidth per unit of compute and so disrupt traditional hardware balance points.
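One common way to reason about this kind of balance is a roofline-style estimate, sketched below. The roofline model is a standard technique rather than something the talk describes, and the `Accelerator` class and its numbers are hypothetical, not specifications of any real TPU or GPU.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Accelerator:
    peak_flops: float     # peak compute, FLOP/s
    hbm_bandwidth: float  # memory bandwidth, bytes/s


def machine_balance(acc: Accelerator) -> float:
    """FLOPs the chip can perform per byte it can fetch from HBM."""
    return acc.peak_flops / acc.hbm_bandwidth


def attainable_flops(acc: Accelerator, arithmetic_intensity: float) -> float:
    """Roofline bound: limited by compute or by memory traffic, whichever is lower.

    arithmetic_intensity is the workload's FLOPs per byte moved.
    """
    return min(acc.peak_flops, arithmetic_intensity * acc.hbm_bandwidth)


# Hypothetical chip: 1 PFLOP/s of compute and 2 TB/s of HBM bandwidth.
chip = Accelerator(peak_flops=1e15, hbm_bandwidth=2e12)

for name, intensity in [("dense matmul (assumed)", 1000.0), ("sparse / MoE-like (assumed)", 100.0)]:
    fraction = attainable_flops(chip, intensity) / chip.peak_flops
    print(f"{name}: at most {fraction:.0%} of peak")
```

A workload whose arithmetic intensity falls below the machine balance is memory-bound: extra flops are wasted unless bandwidth grows with them.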

The changing landscape of reliability and availability

Historically, achieving five nines (99.999%) of availability was critical for enterprise-grade services. It often required significant over-provisioning of power and redundant systems (e.g., 2N configurations), where half the capacity might be idle at any given time. For frontier model training, however, a paradigm shift is occurring: customers are increasingly willing to trade raw reliability for increased capacity. For instance, they may accept 99.9% availability (about 8.8 hours of downtime per year) in exchange for double the computational resources. This is a significant departure from traditional IT infrastructure philosophy. It is driven by the understanding that for large-scale, throughput-oriented tasks like training, occasional downtime is acceptable if more overall compute is available the rest of the time. Novel solutions like optical circuit switches are used to rapidly reconfigure network topologies and isolate failed racks, minimizing downtime when failures do occur.
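The arithmetic behind this trade-off is easy to check. The sketch below converts availability targets into yearly downtime and compares usable capacity for a 2N design against a non-redundant one. The function names and the simple model (usable capacity = provisioned capacity / redundancy factor × availability) are assumptions for illustration only.

```python
HOURS_PER_YEAR = 365 * 24  # 8760, ignoring leap years


def downtime_hours_per_year(availability: float) -> float:
    """Expected hours of downtime per year at a given availability."""
    return (1.0 - availability) * HOURS_PER_YEAR


def usable_capacity_gw(provisioned_gw: float, redundancy_factor: float, availability: float) -> float:
    """Average usable capacity; a 2N design has redundancy_factor = 2."""
    return provisioned_gw / redundancy_factor * availability


for label, a in [("99.999%", 0.99999), ("99.9%", 0.999), ("99%", 0.99)]:
    print(f"{label}: {downtime_hours_per_year(a):.2f} hours of downtime per year")

# Same 1 GW of provisioned power, two design points (simplified model).
five_nines_2n = usable_capacity_gw(1.0, redundancy_factor=2.0, availability=0.99999)
three_nines_n = usable_capacity_gw(1.0, redundancy_factor=1.0, availability=0.999)
print(f"capacity ratio: {three_nines_n / five_nines_2n:.2f}x")
```

In this simplified model, dropping the redundant half and accepting three nines yields nearly twice the average usable capacity from the same power envelope.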

The intricate challenge of supply chain and capacity planning

The physical nature of building vast compute infrastructure introduces significant lead times, with a 2-3 year horizon for securing new gigawatt-scale capacity. This makes accurate long-term planning extremely difficult, as over-predicting capacity leads to wasted investment, while under-predicting means leaving valuable opportunity on the table. This planning challenge is compounded by permitting processes, land acquisition, and the need for utility commitments, which often require long-term power purchase agreements. Furthermore, the supply chain for critical components like memory is a massive issue. Vendors, to avoid concentration risk from relying too heavily on a single customer, often prefer diverse customer bases, complicating procurement and leading to extended lead times.
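The asymmetry between over- and under-predicting can be sketched as a small expected-cost calculation. Everything here is hypothetical: the demand scenarios, their probabilities, and the lost-value figure are invented for illustration, and only the roughly $40 billion per gigawatt build cost comes from the talk.

```python
def planning_cost(built_gw: float, demand_gw: float,
                  capex_per_gw: float, lost_value_per_gw: float) -> float:
    """Cost of a capacity decision once real demand is known (in $B)."""
    idle_gw = max(built_gw - demand_gw, 0.0)       # over-prediction: stranded capex
    shortfall_gw = max(demand_gw - built_gw, 0.0)  # under-prediction: missed opportunity
    return idle_gw * capex_per_gw + shortfall_gw * lost_value_per_gw


def expected_cost(built_gw: float, scenarios: list[tuple[float, float]],
                  capex_per_gw: float, lost_value_per_gw: float) -> float:
    """Probability-weighted cost over (demand_gw, probability) scenarios."""
    return sum(p * planning_cost(built_gw, d, capex_per_gw, lost_value_per_gw)
               for d, p in scenarios)


# Hypothetical demand forecast 2-3 years out: (gigawatts, probability).
scenarios = [(1.0, 0.25), (2.0, 0.50), (3.0, 0.25)]
CAPEX = 40.0       # $B per GW, from the talk
LOST_VALUE = 60.0  # $B per GW of unmet demand, purely assumed

costs = {gw: expected_cost(gw, scenarios, CAPEX, LOST_VALUE) for gw in (1.0, 2.0, 3.0)}
best = min(costs, key=costs.get)
for gw, cost in costs.items():
    print(f"build {gw:.0f} GW -> expected cost ${cost:.0f}B")
print(f"lowest expected cost: build {best:.0f} GW")
```

The point of the sketch is only that both kinds of forecasting error carry a price, and that the best commitment depends on how those prices compare, not that these particular numbers are realistic.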

Systems balance extends beyond compute to the entire ecosystem

System balance is not confined to computation nodes (TPUs/GPUs) but extends to the entire infrastructure ecosystem, including CPUs, storage, and the data center's networking fabric. Vahdat explains that achieving microarchitecture balance within a single CPU is complex; extending this to tens or hundreds of thousands of nodes, while accounting for variations in cache hits or network latency, makes perfect balance (100% MFU, or model FLOPs utilization) practically impossible. Striving for balance across all interconnected components is still essential to avoid bottlenecks. The supply chain presents another layer of challenge: reported memory shortages and intense competition for resources make it harder to procure balanced systems. Efforts are underway to develop technologies that can flexibly reconfigure network topologies, such as optical circuit switching. These aim to improve resource allocation and overall system availability by allowing faster recovery from rack failures and dynamic allocation of network resources.
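MFU can be estimated from training throughput. The sketch below uses the widely used approximation of about 6 FLOPs per parameter per token for dense transformer training (forward plus backward pass, ignoring attention); the talk does not specify how MFU is computed, and the model size, throughput, and chip figures are hypothetical.

```python
def model_flops_utilization(params: float, tokens_per_second: float,
                            num_chips: int, peak_flops_per_chip: float) -> float:
    """Fraction of the fleet's peak FLOP/s spent on useful model math.

    Uses the ~6 * params FLOPs-per-token approximation for dense
    transformer training; sparse or MoE models need a different count.
    """
    achieved_flops = 6.0 * params * tokens_per_second
    peak_flops = num_chips * peak_flops_per_chip
    return achieved_flops / peak_flops


# Hypothetical run: 70B-parameter model, 1M tokens/s, 1024 chips at 1 PFLOP/s each.
mfu = model_flops_utilization(
    params=70e9,
    tokens_per_second=1e6,
    num_chips=1024,
    peak_flops_per_chip=1e15,
)
print(f"MFU: {mfu:.1%}")
```

Any stall from cache misses, network latency, or failed nodes lowers tokens per second and therefore MFU, which is why 100% is a ceiling that real fleets never reach.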

Hardware specialization and the enduring role of generalists

Hardware design is moving towards increased specialization. General-purpose CPUs remain essential, but specialized chips such as Google's latest TPUs are now designed in distinct versions for inference (TPU 8i) and training (TPU 8T). Specialization offers significant performance gains for specific workloads, sometimes reaching 100x the efficiency of CPUs on targeted tasks like matrix algebra. As general-purpose CPU performance improvements have slowed, specialization has become critical for meeting the ever-growing demand from AI. Not all workloads will fit a specialized chip, however, and the required balance of memory, compute, and networking will continue to vary, driving ongoing hardware innovation. The same portfolio approach applies to energy abundance: novel concepts such as distributed floating data centers and orbital solar power are explored alongside enhanced wind, solar, and battery technologies.

Infrastructure as a community asset and responsible scaling

Vahdat stresses that scaling infrastructure should not be an 'at any cost' endeavor. The goal is 'optimal scaling,' which encompasses efficient delivery of capacity to users and, crucially, ensuring that data centers become a positive asset for their local communities. This means proactively managing impacts on noise, water, and power. For example, Google is prioritizing data center designs that use minimal water in water-scarce regions, even if that means slightly lower power efficiency. Infrastructure providers can also serve the power grid by offering demand response: data centers reduce their load during peak demand periods, so utilities can provision less capacity. The overarching philosophy is to move beyond abstract gigawatt figures to tangible, beneficial local deployments that communities welcome, with end-to-end thinking in capacity building.

Common Questions

The primary metric is not the amount of capacity (e.g., gigawatts) but the value and capability delivered to users per dollar spent. This translates to user satisfaction and growth, like daily active users, rather than just the raw power provisioned.

Topics

AI & Machine Learning, Technology & Innovation, AI Scaling, Supply Chain, Hardware Design, Power Efficiency, Data Center Infrastructure, Compute Capacity, System Balance, Reliability Engineering

Mentioned in this video

People

Amin Vahdat

Head of internal infrastructure at Google, responsible for TPUs at scale.

Jensen Huang

CEO of NVIDIA, described as a 'rapid-fire high-throughput LLM' figure, contrasted with Amin's infrastructure focus.

Elon Musk

Mentioned in context of SpaceX partnership with Anthropic.

Norm Jouppi

Stanford PhD mentioned as a potential first-principles thinker behind the decision against Ethernet for TPU supercomputers.

Sundar Pichai

CEO of Google, credited with leadership during the 'ChatGPT code red' period and for the reorg that merged Brain and DeepMind.

Eric Schmidt

Former CEO of Google, mentioned as one of the seven executives with a PhD in Computer Science when Amin Vahdat joined.

Jeff Dean

Senior figure at Google, credited for leadership during the 'ChatGPT code red' period.

Demis Hassabis

CEO of Google DeepMind, credited for leadership during the 'ChatGPT code red' period.

Dario Amodei

Mentioned in context of the SpaceX-Anthropic partnership discussion.

Richard Sutton

Turing Award winner and author of the essay 'The Bitter Lesson', which argues that methods leveraging more computing power tend to yield better results in AI.

Companies

Google

Company where Amin Vahdat works; discussed for its internal infrastructure, TPUs, and company culture shifts.

Twitter

Social media platform where discussions about Google's computing infrastructure scale are observed.

Waymo

Autonomous driving technology company, presented as an example of advanced robotics operating in complex scenarios.

Cerebras

A company in the AI hardware space that Google is a believer in, indicating a diverse and competitive market.

TSMC

Taiwan Semiconductor Manufacturing Company, a key supplier in the semiconductor industry, discussed in the context of supply chain and vendor diversification.

SpaceX

Company involved in a partnership with Anthropic to provide compute capacity.

Anthropic

AI company partnering with SpaceX to utilize compute capacity, indicating high demand for inference compute.

Software & Apps

Gemini

A Google product enabled by TPUs, used as an example for measuring user engagement.

ChatGPT

A large language model mentioned as a competitor to Gemini and in the context of Google's 'code red' response.

Claude

A large language model mentioned as a competitor to Gemini.

LSTM

Long Short-Term Memory networks, a previously dominant algorithm for learning, contrasted with transformers.

Grok

A large language model mentioned as a competitor to Gemini.

Optical Circuit Switches

Hardware used in Google's networking to create programmable topologies and improve availability, particularly for TPUs, by allowing racks to be quickly reconfigured or replaced.

ICI

Inter-Chip Interconnect, a high-speed interconnect technology for linking TPUs, discussed as a factor in system balance.

Cursor

Software platform leveraging capacity on SpaceX XCI, highlighting the demand for inference compute.

NVLink

A high-speed interconnect technology developed by NVIDIA for scaling GPUs, discussed as a factor in system balance.

GPU

Graphics Processing Unit, discussed in comparison to TPUs and as a major product purchased and used by Google.

Transformers

A neural network architecture that significantly improved efficiency for AI models compared to LSTMs, discussed as a major algorithmic breakthrough.

Products

TPU

Tensor Processing Unit developed by Google, crucial for large-scale AI models like Gemini. Discussed in terms of scale, cost, and reliability.

Colossus

A Google cluster mentioned as having low utilization (11% MFU) compared to industry standards.

CPU

Central Processing Unit, part of the system balance discussion, contrasted with TPUs and GPUs.

GB200

An NVIDIA GPU model mentioned alongside other NVIDIA products.

Ethernet

A standard networking protocol, discussed as the conventional wisdom for networking which was debated and ultimately rejected for TPU supercomputers.

H100

A specific model of NVIDIA GPU in high demand, mentioned in the context of hardware usage and industry trends.

TPU 8i

Eighth-generation Google TPU specialized for inference.

Rubin

An NVIDIA product, mentioned as having been announced despite the continued demand for H100 GPUs.

H200

An NVIDIA GPU model mentioned alongside other NVIDIA products.

V200

An NVIDIA GPU model mentioned alongside other NVIDIA products.

TPU 8T

Eighth-generation Google TPU specialized for training.

Concepts

HBM bandwidth

High Bandwidth Memory, a critical component for system balance in AI hardware, discussed in relation to flops and network bandwidth.

Amdahl's Law

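In its usual form, a principle stating that the performance gain from parallelizing a task is limited by the sequential portion of the task. In the talk, the name refers to Amdahl's related 1960s balance principle: every unit of compute needs a proportional amount of I/O. Applied to infrastructure, it highlights the need for balanced I/O, memory, and network bandwidth relative to compute.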

Mixture-of-Experts

An AI model architecture that uses sparse computation, leading to a need for higher memory bandwidth relative to computation ratios.

Torus

A network topology used in Google's TPU infrastructure, particularly for ML training with all-reduce collectives, offering reliability and efficient data dissemination.

Organizations

Berkeley

University mentioned for its strong systems research and as the place where the speaker took undergraduate classes.

Locations

Utah

A state where a massive gigawatt-scale deployment is mentioned as an example of an infrastructure asset for the community.
