NVIDIA's AI Engineers: Brev, Dynamo and Agent Inference at Planetary Scale and "Speed of Light"

Latent Space Podcast
Science & Technology · 4 min read · 86 min video

Mar 8, 2026

Key Moments

TL;DR

NVIDIA scales AI agents with Brev and Dynamo, balancing UX, security, and inference.

Key Insights

1. Limit agent capabilities to reduce risk: an agent should not have all three capabilities (files, internet, code) at once; two at a time is safer and more controllable.

2. Brev redefines GPU access: one-click, SSH-friendly access to GPUs (including home setups like DGX Spark), fundamentally simplifying developer UX and speeding provisioning.

3. OpenClaw and agent tooling drive automation but require robust safeguards: sandboxed tool use in the cloud is favored over corporate networks to limit security exposure.

4. Dynamo enables data-center-scale inference: scale-out, disaggregation of prefill and decode, and intelligent KV caching unlock faster, cheaper, and more scalable inference across multiple models.

5. Developer experience is central at NVIDIA: a culture of hands-on engineering, rapid prototyping, and a willingness to explore "zero-dollar" markets to push new AI capabilities.

6. Real-world deployment requires balancing quality, cost, and latency: models, workflows, and SLAs must be tuned to hit practical targets for production use.

BREV: ONE-CLICK GPU ACCESS AND DEV UX

Brev emerges as a developer-first gateway to GPUs, dramatically lowering the barrier to entry for working with high-end hardware. The vision is simple but profound: give developers a near-instant SSH pathway into a GPU, and frame the hardware in terms users actually care about, with prompts that describe what they want instead of opaque provisioning pages. The team recounts presenting GPUs as prominent UI elements, with large chips and vivid animations, to make the hardware feel approachable and immediately usable. This UX philosophy extends to the DGX Spark home scenario, where Brev can remotely manage a node that behaves like a local slice of a data center, blurring the line between cloud and on-prem. The narrative also highlights acquisition-era momentum: marketing stunts and tangible assets (like custom GPU cards) helped cement a culture of attention to detail and hands-on craft. The upshot is a developer experience streamlined enough that teams can focus on task-level goals, such as deploying a model, testing an idea, or spinning up a training job, without getting bogged down in provisioning minutiae. Crucially, Brev's value proposition aligns with NVIDIA's broader developer-experience strategy: reduce friction, accelerate iteration, and bring GPU power to more hands across a wider spectrum of users, from researchers to startups to enterprise teams.

OPEN CLAW AND AGENT INFRASTRUCTURE AT PLANETARY SCALE

The discussion pivots to agents that can access files, browse the web, and write code—three capabilities that, in combination, create a significant attack surface. A key takeaway is the insistence that agents should be limited to two of these capabilities at a time to minimize risk; if an agent can access the internet and files but not code, or vice versa, you reduce the chance of malware or data exfiltration. The conversation then centers on enforcement points: where and how to run agents matters as much as what they can do. OpenClaw is presented as a pivotal step toward robust tool usage, enabling agents to call external tools safely, while Brev provides a sandboxed environment that keeps sensitive corporate networks out of reach. This approach also underpins a broader strategy: empower developers with powerful capabilities while maintaining strong security rails, ensuring that agent-based workflows remain practical at scale without compromising enterprise safety. The speakers emphasize that this is a live, evolving design problem—one where UX, policy, and architecture must evolve in concert as agent capabilities expand.
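The "two of three" rule described above can be sketched as a simple policy check. This is a minimal illustration, not any real agent framework's API; the capability names and function are hypothetical.

```python
from itertools import combinations

# Hypothetical labels for the three risky capabilities from the discussion;
# not taken from any real agent framework.
RISKY = frozenset({"files", "internet", "code"})

def is_allowed(requested: frozenset) -> bool:
    """Permit an agent at most two of the three risky capabilities at once."""
    unknown = requested - RISKY
    if unknown:
        raise ValueError(f"unknown capabilities: {sorted(unknown)}")
    return len(requested) <= 2

# Any pair of capabilities passes the policy; the full triple is rejected.
assert all(is_allowed(frozenset(pair)) for pair in combinations(RISKY, 2))
assert not is_allowed(RISKY)
```

The point of encoding the rule as a hard gate, rather than a prompt-level instruction, matches the section's emphasis on enforcement points: the check runs where the agent is hosted, not inside the model.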

DYNAMO: SCALING INFERENCE AT DATA CENTER SCALE

Dynamo is introduced as a data-center-scale inference engine designed to sit atop existing model frameworks (vLLM, SGLang, TensorRT-LLM) and accelerate inference by exploiting the economics of scale. A central theme is the inevitability of scale-out rather than mere scale-up: once you reach hardware bottlenecks or interconnect limits (like NVLink vs InfiniBand), simply adding more of the same GPU is not enough. Dynamo embraces disaggregation, isolating prefill (prompt ingestion) from decode (token generation) so scheduling becomes more flexible and efficient. The KV cache, a critical component of stateful generation, is leveraged to maximize cache hits and minimize recomputation. The dialogue about scale also touches on hardware realities (such as memory bandwidths and interconnect speeds) and the trade-offs between latency, cost, and model quality. The overarching message is that usable, cost-effective, planet-scale inference depends on modular, tunable infrastructure that can adapt to evolving models and workloads.
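The KV-cache idea above can be made concrete with a toy routing sketch: send a request to the worker whose resident cache shares the longest token prefix with the request, so only the uncached suffix needs a prefill pass. The `Worker`/`route` names are illustrative, assumed for this sketch; they are not Dynamo's actual API.

```python
def shared_prefix_len(a: list, b: list) -> int:
    """Length of the common token prefix between two sequences."""
    n = 0
    for x, y in zip(a, b):
        if x != y:
            break
        n += 1
    return n

class Worker:
    """Toy stand-in for an inference worker with some KV cache resident."""
    def __init__(self, name: str, cached_prefixes: list):
        self.name = name
        self.cached = cached_prefixes  # token sequences already in KV cache

    def best_hit(self, tokens: list) -> int:
        return max((shared_prefix_len(c, tokens) for c in self.cached), default=0)

def route(workers: list, tokens: list):
    """Return (worker, tokens_still_needing_prefill), maximizing cache hits."""
    best = max(workers, key=lambda w: w.best_hit(tokens))
    return best, len(tokens) - best.best_hit(tokens)

# A worker that has already processed the shared system prompt wins the
# request, and only the two new user tokens go through the prefill stage.
system_prompt = [1, 2, 3, 4]
warm = Worker("warm", [system_prompt])
cold = Worker("cold", [])
worker, to_prefill = route([warm, cold], system_prompt + [9, 9])
assert worker.name == "warm" and to_prefill == 2
```

In a disaggregated deployment, the suffix that misses the cache would be queued to dedicated prefill workers while decode workers stream tokens, which is the scheduling flexibility the section describes.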

INDUSTRY PERSPECTIVES: KYLE'S JOURNEY, CULTURE, AND THE ROAD AHEAD

Kyle's trajectory, from tabular data to graph neural networks and beyond, highlights NVIDIA's culture of exploratory, passion-driven work. He recounts early experiences with recommender systems (RecSys), a shift to graph-based representations, and the realization that GPUs can accelerate a broad class of analytics workloads. The conversation underscores how NVIDIA nurtures internal mobility and cross-pollination: engineers are encouraged to pursue ideas, pitch them to leadership, and ship quickly. This ethos dovetails with the "zero-dollar market" mindset, Jensen's idea of funding exploratory ventures that may take time to pay off but can redefine markets. Kyle emphasizes that the end goal is not just a single product but a capability stack that empowers a wider, more diverse set of developers to leverage GPU-accelerated AI. The segment also reinforces the importance of a well-tuned developer experience to sustain momentum across rapidly evolving AI workflows, from foundation models to real-time inference in production.

Common Questions

Q: Why should an agent be limited to two of its three capabilities (file access, internet access, writing/executing code)?

A: The panel suggests letting an agent exercise only two of these three at a time to minimize risk, since internet access in particular can introduce vulnerabilities when the full scope of what the agent will do isn't known.


Mentioned in this video

Tool: Brev

A developer tool that makes it easy to get a GPU; focuses on one-click SSH access and GPU provisioning across multiple sources.

Tool: Dynamo

NVIDIA's data-center-scale inference engine that sits on top of vLLM, SGLang, and TensorRT-LLM to accelerate large-scale inference with features like KV caching and disaggregation.

Tool: OpenClaw

A platform discussed for running agent-enabled workloads with security considerations; enables safer deployment of agents.

Tool: NIM

NVIDIA Inference Microservices; enterprise-supported tooling around inference and deployments (including Dynamo).

Tool: Grove

Kubernetes-based component in Dynamo that enables scaling specialization between prefill and decode workers.

Tool: Rubin CPX

A prefill-specialized accelerator announced for a future hardware generation to speed up prefill workloads.

Tool: DGX Spark

A compact, Grace Blackwell-based personal AI system; discussed in the context of speed and cost for local development and inference.

Model: Kimi K2

A transformer model design cited in context of hardware-aware model architecture choices (attention heads, experts, etc.).

Model: GPT-OSS

A model discussed in the context of attention design and degeneracy; referenced as part of hardware/software co-design.

Model: Nemotron

A family of models co-designed with hardware to maximize performance on NVIDIA GPUs.

Tool: NVIDIA Sync

A utility to simplify SSH access to GPU systems, part of Brev's value proposition for easy remote usage.

Company: OpenAI

Mentioned as a model provider in a corporate-governance context.

Hardware: Grace Blackwell

NVIDIA superchip architecture referenced in the context of the DGX Spark discussion and developer experience.

Person: Alec

NVIDIA engineer involved in the Brev CLI and agent UX improvements; referenced in the open code/tools discussion.

Tool: NVIDIA OpenCloud

OpenCloud context within NVIDIA for agent-enabled cloud workloads; mention of security constraints.

Person: Nat

NVIDIA engineer involved with Brev and UX aspects; referenced in the discussion about developer tooling.

Person: Carlos

Employee mentioned in a fun anecdote about Codex/Claude Code access; created an Outlook CLI.
