NVIDIA's AI Engineers: Brev, Dynamo and Agent Inference at Planetary Scale and "Speed of Light"

Latent Space Podcast
Science & Technology · 4 min read · 86 min video

Mar 8, 2026

Key Moments

TL;DR

NVIDIA scales AI agents with Brev and Dynamo, balancing UX, security, and inference.

Key Insights

1. Limit agent capabilities to reduce risk: an agent should not have all three capabilities (files, internet, code) at once; two at a time is safer and more controllable.

2. Brev redefines GPU access: one-click, SSH-friendly access to GPUs (including home setups like DGX Spark), fundamentally simplifying developer UX and speeding provisioning.

3. OpenClaw and agent tooling drive automation but require robust safeguards: sandboxed tool use in the cloud is favored over corporate networks to limit security exposure.

4. Dynamo enables data-center-scale inference: scale-out, disaggregation of prefill and decode, and intelligent KV caching unlock faster, cheaper, and more scalable inference across multiple models.

5. Developer experience is central at NVIDIA: a culture of hands-on engineering, rapid prototyping, and a willingness to explore "zero-dollar" markets to push new AI capabilities.

6. Real-world deployment requires balancing quality, cost, and latency: models, workflows, and SLAs must be tuned to hit practical targets for production use.

BREV: ONE-CLICK GPU ACCESS AND DEV UX

Brev emerges as a developer-first gateway to GPUs, dramatically lowering the barrier to entry for working with high-end hardware. The vision is simple but profound: give developers a near-instant SSH pathway into a GPU, and frame the hardware in terms users actually care about, with prompts that describe what they want instead of opaque provisioning pages. The team recounts presenting GPUs as prominent UI elements, with large chips and vivid animations, to make the hardware feel approachable and immediately usable. This UX philosophy extends to the DGX Spark home scenario, where Brev can remotely manage a node that behaves like a local slice of a data center, blurring the line between cloud and on-prem. The narrative also highlights acquisition-era momentum: marketing stunts and tangible assets (like custom GPU cards) helped cement a culture of attention to detail and hands-on craft. The upshot is a developer experience streamlined enough that teams can focus on task-level goals, such as deploying a model, testing an idea, or spinning up a training job, without getting bogged down in provisioning minutiae. Crucially, Brev's value proposition aligns with NVIDIA's broader developer-experience strategy: reduce friction, accelerate iteration, and bring GPU power to more hands across a wider spectrum of users, from researchers to startups to enterprise teams.

OPEN CLAW AND AGENT INFRASTRUCTURE AT PLANETARY SCALE

The discussion pivots to agents that can access files, browse the web, and write code—three capabilities that, in combination, create a significant attack surface. A key takeaway is the insistence that agents should be limited to two of these capabilities at a time to minimize risk; if an agent can access the internet and files but not code, or vice versa, you reduce the chance of malware or data exfiltration. The conversation then centers on enforcement points: where and how to run agents matters as much as what they can do. OpenClaw is presented as a pivotal step toward robust tool usage, enabling agents to call external tools safely, while Brev provides a sandboxed environment that keeps sensitive corporate networks out of reach. This approach also underpins a broader strategy: empower developers with powerful capabilities while maintaining strong security rails, ensuring that agent-based workflows remain practical at scale without compromising enterprise safety. The speakers emphasize that this is a live, evolving design problem—one where UX, policy, and architecture must evolve in concert as agent capabilities expand.
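The "two of three" rule described above can be sketched as a simple policy check. This is a minimal illustration, not any real agent framework's API; the capability names and function are hypothetical.

```python
from itertools import combinations

# Hypothetical labels for the three risky capabilities from the discussion;
# not taken from any real agent framework.
RISKY = frozenset({"files", "internet", "code"})

def is_allowed(requested: frozenset) -> bool:
    """Permit an agent at most two of the three risky capabilities at once."""
    unknown = requested - RISKY
    if unknown:
        raise ValueError(f"unknown capabilities: {sorted(unknown)}")
    return len(requested) <= 2

# Any pair of capabilities passes the policy; the full triple is rejected.
assert all(is_allowed(frozenset(pair)) for pair in combinations(RISKY, 2))
assert not is_allowed(RISKY)
```

The point of encoding the rule as a hard gate, rather than a prompt-level instruction, matches the section's emphasis on enforcement points: the check runs where the agent is hosted, not inside the model.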

DYNAMO: SCALING INFERENCE AT DATA CENTER SCALE

Dynamo is introduced as a data-center-scale inference engine designed to sit atop existing model frameworks (vLLM, SGLang, TensorRT-LLM) and accelerate inference by exploiting the economics of scale. A central theme is the inevitability of scale-out rather than mere scale-up: once you reach hardware bottlenecks or interconnect limits (like NVLink vs InfiniBand), simply adding more of the same GPU is not enough. Dynamo embraces disaggregation, isolating prefill (prompt ingestion) from decode (token generation) so scheduling becomes more flexible and efficient. The KV cache, a critical component of stateful generation, is leveraged to maximize cache hits and minimize recomputation. The dialogue about scale also touches on hardware realities (such as memory bandwidths and interconnect speeds) and the trade-offs between latency, cost, and model quality. The overarching message is that usable, cost-effective, planet-scale inference depends on modular, tunable infrastructure that can adapt to evolving models and workloads.
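The KV-cache idea above can be made concrete with a toy routing sketch: send a request to the worker whose resident cache shares the longest token prefix with the request, so only the uncached suffix needs a prefill pass. The `Worker`/`route` names are illustrative, assumed for this sketch; they are not Dynamo's actual API.

```python
def shared_prefix_len(a: list, b: list) -> int:
    """Length of the common token prefix between two sequences."""
    n = 0
    for x, y in zip(a, b):
        if x != y:
            break
        n += 1
    return n

class Worker:
    """Toy stand-in for an inference worker with some KV cache resident."""
    def __init__(self, name: str, cached_prefixes: list):
        self.name = name
        self.cached = cached_prefixes  # token sequences already in KV cache

    def best_hit(self, tokens: list) -> int:
        return max((shared_prefix_len(c, tokens) for c in self.cached), default=0)

def route(workers: list, tokens: list):
    """Return (worker, tokens_still_needing_prefill), maximizing cache hits."""
    best = max(workers, key=lambda w: w.best_hit(tokens))
    return best, len(tokens) - best.best_hit(tokens)

# A worker that has already processed the shared system prompt wins the
# request, and only the two new user tokens go through the prefill stage.
system_prompt = [1, 2, 3, 4]
warm = Worker("warm", [system_prompt])
cold = Worker("cold", [])
worker, to_prefill = route([warm, cold], system_prompt + [9, 9])
assert worker.name == "warm" and to_prefill == 2
```

In a disaggregated deployment, the suffix that misses the cache would be queued to dedicated prefill workers while decode workers stream tokens, which is the scheduling flexibility the section describes.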

INDUSTRY PERSPECTIVES: KYLE'S JOURNEY, CULTURE, AND THE ROAD AHEAD

Kyle's trajectory, from tabular data to graph neural networks and beyond, highlights NVIDIA's culture of exploratory, passion-driven work. He recounts early experiences with recommender systems (RecSys), a shift to graph-based representations, and the realization that GPUs can accelerate a broad class of analytics workloads. The conversation underscores how NVIDIA nurtures internal mobility and cross-pollination: engineers are encouraged to pursue ideas, pitch them to leadership, and ship quickly. This ethos dovetails with the "zero-dollar market" mindset, Jensen's idea of funding exploratory ventures that may take time to pay off but can redefine markets. Kyle emphasizes that the end goal is not just a single product but a capability stack that empowers a wider, more diverse set of developers to leverage GPU-accelerated AI. The segment also reinforces the importance of a well-tuned developer experience to sustain momentum across rapidly evolving AI workflows, from foundation models to real-time inference in production.

Common Questions

Q: Why should an agent be limited to two of its three capabilities (file access, internet access, writing/executing code)?

A: The panel suggests letting an agent exercise only two of these three at a time to minimize risk, since internet access in particular can introduce vulnerabilities when the full scope of what the agent will do isn't known.


Mentioned in this video

Tool: Brev

A developer tool that makes it easy to get a GPU; focuses on one-click SSH access and GPU provisioning across multiple sources.

Tool: Dynamo

NVIDIA's data-center-scale inference engine that sits on top of vLLM, SGLang, and TensorRT-LLM to accelerate large-scale inference with features like KV caching and disaggregation.

Tool: OpenClaw

A platform discussed for running agent-enabled workloads with security considerations; enables safer deployment of agents.

Tool: NIM

NVIDIA Inference Microservices; enterprise-supported tooling around inference and deployments (including Dynamo).

Tool: Grove

Kubernetes-based component in Dynamo that enables scaling specialization between prefill and decode workers.

Tool: Rubin CPX

A prefill-specialized accelerator announced for a future hardware generation to speed up prefill workloads.

Tool: DGX Spark

A compact, Grace Blackwell-based personal AI system; discussed in the context of speed and cost for local development and inference.

Model: Kimi K2

A transformer model design cited in context of hardware-aware model architecture choices (attention heads, experts, etc.).

Model: GPT-OSS

A model discussed in the context of attention design and degeneracy; referenced as part of hardware/software co-design.

Model: Nemotron

A family of models co-designed with hardware to maximize performance on NVIDIA GPUs.

Tool: NVIDIA Sync

A utility to simplify SSH access to GPU systems, part of Brev's value proposition for easy remote usage.

Company: OpenAI

Mentioned as a model provider in a corporate-governance context.

Hardware: Grace Blackwell

NVIDIA superchip architecture referenced in the context of the DGX Spark discussion and developer experience.

Person: Alec

NVIDIA engineer involved in the Brev CLI and agent UX improvements; referenced in the open code/tools discussion.

Tool: NVIDIA OpenCloud

OpenCloud context within NVIDIA for agent-enabled cloud workloads; mention of security constraints.

Person: Nat

NVIDIA engineer involved with Brev and UX aspects; referenced in the discussion about developer tooling.

Person: Carlos

Employee mentioned in a fun anecdote about Codex/Claude Code access; created an Outlook CLI.
