Inferact: Building the Infrastructure That Runs Modern AI

a16z | 45 min video | Jan 22, 2026
TL;DR

The open-source inference engine vLLM is evolving into a universal AI runtime, stewarded by the company Inferact.

Key Insights

1. vLLM originated at UC Berkeley as an open-source inference runtime and is advancing toward a universal inference layer via Inferact.

2. Inference engine design tackles dynamic prompts, non-deterministic token generation, and memory management with scheduling and KV-cache-aware strategies.

3. Scale, diversity, and agents drive complexity: trillion-parameter models, heterogeneous hardware, and tool/agent workflows complicate optimization.

4. The open-source community is central: 50+ regular contributors, 2,000+ GitHub contributors, global meetups, and broad model and hardware participation.

5. Inferact prioritizes open source as its core strategy, leveraging academic and industry collaboration to create an interoperable, scalable runtime for future AI workloads.

Origin And Vision Of Inferact And vLLM

vLLM began as a UC Berkeley prototype and grew into a widely adopted open-source inference runtime. The speakers emphasize open source as essential to AI infrastructure, with the goal of a universal inference layer that can power any model on any hardware for any application. The founders, Simon Mo and Woosuk Kwon, alongside their academic mentors, aim to sustain, steward, and advance the ecosystem so that vLLM becomes a standard enabling broad collaboration and deployment across diverse silicon and workloads.

How Inference Engines Work: Architecture And Data Path

An inference engine takes a trained model, runs it on accelerated hardware, and aims for maximal throughput and efficiency. The typical flow includes an API/server layer, a tokenizer to convert input into tokens, and an engine with a scheduler and memory manager to orchestrate work and KV caches. A worker then loads and executes the model, handling preprocessing and postprocessing. Inference for language models is dynamic: prompts vary in length, stopping is non-deterministic, and tokens stream back. This requires careful microbatching, per-token scheduling, and memory management to maintain performance.
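The data path above can be made concrete with a toy scheduler. The sketch below is illustrative only, not vLLM's actual implementation: all class names, block sizes, and policies are invented for this example. It shows the two ideas the paragraph describes: per-token (per-step) scheduling of a dynamic batch, and a memory manager that hands out fixed-size KV-cache blocks, preempting a request back to the queue when blocks run out and freeing its blocks when generation stops.

```python
from collections import deque

BLOCK_SIZE = 4          # tokens per KV-cache block (toy value)
TOTAL_BLOCKS = 16       # total KV-cache blocks on the "GPU" (toy value)

class Request:
    def __init__(self, rid, prompt_len, max_new_tokens):
        self.rid = rid
        self.tokens = prompt_len       # tokens whose KV entries must be cached
        self.remaining = max_new_tokens
        self.blocks = []               # KV-cache block ids owned by this request

class Scheduler:
    """Toy per-step scheduler with a block-based KV-cache memory manager."""
    def __init__(self):
        self.free = deque(range(TOTAL_BLOCKS))
        self.waiting = deque()
        self.running = []
        self.finished = []

    def submit(self, req):
        self.waiting.append(req)

    def _grow(self, req):
        # Allocate blocks until the request's tokens fit; report failure if none free.
        while len(req.blocks) * BLOCK_SIZE < req.tokens:
            if not self.free:
                return False
            req.blocks.append(self.free.popleft())
        return True

    def step(self):
        # Admit waiting requests whose prompt KV cache fits (prefill).
        while self.waiting and self._grow(self.waiting[0]):
            self.running.append(self.waiting.popleft())
        # Generate one token per running request (decode), growing KV as needed.
        for req in list(self.running):
            req.tokens += 1
            req.remaining -= 1
            if not self._grow(req):        # out of memory: preempt back to queue
                self.free.extend(req.blocks); req.blocks.clear()
                self.running.remove(req); self.waiting.appendleft(req)
            elif req.remaining == 0:       # request done: free its blocks
                self.free.extend(req.blocks); req.blocks.clear()
                self.running.remove(req); self.finished.append(req)

engine = Scheduler()
engine.submit(Request("a", prompt_len=6, max_new_tokens=3))
engine.submit(Request("b", prompt_len=5, max_new_tokens=2))
while len(engine.finished) < 2:
    engine.step()
print([r.rid for r in engine.finished])  # -> ['b', 'a']
```

Note how the shorter request "b" finishes first even though both were submitted together: because scheduling is per-step rather than per-request, batches recompose every iteration instead of waiting for the slowest member. Production engines layer paged attention, mixed prefill/decode batching, and cache reuse on top of this basic loop.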

Scale, Diversity, And Agents: New Hard Parts Of Inference

The field is getting harder due to scale, diversity, and agent-enabled workflows. Large models are growing toward multi-trillion-parameter scales, necessitating model and data parallelism across many GPUs and nodes. Hardware diversity across NVIDIA, AMD, Google, AWS, and others, plus model architecture variety (different attention mechanisms, tool usage patterns), complicates optimization. The rise of agents and tool use introduces long-running, asynchronous interactions, requiring persistent state management (such as KV caches) and adaptive cache strategies.
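One persistent-state strategy for agent workloads is prefix caching: an agent loop resends the same conversation history on every turn, so the engine can reuse the KV-cache state already computed for that shared prefix and prefill only the new tokens. The sketch below is a toy illustration under invented names; real systems such as vLLM track prefixes per KV-cache block rather than per token, and evict entries under memory pressure.

```python
import hashlib

class PrefixCache:
    """Toy prefix cache: remember which token prefixes already have
    computed KV state, so repeated agent turns skip recomputation."""
    def __init__(self):
        self.store = {}   # prefix hash -> cached prefix length (stands in for KV state)
        self.hits = 0
        self.misses = 0

    def _key(self, tokens):
        return hashlib.sha256(repr(tokens).encode()).hexdigest()

    def lookup(self, tokens):
        # Find the longest cached prefix of the incoming prompt.
        for end in range(len(tokens), 0, -1):
            if self._key(tokens[:end]) in self.store:
                self.hits += 1
                return end          # tokens[:end] need no recomputation
        self.misses += 1
        return 0

    def insert(self, tokens):
        # Record every prefix boundary (real systems do this per block).
        for end in range(1, len(tokens) + 1):
            self.store[self._key(tokens[:end])] = end

cache = PrefixCache()
history = [1, 2, 3, 4]            # shared system prompt + conversation so far
cache.insert(history)
turn2 = history + [5, 6]          # next agent turn extends the same history
reused = cache.lookup(turn2)
print(reused)  # -> 4: only the two new tokens need prefill
```

The longer an agent's history grows, the larger the fraction of compute this reuse saves, which is why adaptive cache strategies matter once interactions become long-running and asynchronous.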

Community, Funding, And Open-Source Governance

vLLM has cultivated a thriving ecosystem with 50+ regular full-time contributors and over 2,000 GitHub contributors. The community spans academia, industry, and hardware vendors, with global meetups helping align efforts across silicon and software. Open-source funding supports continuous integration, testing, and large-scale deployment (potentially millions of GPUs). The development approach combines clear roadmaps and milestone-driven work with open PR reviews and iterative refactoring. Ion Stoica's advisory role guides research-to-practice adoption and talent recruitment.

Inferact's Mission And The Open-Source Future

Inferact positions itself to make vLLM the world's universal inference layer, prioritizing open source as the company's core mission. The team argues that open source fosters diversity in models, hardware, and workloads, enabling faster, more adaptable innovation than proprietary stacks. Real-world deployments (e.g., large-scale inference used by major platforms) illustrate rapid adoption and the need for a scalable, interoperable runtime. Inferact plans to sponsor, maintain, and evolve the ecosystem, tying academic insight to industry-scale deployment for broad AI progress.
