⚡️Accelerators @ 3x NVIDIA H200 perf, Made in the USA - Thomas Sohmers + Mitesh Agrawal, Positron AI
Key Moments
Positron AI builds efficient AI inference accelerators focused on memory bandwidth, outperforming NVIDIA on key performance-per-dollar and performance-per-watt metrics.
Key Insights
AI inference bottleneck is primarily memory bandwidth, not compute, particularly for transformer models.
Positron AI's architecture achieves significantly higher memory bandwidth utilization (93% vs. ~29% on NVIDIA H100).
The company's current FPGA-based accelerators offer 70% higher performance than NVIDIA's H100 at a lower power and price point.
Positron AI prioritizes seamless integration into existing NVIDIA CUDA ecosystems, accepting raw binary weights without recompilation.
Future ASIC generation promises even greater memory capacity and performance improvements over current FPGA solutions.
The company emphasizes capital efficiency, aiming to sell systems and demonstrate strong Return on Invested Capital (ROIC) rather than operate cloud services.
FOUNDING STORY AND EXPERTISE
Thomas Sohmers and Mitesh Agrawal, co-founders of Positron AI, bring extensive semiconductor and AI infrastructure experience. Sohmers, a hardware veteran of Rex Computing and Lambda Labs, identified memory bandwidth as the critical bottleneck in AI inference. Agrawal, an early employee and former COO of Lambda Labs and now Positron's CEO, brings the operational and growth expertise behind the company's go-to-market. Their shared thesis stems from recognizing the limitations of existing hardware for the demands of transformer models.
THE MEMORY BANDWIDTH BOTTLENECK
The core thesis driving Positron AI is that modern AI workloads, especially transformer inference, are overwhelmingly memory-bound rather than compute-bound. Unlike older CNN models, which were compute-intensive, autoregressive transformer decoding reduces largely to matrix-vector multiplications: every weight must be streamed from memory for each generated token while contributing only a couple of floating-point operations, so arithmetic intensity is very low. Traditional architectures optimized for peak FLOPS sit idle under these conditions, leaving memory bandwidth and capacity as the primary constraints on performance.
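A rough back-of-the-envelope illustration of this point (not from the episode; the hidden size and FP16 weight assumption are illustrative): in batch-1 decode, each weight is read once per token and used for only two floating-point operations, so arithmetic intensity sits near 1 FLOP per byte, far below what a modern GPU needs to be compute-bound.

```python
# Arithmetic intensity: why batch-1 transformer decode is memory-bound.
# All model sizes here are assumptions chosen for illustration.

def matvec_intensity(d_in: int, d_out: int, bytes_per_weight: int = 2) -> float:
    """FLOPs per byte moved for one matrix-vector product (batch size 1).

    A d_out x d_in weight matrix is read once per token; each weight
    contributes one multiply and one add (2 FLOPs).
    """
    flops = 2 * d_in * d_out
    bytes_moved = d_in * d_out * bytes_per_weight  # weight traffic dominates
    return flops / bytes_moved

def matmul_intensity(d_in: int, d_out: int, batch: int, bytes_per_weight: int = 2) -> float:
    """FLOPs per byte when the same weights are reused across a batch (prefill-like)."""
    flops = 2 * d_in * d_out * batch
    bytes_moved = d_in * d_out * bytes_per_weight
    return flops / bytes_moved

if __name__ == "__main__":
    d = 8192  # hidden size of a hypothetical large transformer layer
    print(f"decode  (batch 1):    {matvec_intensity(d, d):.1f} FLOPs/byte")
    print(f"prefill (batch 2048): {matmul_intensity(d, d, 2048):.1f} FLOPs/byte")
    # A GPU needs on the order of hundreds of FLOPs per byte of HBM traffic
    # to keep its tensor cores busy, so ~1 FLOP/byte decode is bandwidth-bound.
```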
POSITRON AI'S ARCHITECTURAL ADVANTAGE
Positron AI's architecture is designed to maximize memory bandwidth utilization, achieving up to 93% of theoretical bandwidth versus roughly 29% on NVIDIA's H100. This is enabled by compute elements sized for a 1:1 ratio of FLOPs to memory operations, matching the datapath to the low arithmetic intensity of inference. By focusing on this constraint, their current FPGA-based accelerators deliver superior performance per watt and per dollar even against high-end NVIDIA GPUs.
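To see why utilization matters, here is a minimal roofline-style sketch: batch-1 decode throughput is bounded by effective bandwidth divided by the bytes of weights read per token. Only the 29% and 93% utilization figures come from the discussion; the model size, peak-bandwidth number, and FP16 assumption are illustrative.

```python
# Roofline-style estimate of decode throughput from memory bandwidth.

def tokens_per_second(peak_bw_gbs: float, utilization: float,
                      params_billions: float, bytes_per_param: int = 2) -> float:
    """Upper bound on batch-1 decode rate when every weight byte is read once per token."""
    bytes_per_token = params_billions * 1e9 * bytes_per_param
    effective_bw = peak_bw_gbs * 1e9 * utilization
    return effective_bw / bytes_per_token

if __name__ == "__main__":
    model_b = 70        # hypothetical 70B-parameter model, FP16 weights
    hbm_peak = 3350     # H100 SXM HBM3 peak, roughly 3.35 TB/s
    print(f"~29% utilization: {tokens_per_second(hbm_peak, 0.29, model_b):.1f} tok/s")
    print(f"~93% utilization: {tokens_per_second(hbm_peak, 0.93, model_b):.1f} tok/s")
```

With these assumed numbers the same memory system yields roughly 7 versus 22 tokens per second, which is the practical meaning of the utilization gap.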
SEAMLESS ECOSYSTEM INTEGRATION
A key differentiator for Positron AI is its commitment to simplifying adoption. Instead of requiring users to recompile models or change existing workflows, their hardware directly ingests the raw binary weights produced by model training in NVIDIA's CUDA ecosystem. This 'zero-step' integration lets models trained on NVIDIA GPUs run on Positron AI's hardware with minimal effort, so customers gain the performance and efficiency benefits without disrupting their development pipelines.
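As an illustration of what such integration could look like from the user's side (the accelerator call below is hypothetical and not Positron's actual SDK), weights serialized from a standard GPU training pipeline, such as a safetensors checkpoint, are read as-is and handed to the device with no conversion or recompilation pass.

```python
# Illustrative only: "zero-step" ingestion of raw checkpoint weights.
# `device.load` is a hypothetical accelerator call, not a real API; the point
# is that the checkpoint bytes from a GPU workflow are consumed unchanged.

from safetensors import safe_open  # common serialization format for GPU-trained models

def load_raw_weights(path: str) -> dict:
    """Read tensors exactly as they were saved by the NVIDIA/PyTorch workflow."""
    weights = {}
    with safe_open(path, framework="numpy") as f:
        for name in f.keys():
            weights[name] = f.get_tensor(name)  # raw binary weights, no conversion
    return weights

# weights = load_raw_weights("checkpoint/model.safetensors")  # path is illustrative
# device.load(weights)  # hypothetical accelerator call; no graph recompilation step
```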
ROADMAP TO DEDICATED SILICON
While currently shipping FPGA-based solutions for rapid market entry, Positron AI is developing next-generation custom silicon (ASIC). This dedicated hardware will further enhance performance and efficiency by addressing the inherent limitations of FPGAs, offering significantly more memory capacity and surpassing current industry benchmarks. The company aims for these ASICs to be tape-out ready by late 2026, promising a substantial leap in inference capabilities.
BUSINESS STRATEGY AND CAPITAL EFFICIENCY
Positron AI's business model is centered on selling hardware systems and demonstrating strong Return on Invested Capital (ROIC) for its customers, eschewing the cloud-service model adopted by some competitors. They emphasize capital efficiency in their funding, having raised $75 million across seed and Series A rounds to focus on product development and customer acquisition. This approach aims to deliver tangible value and economic viability, positioning Positron AI as a strong contender in the AI hardware market.
PERFORMANCE METRICS AND FUTURE APPLICATIONS
The company's performance advantage is concentrated in token generation (inference decode) rather than prompt prefill. That makes their solutions relevant not only for text generation but also for emerging modalities such as video and long-form reasoning, where the generation stage dominates end-to-end latency and cost. By providing large memory capacity and high bandwidth, Positron AI aims to support multimodal AI and enable more efficient, cost-effective inference at scale.
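A minimal sketch of that prefill/decode split, with assumed model and hardware numbers (only the decode-versus-prefill framing comes from the episode): prefill cost scales with compute throughput, while each generated token re-reads the full weight set, so long outputs are dominated by the bandwidth-bound decode phase.

```python
# Rough split of request time into prefill (compute-bound) and decode
# (bandwidth-bound). All model and hardware numbers are assumptions.

def prefill_seconds(prompt_tokens: int, params_b: float, peak_tflops: float,
                    flops_utilization: float = 0.4) -> float:
    """~2 FLOPs per parameter per prompt token, run at a fraction of peak compute."""
    flops = 2 * params_b * 1e9 * prompt_tokens
    return flops / (peak_tflops * 1e12 * flops_utilization)

def decode_seconds(output_tokens: int, params_b: float, bw_tbs: float,
                   bw_utilization: float, bytes_per_param: int = 2) -> float:
    """Each output token re-reads the full weight set from memory."""
    bytes_per_token = params_b * 1e9 * bytes_per_param
    return output_tokens * bytes_per_token / (bw_tbs * 1e12 * bw_utilization)

if __name__ == "__main__":
    params, prompt, output = 70, 2000, 2000  # hypothetical 70B model and request shape
    p = prefill_seconds(prompt, params, peak_tflops=990)          # H100-class FP16 peak
    d = decode_seconds(output, params, bw_tbs=3.35, bw_utilization=0.3)
    print(f"prefill ~ {p:.1f}s, decode ~ {d:.1f}s")  # decode dominates for long outputs
```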
Common Questions
What is Positron AI building?
Positron AI is developing specialized hardware accelerators focused on improving performance per dollar and performance per watt for AI inference workloads, particularly transformers. Their mission is to address the memory bandwidth bottleneck that limits current AI systems.
Topics
Mentioned in this video
Application-Specific Integrated Circuits are the target for Positron's next generation of silicon, promising greater efficiency and capacity than FPGAs.
TF32 is a 19-bit number format used by NVIDIA. Positron AI uses a similar format in their FPGA, offering better precision than BF16.
Mentioned as an example of an image generation system using a pure autoregressive transformer.
Field-Programmable Gate Arrays are discussed as a prototyping and simulation tool. Positron leverages them for their current product but plans to move to dedicated silicon for the next generation.
BF16 is a 16-bit floating-point format mentioned in comparison to NVIDIA's TF32.
A process node for semiconductor manufacturing relevant to Positron AI's hiring needs for silicon engineers.
Specialized chips for cryptocurrency mining that Thomas Sohmers worked on before Lambda Labs.
Positron AI is a company focused on developing specialized hardware accelerators for AI inference, aiming for higher performance per dollar and watt.
Investor in Positron AI's Series A funding round.
Etched is another AI chip startup that has faced public criticism from George Hotz; Positron acknowledges the difficulty of new silicon but disagrees with Etched's hardening approach.
Another Google image generation model that has transitioned to an autoregressive transformer architecture.
George Hotz is mentioned for his public criticism of Etched regarding their approach to AI chip design.
A customer utilizing Positron AI's hardware for raw performance in their inference-as-a-service offerings.
A venture capital firm that participated in Positron AI's Series A funding round.
Rex Computing was Thomas Sohmers' first semiconductor startup, focused on DSP for mobile base station workloads.