DeepSeek V3, SGLang, and the State of Open Model Inference in 2025 (Quantization, MoEs, Pricing)
Key Moments
DeepSeek V3, SGLang, and open model inference are advancing rapidly, driven by techniques such as FP8 quantization and Mixture-of-Experts (MoE) architectures.
Key Insights
DeepSeek V3 is a state-of-the-art open-source 671B MoE model, pushing the boundaries of AI.
Serving massive models like DeepSeek V3 requires advanced hardware (H200, multinodes) and custom kernel support (FP8).
Baseten focuses on dedicated inference for custom workflows, prioritizing latency, throughput, and control over shared endpoints.
FP8 training is a growing trend, offering performance benefits but requiring new inference kernels and tooling.
Mixture-of-Experts (MoE) architectures, like DeepSeek V3, are gaining traction, potentially outperforming dense models.
SGLang is an emerging inference engine offering high performance and improved usability, optimizing for large batches and long contexts.
Mission-critical inference requires a three-pillar approach: model-level performance, horizontal scaling, and robust workflow enablement.
While frameworks like SGLang and Triton are crucial, they are part of a larger infrastructure challenge for scalable and reliable inference.
THE RISE OF DEEPSEEK V3 AND OPEN SOURCE MODELS
The conversation kicks off with the highly anticipated DeepSeek V3, a massive 671-billion-parameter Mixture-of-Experts (MoE) model. It is positioned as the leading open-source large language model (LLM), evidenced by strong results on benchmarks and leaderboards. Its scale and capabilities, including FP8 mixed-precision training and advanced attention mechanisms, make it a game-changer. A notable trend highlighted is the increasing willingness of Chinese labs to release open-weight models, a practice that benefits platforms like Baseten and the broader AI community.
CHALLENGES AND SOLUTIONS FOR LARGE MODEL INFERENCE
Serving colossal models like DeepSeek V3 presents significant hardware and software challenges. Even with FP8 quantization, the weights alone occupy roughly 671 GB, with the KV cache on top of that, exceeding what a standard 8x H100 node can hold and necessitating H200 GPUs or multinode setups. Furthermore, supporting novel inference optimizations like FP8 requires custom kernel development, as standard frameworks may not yet offer native support. Debugging and performance benchmarking also become more complex due to the long loading times of such large models.
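The arithmetic behind those hardware requirements can be sketched in a few lines (illustrative figures only; real deployments also need headroom for activations, the KV cache, and runtime overhead):

```python
import math

# Back-of-envelope memory math for serving DeepSeek V3 (671B parameters).
PARAMS = 671e9
BYTES_FP8 = 1    # FP8 stores one byte per weight
BYTES_BF16 = 2   # BF16 stores two bytes per weight

weights_fp8_gb = PARAMS * BYTES_FP8 / 1e9    # ~671 GB
weights_bf16_gb = PARAMS * BYTES_BF16 / 1e9  # ~1342 GB

H100_GB = 80    # HBM capacity per H100
H200_GB = 141   # HBM capacity per H200

# GPUs needed just to hold the FP8 weights -- the KV cache needs more on top.
min_h100 = math.ceil(weights_fp8_gb / H100_GB)  # 9: overflows a standard 8x H100 node
min_h200 = math.ceil(weights_fp8_gb / H200_GB)  # 5: fits on one 8x H200 node with headroom
```

This is why the episode frames H200s (or multinode H100 setups) as the practical floor for this model: even before the KV cache, the FP8 weights alone do not fit on a single eight-GPU H100 node.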
BASETEN'S STRATEGY: DEDICATED INFERENCE FOR CUSTOM WORKFLOWS
Baseten differentiates itself by not offering shared inference endpoints for popular open-source models. Instead, it focuses on providing dedicated, managed inference resources for companies with custom models or specific workflow requirements. This approach caters to clients needing strict latency, throughput, security, and compliance guarantees. Pricing is consumption-based, reflecting the hardware and resources used, whether on Baseten's infrastructure or the customer's own cloud environment, which increasingly includes multi-cloud deployments.
FP8 TRAINING AND THE FUTURE OF MODEL QUANTIZATION
Native FP8 training, as utilized by DeepSeek V3, is emerging as a significant trend. While BF16 has been common, FP8 offers computational advantages. However, implementing FP8 inference necessitates specialized kernels, as standard libraries may not support it. The discussion touches upon research indicating benefits of quantization down to six bits or smaller, suggesting a potential shift in how models are trained and deployed for efficiency, though the trade-offs and specific requirements for maintaining output quality are critical considerations.
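To make the quantization idea concrete, here is an illustrative NumPy simulation of per-block scaled quantization. It uses int8 as a stand-in for FP8 (real FP8 kernels store an FP8 payload plus a per-block scale, in the spirit of DeepSeek V3's blockwise scheme); the point is that keeping one scale per small block prevents a single outlier from crushing precision across the whole tensor:

```python
import numpy as np

def blockwise_quantize(x, block=128):
    """Simulate per-block scaled low-precision quantization.

    int8 stands in for FP8 here; the mechanism -- one scale factor per
    block of 128 values -- is what blockwise FP8 schemes rely on.
    """
    x = x.reshape(-1, block)
    scale = np.abs(x).max(axis=1, keepdims=True) / 127.0  # per-block scale
    scale[scale == 0] = 1.0                               # avoid divide-by-zero
    q = np.clip(np.round(x / scale), -127, 127).astype(np.int8)
    return q, scale

def dequantize(q, scale):
    return q.astype(np.float32) * scale

rng = np.random.default_rng(0)
w = rng.normal(size=(1024,)).astype(np.float32)
q, s = blockwise_quantize(w)
err = np.abs(dequantize(q, s).ravel() - w).max()  # worst-case rounding error
```

Because each block's scale tracks only that block's largest value, the worst-case rounding error stays at half a quantization step per block rather than per tensor.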
THE ASCENDANCE OF MIXTURE-OF-EXPERTS (MOE) ARCHITECTURES
Mixture-of-Experts (MoE) models, exemplified by DeepSeek V3, are becoming increasingly relevant. While some large labs have focused on dense models, the success of MoE architectures in open models suggests a promising direction. Internal use of MoE by companies like Baidu and ByteDance further validates this trend. The inherent complexities of MoE training and inference optimization are driving innovation in the field, with a growing emphasis on efficient MoE inference.
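A minimal sketch of the routing idea behind MoE layers (toy NumPy code, not DeepSeek V3's actual router, which adds shared experts and load-balancing terms): each token activates only its top-k experts, which is why total parameter count can grow far faster than per-token compute.

```python
import numpy as np

def moe_forward(x, gate_w, experts, top_k=2):
    """Minimal top-k MoE layer: route each token to its k highest-scoring
    experts and combine their outputs, weighted by the renormalized router
    probabilities. Only k of the experts run per token."""
    logits = x @ gate_w                          # (tokens, n_experts)
    topk = np.argsort(logits, axis=-1)[:, -top_k:]
    out = np.zeros_like(x)
    for t in range(x.shape[0]):
        scores = logits[t, topk[t]]
        weights = np.exp(scores - scores.max())
        weights /= weights.sum()                 # softmax over selected experts only
        for w, e in zip(weights, topk[t]):
            out[t] += w * experts[e](x[t])
    return out

# Toy usage: 8 experts, each a simple linear map over 16-dim tokens.
rng = np.random.default_rng(0)
d, n_experts = 16, 8
expert_mats = [rng.normal(size=(d, d)) / np.sqrt(d) for _ in range(n_experts)]
experts = [lambda v, M=M: v @ M for M in expert_mats]
gate_w = rng.normal(size=(d, n_experts))
y = moe_forward(rng.normal(size=(4, d)), gate_w, experts)
```

With top_k=2 of 8 experts, per-token compute is roughly a quarter of a dense layer holding the same parameters; the inference-side complexity the episode discusses comes from batching tokens that route to different experts efficiently.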
SGLANG: A NEW FRONTIER IN LLM INFERENCE ENGINES
SGLang is presented as a next-generation LLM inference engine designed for high performance and improved developer experience. It excels in scenarios involving large batch sizes and extended context lengths, offering significant throughput advantages over existing solutions like vLLM and TensorRT-LLM. SGLang's unique features include optimizations for specific models like DeepSeek V3, support for advanced techniques like blockwise FP8 kernels, and a sophisticated caching mechanism (Radix Cache) for enhanced KV cache utilization critical for large models.
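The caching idea can be illustrated with a toy token-level prefix tree (a sketch of the concept, not SGLang's implementation): with a block size of one, any shared token prefix between requests is reusable, not just whole fixed-size blocks.

```python
class PrefixCache:
    """Toy token-level prefix cache (block size 1), in the spirit of
    SGLang's Radix Cache: cached sequences form a tree keyed by tokens,
    and a new request reuses its longest matching prefix."""

    def __init__(self):
        self.root = {}

    def insert(self, tokens):
        node = self.root
        for t in tokens:
            node = node.setdefault(t, {})

    def match_len(self, tokens):
        """Length of the longest cached prefix of `tokens`."""
        node, n = self.root, 0
        for t in tokens:
            if t not in node:
                break
            node, n = node[t], n + 1
        return n

cache = PrefixCache()
cache.insert([1, 2, 3, 4, 5, 6, 7])        # first request's tokens
hit = cache.match_len([1, 2, 3, 4, 9, 9])  # second request shares 4 tokens

# With a block size of 32, a 4-token overlap reuses nothing (4 // 32 == 0);
# with block size 1, all 4 shared tokens are reused.
```

For workloads with many requests sharing long system prompts or few-shot prefixes, this finer granularity is what drives the higher cache hit rates discussed in the episode.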
KEY TECHNIQUES POWERING SGLANG'S PERFORMANCE
SGLang employs several innovative techniques. Radix Cache, a prefix caching method that uses a block size of one, significantly improves cache hit rates compared to the larger block sizes used by other frameworks. Constrained decoding and 'jump forward' make generation of structured outputs more efficient by pre-determining forced output tokens and skipping unnecessary decoding steps. While powerful, these advanced features can be complex to maintain and to combine with other optimizations, as seen in jump forward being disabled by default.
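A toy sketch of the jump-forward idea (hypothetical FSM and token granularity, purely illustrative): whenever the grammar permits exactly one continuation, the engine emits it directly instead of paying for a model forward pass.

```python
def jump_forward_decode(fsm, sample_token):
    """Sketch of constrained decoding with jump-forward: at each FSM state,
    if the grammar allows exactly one next token, emit it for free; only
    genuine choices require a (costly) model call."""
    out, calls, state = [], 0, fsm["start"]
    while state != fsm["end"]:
        allowed = fsm["edges"][state]          # {token: next_state}
        if len(allowed) == 1:                  # forced token: jump forward
            tok, state = next(iter(allowed.items()))
        else:                                  # real choice: run the model
            tok = sample_token(state, list(allowed))
            state = allowed[tok]
            calls += 1
        out.append(tok)
    return out, calls

# Hypothetical grammar for `{"name": "<A|B>"}` -- the fixed JSON scaffolding
# is forced, so only the value needs a model call.
fsm = {
    "start": 0, "end": 4,
    "edges": {
        0: {'{"name": "': 1},
        1: {"A": 2, "B": 2},
        2: {'"': 3},
        3: {"}": 4},
    },
}
tokens, model_calls = jump_forward_decode(fsm, lambda state, opts: opts[0])
# model_calls == 1: only the value token required a forward pass.
```

The maintenance cost mentioned above comes from making this token-skipping interact correctly with features like the KV cache and batching, which helps explain why jump forward ships disabled by default.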
SPECULATIVE DECODING AND THE EVOLUTION OF STRUCTURED OUTPUTS
Speculative decoding, a technique that uses a smaller 'draft' model to predict tokens, is a key area of development for improving inference speed. Frameworks like SGLang support it, but its effectiveness relies heavily on the quality of the draft model training. Similarly, generating structured outputs, whether JSON or through defined grammars like X-grammar, is becoming more robust. X-grammar, in particular, is highlighted for its performance and integration into engines like TensorRT-LLM, offering accuracy and speed benefits.
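The draft-and-verify loop can be sketched as follows (a greedy toy version; real implementations verify all k proposals in one batched target pass and use probabilistic acceptance to preserve the target model's distribution exactly):

```python
def speculative_step(target_next, draft_next, context, k=4):
    """One round of greedy speculative decoding: the cheap draft model
    proposes k tokens; the target model checks them, keeping the agreed
    prefix and substituting its own token at the first disagreement."""
    proposal, ctx = [], list(context)
    for _ in range(k):                  # cheap draft pass, token by token
        t = draft_next(ctx)
        proposal.append(t)
        ctx.append(t)

    accepted, ctx = [], list(context)
    for t in proposal:
        correct = target_next(ctx)      # in practice: one batched verify pass
        if t == correct:
            accepted.append(t)
            ctx.append(t)
        else:
            accepted.append(correct)    # target's token replaces the miss
            break
    return accepted

# Toy models over integer "tokens": the target counts up; the draft agrees
# except on every 4th position.
target = lambda ctx: len(ctx)
draft = lambda ctx: len(ctx) + (1 if len(ctx) % 4 == 3 else 0)
out = speculative_step(target, draft, context=[0, 1], k=4)
# out == [2, 3]: one draft token accepted plus the target's correction.
```

The speed-up hinges entirely on the draft's agreement rate with the target, which is why the episode stresses that draft-model training quality determines whether speculative decoding pays off.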
ROLES OF FRAMEWORKS IN MISSION-CRITICAL INFERENCE
The discussion emphasizes that inference frameworks are only one component of mission-critical inference. Three essential pillars are identified: model-level performance (enhanced by frameworks like SGLang, Triton, and vLLM), horizontal scaling of models across regions and clouds to meet demand and avoid resource starvation, and enablement of complex, low-latency, multi-step workflows. Frameworks contribute primarily to the first pillar but must be integrated within a robust infrastructure for true production readiness.
TRAINING TRENDS: RL AND CUSTOMIZATION VS. GENERAL REASONING
The conversation touches upon the role of Reinforcement Learning (RL) and human-in-the-loop training for specialized tasks, particularly in domains like healthcare. While advanced reasoning capabilities in future models might reduce the need for extensive fine-tuning, traditional fine-tuning and RL methods remain crucial for many current use cases. The cost-effectiveness and long-term viability of purely reasoning-based models versus specialized, fine-tuned models remain open questions.
THE DEVELOPER EXPERIENCE AND CHOICE IN INFERENCE RUNTIMES
Baseten embraces a multi-framework approach, supporting TensorRT-LLM, vLLM, and SGLang based on customer requirements. While TensorRT-LLM is known for raw performance, SGLang offers a compelling combination of speed and usability, making it a recommended choice for models like DeepSeek V3. Developers increasingly value flexibility, allowing them to pick the best-suited framework for specific inference workloads rather than being locked into a single solution.
THE FUTURE OF INFERENCE: SCALABILITY AND WORKFLOW ENABLERS
Looking ahead, the focus remains on enabling scalable, reliable, and low-latency inference. This involves not only optimizing individual models but also building sophisticated infrastructure to handle massive horizontal scaling across multiple clouds and regions. Furthermore, tools that simplify the orchestration of complex, multi-step, multi-modal inference workflows are critical for enabling advanced AI applications, ensuring that performance gains in model architecture translate into real-world capabilities.
Common Questions
What makes DeepSeek V3 notable?
DeepSeek V3 is notable for being a 671 billion parameter Mixture-of-Experts (MoE) model, currently considered a leading open-source LLM based on benchmarks. Its large size and FP8 precision make it a game-changer for the open-source AI community.
Mentioned in this video
A commercial model from which users are migrating to open-source alternatives like DeepSeek V3 hosted on Baseten, often due to rate limiting, high prices, latency requirements, or a desire for model control.
Baseten's open-source library for model packaging and deployment, designed to make serving models easier, with deep integration for frameworks like Triton and SGLang.
A tool used for generating structured outputs from LLMs by converting schemas to finite state machines, though XGrammar is seen as a higher-performing alternative.
A 671 billion parameter Mixture-of-Experts (MoE) model, considered a leading open-source LLM based on benchmark and chat arena results, notable for its large size and FP8 precision.
A feature within Truss designed for building multi-step, multi-model inference workloads with low latency, enabling models to stream data to each other.
A larger LLaMA model (405 billion parameters) that has seen limited adoption, with users finding the performance gains at inference not worth the cost compared to the 70B version.
A popular LLM inference engine, often compared to SGLang and Triton, known for its performance but sometimes criticized for potential code messiness and difficulty in extending.
A library for training transformer-based language models, including Reinforcement Learning (RL) trainers, which Baseten supports.
A common model size hosted on Baseten, often considered a sweet spot for performance and cost, though larger models like DeepSeek V3 are also being supported.
An inference serving software developed by NVIDIA, which Truss has had to extend and sometimes replace with custom versions for specific performance and reliability needs.
A tool from MLC that converts schemas (like JSON) into finite state machines for use in constrained decoding, preferred by the speaker over Outline for its performance.
A code editor that the speaker uses daily, which integrated support for DeepSeek V3 after Base10 released their support.
Google Cloud Platform, another public cloud Baseten utilizes and where customers might have committed resources for BYOC deployments.
A model mentioned by Jeff Dean which is believed to utilize a Mixture-of-Experts (MoE) architecture, previously not publicly known.
A framework for Reinforcement Learning from Human Feedback (RLHF), supported by Baseten.
An inference engine and framework for large language models, developed for high performance and better usability compared to other frameworks, and optimized for specific models and for techniques like constrained decoding and radix attention.
Amazon Web Services, one of the public clouds Baseten runs on, and a cloud provider where customers may have committed resources for BYOC (Bring Your Own Cloud) deployments.
A precision format (8-bit floating point) used for DeepSeek V3 weights, requiring specific kernel support for inference, and a trend in native quantization during training.
A prefix caching technology supported by SGLang, which uses a block size of 1 for potentially higher cache hit rates compared to frameworks using larger block sizes like 32.
A technique supported by SGLang that uses finite state machines (derived from schemas like JSON via tools like XGrammar) to control output, enabling faster decoding by skipping unnecessary tokens.
A default precision format (Bfloat16) often used for training LLMs, contrasted with FP8 which requires specific kernel implementation for inference.
The Health Insurance Portability and Accountability Act, relevant for compliance in certain inference workloads discussed by Baseten.
An architectural trend where models consist of multiple 'expert' sub-networks, used in models like DeepSeek V3, offering potential benefits in efficiency and performance, with ongoing optimization efforts.
A technique for speeding up inference by using a draft model to predict tokens, which are then verified by the larger target model. Support for variations exists in SGLang and other frameworks.
A methodology for building software-as-a-service applications, suggested as a model for a manifesto on mission-critical inference workloads.
A company whose use case of AI phone calls requires multi-step inference (transcription, LLM calls, text-to-speech), facilitated by Truss Chains for low latency.
An organization where the creators of SGLang are members of technical staff, indicating a connection and potential influence.
An LLM inference platform that hosts and serves large models like DeepSeek V3, offering dedicated infrastructure and a consumption-based pricing model, rather than shared endpoints.