DeepSeek V3, SGLang, and the State of Open Model Inference in 2025 (Quantization, MoEs, Pricing)
Key Moments
DeepSeek V3, SGLang, and open model inference are advancing rapidly, driven by techniques such as FP8 quantization and Mixture-of-Experts (MoE) architectures.
Key Insights
DeepSeek V3 is a state-of-the-art open-source 671B MoE model, pushing the boundaries of AI.
Serving massive models like DeepSeek V3 requires advanced hardware (H200, multinodes) and custom kernel support (FP8).
Baseten focuses on dedicated inference for custom workflows, prioritizing latency, throughput, and control over shared endpoints.
FP8 training is a growing trend, offering performance benefits but requiring new inference kernels and tooling.
Mixture-of-Experts (MoE) architectures, like DeepSeek V3, are gaining traction, potentially outperforming dense models.
SGLang is an emerging inference engine offering high performance and improved usability, optimizing for large batches and long contexts.
Mission-critical inference requires a three-pillar approach: model-level performance, horizontal scaling, and robust workflow enablement.
While frameworks like SGLang and Triton are crucial, they are part of a larger infrastructure challenge for scalable and reliable inference.
THE RISE OF DEEPSEEK V3 AND OPEN SOURCE MODELS
The conversation kicks off with the highly anticipated DeepSeek V3, a massive 671-billion-parameter Mixture-of-Experts (MoE) model. It is positioned as the leading open-source large language model (LLM), evidenced by strong results on benchmarks and leaderboards. Its scale and capabilities, including FP8 mixed-precision training and advanced attention mechanisms, make it a game-changer. A notable trend highlighted is the increasing willingness of Chinese labs to release open-weight models, a practice that benefits platforms like Baseten and the broader AI community.
CHALLENGES AND SOLUTIONS FOR LARGE MODEL INFERENCE
Serving colossal models like DeepSeek V3 presents significant hardware and software challenges. Even with FP8 quantization, the weights alone occupy roughly 671 GB, with the KV cache on top of that, exceeding what a standard 8x H100 node can hold and necessitating H200 GPUs or multinode setups. Furthermore, supporting novel inference optimizations like FP8 requires custom kernel development, as standard frameworks may not yet offer native support. Debugging and performance benchmarking also become more complex due to the long loading times of such large models.
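The arithmetic behind those hardware requirements can be sketched in a few lines (illustrative figures only; real deployments also need headroom for activations, the KV cache, and runtime overhead):

```python
import math

# Back-of-envelope memory math for serving DeepSeek V3 (671B parameters).
PARAMS = 671e9
BYTES_FP8 = 1    # FP8 stores one byte per weight
BYTES_BF16 = 2   # BF16 stores two bytes per weight

weights_fp8_gb = PARAMS * BYTES_FP8 / 1e9    # ~671 GB
weights_bf16_gb = PARAMS * BYTES_BF16 / 1e9  # ~1342 GB

H100_GB = 80    # HBM capacity per H100
H200_GB = 141   # HBM capacity per H200

# GPUs needed just to hold the FP8 weights -- the KV cache needs more on top.
min_h100 = math.ceil(weights_fp8_gb / H100_GB)  # 9: overflows a standard 8x H100 node
min_h200 = math.ceil(weights_fp8_gb / H200_GB)  # 5: fits on one 8x H200 node with headroom
```

This is why the episode frames H200s (or multinode H100 setups) as the practical floor for this model: even before the KV cache, the FP8 weights alone do not fit on a single eight-GPU H100 node.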
BASETEN'S STRATEGY: DEDICATED INFERENCE FOR CUSTOM WORKFLOWS
Baseten differentiates itself by not offering shared inference endpoints for popular open-source models. Instead, it focuses on providing dedicated, managed inference resources for companies with custom models or specific workflow requirements. This approach caters to clients needing strict latency, throughput, security, and compliance guarantees. Pricing is consumption-based, reflecting the hardware and resources used, whether on Baseten's infrastructure or the customer's own cloud environment, which increasingly includes multi-cloud deployments.
FP8 TRAINING AND THE FUTURE OF MODEL QUANTIZATION
Native FP8 training, as utilized by DeepSeek V3, is emerging as a significant trend. While BF16 has been common, FP8 offers computational advantages. However, implementing FP8 inference necessitates specialized kernels, as standard libraries may not support it. The discussion touches upon research indicating benefits of quantization down to six bits or smaller, suggesting a potential shift in how models are trained and deployed for efficiency, though the trade-offs and specific requirements for maintaining output quality are critical considerations.
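To make the quantization idea concrete, here is an illustrative NumPy simulation of per-block scaled quantization. It uses int8 as a stand-in for FP8 (real FP8 kernels store an FP8 payload plus a per-block scale, in the spirit of DeepSeek V3's blockwise scheme); the point is that keeping one scale per small block prevents a single outlier from crushing precision across the whole tensor:

```python
import numpy as np

def blockwise_quantize(x, block=128):
    """Simulate per-block scaled low-precision quantization.

    int8 stands in for FP8 here; the mechanism -- one scale factor per
    block of 128 values -- is what blockwise FP8 schemes rely on.
    """
    x = x.reshape(-1, block)
    scale = np.abs(x).max(axis=1, keepdims=True) / 127.0  # per-block scale
    scale[scale == 0] = 1.0                               # avoid divide-by-zero
    q = np.clip(np.round(x / scale), -127, 127).astype(np.int8)
    return q, scale

def dequantize(q, scale):
    return q.astype(np.float32) * scale

rng = np.random.default_rng(0)
w = rng.normal(size=(1024,)).astype(np.float32)
q, s = blockwise_quantize(w)
err = np.abs(dequantize(q, s).ravel() - w).max()  # worst-case rounding error
```

Because each block's scale tracks only that block's largest value, the worst-case rounding error stays at half a quantization step per block rather than per tensor.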
THE ASCENDANCE OF MIXTURE-OF-EXPERTS (MOE) ARCHITECTURES
Mixture-of-Experts (MoE) models, exemplified by DeepSeek V3, are becoming increasingly relevant. While some large labs have focused on dense models, the success of MoE architectures in open models suggests a promising direction. Internal use of MoE by companies like Baidu and ByteDance further validates this trend. The inherent complexities of MoE training and inference optimization are driving innovation in the field, with a growing emphasis on efficient MoE inference.
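A minimal sketch of the routing idea behind MoE layers (toy NumPy code, not DeepSeek V3's actual router, which adds shared experts and load-balancing terms): each token activates only its top-k experts, which is why total parameter count can grow far faster than per-token compute.

```python
import numpy as np

def moe_forward(x, gate_w, experts, top_k=2):
    """Minimal top-k MoE layer: route each token to its k highest-scoring
    experts and combine their outputs, weighted by the renormalized router
    probabilities. Only k of the experts run per token."""
    logits = x @ gate_w                          # (tokens, n_experts)
    topk = np.argsort(logits, axis=-1)[:, -top_k:]
    out = np.zeros_like(x)
    for t in range(x.shape[0]):
        scores = logits[t, topk[t]]
        weights = np.exp(scores - scores.max())
        weights /= weights.sum()                 # softmax over selected experts only
        for w, e in zip(weights, topk[t]):
            out[t] += w * experts[e](x[t])
    return out

# Toy usage: 8 experts, each a simple linear map over 16-dim tokens.
rng = np.random.default_rng(0)
d, n_experts = 16, 8
expert_mats = [rng.normal(size=(d, d)) / np.sqrt(d) for _ in range(n_experts)]
experts = [lambda v, M=M: v @ M for M in expert_mats]
gate_w = rng.normal(size=(d, n_experts))
y = moe_forward(rng.normal(size=(4, d)), gate_w, experts)
```

With top_k=2 of 8 experts, per-token compute is roughly a quarter of a dense layer holding the same parameters; the inference-side complexity the episode discusses comes from batching tokens that route to different experts efficiently.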
SGLANG: A NEW FRONTIER IN LLM INFERENCE ENGINES
SGLang is presented as a next-generation LLM inference engine designed for high performance and improved developer experience. It excels in scenarios involving large batch sizes and extended context lengths, offering significant throughput advantages over existing solutions like vLLM and TensorRT-LLM. SGLang's unique features include optimizations for specific models like DeepSeek V3, support for advanced techniques like blockwise FP8 kernels, and a sophisticated caching mechanism (Radix Cache) for enhanced KV cache utilization critical for large models.
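The caching idea can be illustrated with a toy token-level prefix tree (a sketch of the concept, not SGLang's implementation): with a block size of one, any shared token prefix between requests is reusable, not just whole fixed-size blocks.

```python
class PrefixCache:
    """Toy token-level prefix cache (block size 1), in the spirit of
    SGLang's Radix Cache: cached sequences form a tree keyed by tokens,
    and a new request reuses its longest matching prefix."""

    def __init__(self):
        self.root = {}

    def insert(self, tokens):
        node = self.root
        for t in tokens:
            node = node.setdefault(t, {})

    def match_len(self, tokens):
        """Length of the longest cached prefix of `tokens`."""
        node, n = self.root, 0
        for t in tokens:
            if t not in node:
                break
            node, n = node[t], n + 1
        return n

cache = PrefixCache()
cache.insert([1, 2, 3, 4, 5, 6, 7])        # first request's tokens
hit = cache.match_len([1, 2, 3, 4, 9, 9])  # second request shares 4 tokens

# With a block size of 32, a 4-token overlap reuses nothing (4 // 32 == 0);
# with block size 1, all 4 shared tokens are reused.
```

For workloads with many requests sharing long system prompts or few-shot prefixes, this finer granularity is what drives the higher cache hit rates discussed in the episode.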
KEY TECHNIQUES POWERING SGLANG'S PERFORMANCE
SGLang employs several innovative techniques. Radix Cache, a prefix caching method that uses a block size of one, significantly improves cache hit rates compared to the larger block sizes used by other frameworks. Constrained decoding and 'jump forward' make generation of structured outputs more efficient by pre-determining forced output tokens and skipping unnecessary decoding steps. While powerful, these advanced features can be complex to maintain and to combine with other optimizations, as seen in jump forward being disabled by default.
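A toy sketch of the jump-forward idea (hypothetical FSM and token granularity, purely illustrative): whenever the grammar permits exactly one continuation, the engine emits it directly instead of paying for a model forward pass.

```python
def jump_forward_decode(fsm, sample_token):
    """Sketch of constrained decoding with jump-forward: at each FSM state,
    if the grammar allows exactly one next token, emit it for free; only
    genuine choices require a (costly) model call."""
    out, calls, state = [], 0, fsm["start"]
    while state != fsm["end"]:
        allowed = fsm["edges"][state]          # {token: next_state}
        if len(allowed) == 1:                  # forced token: jump forward
            tok, state = next(iter(allowed.items()))
        else:                                  # real choice: run the model
            tok = sample_token(state, list(allowed))
            state = allowed[tok]
            calls += 1
        out.append(tok)
    return out, calls

# Hypothetical grammar for `{"name": "<A|B>"}` -- the fixed JSON scaffolding
# is forced, so only the value needs a model call.
fsm = {
    "start": 0, "end": 4,
    "edges": {
        0: {'{"name": "': 1},
        1: {"A": 2, "B": 2},
        2: {'"': 3},
        3: {"}": 4},
    },
}
tokens, model_calls = jump_forward_decode(fsm, lambda state, opts: opts[0])
# model_calls == 1: only the value token required a forward pass.
```

The maintenance cost mentioned above comes from making this token-skipping interact correctly with features like the KV cache and batching, which helps explain why jump forward ships disabled by default.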
SPECULATIVE DECODING AND THE EVOLUTION OF STRUCTURED OUTPUTS
Speculative decoding, a technique that uses a smaller 'draft' model to predict tokens, is a key area of development for improving inference speed. Frameworks like SGLang support it, but its effectiveness relies heavily on the quality of the draft model training. Similarly, generating structured outputs, whether JSON or through defined grammars like X-grammar, is becoming more robust. X-grammar, in particular, is highlighted for its performance and integration into engines like TensorRT-LLM, offering accuracy and speed benefits.
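The draft-and-verify loop can be sketched as follows (a greedy toy version; real implementations verify all k proposals in one batched target pass and use probabilistic acceptance to preserve the target model's distribution exactly):

```python
def speculative_step(target_next, draft_next, context, k=4):
    """One round of greedy speculative decoding: the cheap draft model
    proposes k tokens; the target model checks them, keeping the agreed
    prefix and substituting its own token at the first disagreement."""
    proposal, ctx = [], list(context)
    for _ in range(k):                  # cheap draft pass, token by token
        t = draft_next(ctx)
        proposal.append(t)
        ctx.append(t)

    accepted, ctx = [], list(context)
    for t in proposal:
        correct = target_next(ctx)      # in practice: one batched verify pass
        if t == correct:
            accepted.append(t)
            ctx.append(t)
        else:
            accepted.append(correct)    # target's token replaces the miss
            break
    return accepted

# Toy models over integer "tokens": the target counts up; the draft agrees
# except on every 4th position.
target = lambda ctx: len(ctx)
draft = lambda ctx: len(ctx) + (1 if len(ctx) % 4 == 3 else 0)
out = speculative_step(target, draft, context=[0, 1], k=4)
# out == [2, 3]: one draft token accepted plus the target's correction.
```

The speed-up hinges entirely on the draft's agreement rate with the target, which is why the episode stresses that draft-model training quality determines whether speculative decoding pays off.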
ROLES OF FRAMEWORKS IN MISSION-CRITICAL INFERENCE
The discussion emphasizes that inference frameworks are only one component of mission-critical inference. Three essential pillars are identified: model-level performance (enhanced by frameworks like SGLang, Triton, and vLLM), horizontal scaling of models across regions and clouds to meet demand and avoid resource starvation, and enablement of complex, low-latency, multi-step workflows. Frameworks contribute primarily to the first pillar but must be integrated within a robust infrastructure for true production readiness.
TRAINING TRENDS: RL AND CUSTOMIZATION VS. GENERAL REASONING
The conversation touches upon the role of Reinforcement Learning (RL) and human-in-the-loop training for specialized tasks, particularly in domains like healthcare. While advanced reasoning capabilities in future models might reduce the need for extensive fine-tuning, traditional fine-tuning and RL methods remain crucial for many current use cases. The cost-effectiveness and long-term viability of purely reasoning-based models versus specialized, fine-tuned models remain open questions.
THE DEVELOPER EXPERIENCE AND CHOICE IN INFERENCE RUNTIMES
Baseten embraces a multi-framework approach, supporting TensorRT-LLM, vLLM, and SGLang based on customer requirements. While TensorRT-LLM is known for raw performance, SGLang offers a compelling combination of speed and usability, making it a recommended choice for models like DeepSeek V3. Developers increasingly value flexibility, allowing them to pick the best-suited framework for specific inference workloads rather than being locked into a single solution.
THE FUTURE OF INFERENCE: SCALABILITY AND WORKFLOW ENABLERS
Looking ahead, the focus remains on enabling scalable, reliable, and low-latency inference. This involves not only optimizing individual models but also building sophisticated infrastructure to handle massive horizontal scaling across multiple clouds and regions. Furthermore, tools that simplify the orchestration of complex, multi-step, multi-modal inference workflows are critical for enabling advanced AI applications, ensuring that performance gains in model architecture translate into real-world capabilities.
Common Questions
What makes DeepSeek V3 notable?
DeepSeek V3 is notable for being a 671 billion parameter Mixture-of-Experts (MoE) model, currently considered a leading open-source LLM based on benchmarks. Its large size and FP8 precision make it a game-changer for the open-source AI community.
Mentioned in this video
A commercial model from which users are migrating to open-source alternatives like DeepSeek V3 hosted on Baseten, often due to rate limiting, high prices, latency requirements, or a desire for model control.
Baseten's open-source library for model packaging and deployment, designed to make serving models easier, with deep integration for frameworks like Triton and SGLang.
A tool used for generating structured outputs from LLMs by converting schemas to finite state machines, though XGrammar is seen as a higher-performing alternative.
A 671 billion parameter Mixture-of-Experts (MoE) model, considered a leading open-source LLM based on benchmark and chat arena results, notable for its large size and FP8 precision.
A feature within Truss designed for building multi-step, multi-model inference workloads with low latency, enabling models to stream data to each other.
A larger LLaMA model (405 billion parameters) that has seen limited adoption, with users finding the performance gains at inference not worth the cost compared to the 70B version.
A popular LLM inference engine, often compared to SGLang and Triton, known for its performance but sometimes criticized for potential code messiness and difficulty in extending.
A library for training transformer-based language models, including Reinforcement Learning (RL) trainers, which Baseten supports.
A common model size hosted on Baseten, often considered a sweet spot for performance and cost, though larger models like DeepSeek V3 are also being supported.
An inference serving software developed by NVIDIA, which Truss has had to extend and sometimes replace with custom versions for specific performance and reliability needs.
A tool from MLC that converts schemas (like JSON) into finite state machines for use in constrained decoding, preferred by the speaker over Outline for its performance.
A code editor that the speaker uses daily, which integrated support for DeepSeek V3 after Base10 released their support.
Google Cloud Platform, another public cloud Baseten utilizes and where customers might have committed resources for BYOC deployments.
A model mentioned by Jeff Dean which is believed to utilize a Mixture-of-Experts (MoE) architecture, previously not publicly known.
A framework for Reinforcement Learning from Human Feedback (RLHF), supported by Baseten.
An inference engine and framework for large language models, developed for high performance and better usability compared to other frameworks, and optimized for specific models and for techniques like constrained decoding and radix attention.
Amazon Web Services, one of the public clouds Baseten runs on, and a cloud provider where customers may have committed resources for BYOC (Bring Your Own Cloud) deployments.
A precision format (8-bit floating point) used for DeepSeek V3 weights, requiring specific kernel support for inference, and a trend in native quantization during training.
A prefix caching technology supported by SGLang, which uses a block size of 1 for potentially higher cache hit rates compared to frameworks using larger block sizes like 32.
A technique supported by SGLang that uses finite state machines (derived from schemas like JSON via tools like XGrammar) to control output, enabling faster decoding by skipping unnecessary tokens.
A default precision format (Bfloat16) often used for training LLMs, contrasted with FP8 which requires specific kernel implementation for inference.
The Health Insurance Portability and Accountability Act, relevant for compliance in certain inference workloads discussed by Baseten.
An architectural trend where models consist of multiple 'expert' sub-networks, used in models like DeepSeek V3, offering potential benefits in efficiency and performance, with ongoing optimization efforts.
A technique for speeding up inference by using a draft model to predict tokens, which are then verified by the larger target model. Support for variations exists in SGLang and other frameworks.
A methodology for building software-as-a-service applications, suggested as a model for a manifesto on mission-critical inference workloads.
A company whose use case of AI phone calls requires multi-step inference (transcription, LLM calls, text-to-speech), facilitated by Truss Chains for low latency.
An organization where the creators of SGLang are members of technical staff, indicating a connection and potential influence.
An LLM inference platform that hosts and serves large models like DeepSeek V3, offering dedicated infrastructure and a consumption-based pricing model, rather than shared endpoints.