The Engineering Unlocks Behind DeepSeek | YC Decoded
Key Moments
DeepSeek's open-source R1 model matches competitor performance at a fraction of the cost, achieved through aggressive GPU optimization and novel reasoning techniques, but recent rapid advancements mean it's already being surpassed.
Key Insights
DeepSeek V3, a base model released in December 2024, uses an 8-bit floating-point format for training, resulting in massive memory savings without sacrificing performance.
DeepSeek V3 employs a mixture of experts (MoE) architecture with 671 billion parameters but only activates 37 billion per token, drastically saving computation compared to dense models like Llama 3.
DeepSeek's R1 model leverages Group Relative Policy Optimization (GRPO) to develop reasoning capabilities through pure reinforcement learning on verifiable problems, without external human or AI examples.
The reported $5.5 million training cost for DeepSeek V3 only covers the final training run and excludes R1's development, R&D, and hardware operating expenses.
A UC Berkeley lab replicated DeepSeek's key techniques on a smaller model for just $30, demonstrating reproducibility and high cost-efficiency.
Just two weeks after R1's release, OpenAI launched o3-mini, which outperforms both R1 and OpenAI's own o1 on key benchmarks, highlighting the rapid pace of AI innovation.
DeepSeek distinguishes between its general-purpose and reasoning models
The recent attention on DeepSeek stems from its R1 model, an open-source reasoning model claimed to rival OpenAI's o1 in performance at a significantly lower cost. It's crucial to differentiate R1 from DeepSeek V3, a general-purpose base model released in December 2024 that shows comparable performance to models like GPT-4o, Claude 3.5 Sonnet, and Gemini 1.5. R1, released in late January 2025, is built upon V3, with additional algorithmic enhancements specifically optimizing its reasoning abilities. These advancements, while leading to R1's impressive results on complex reasoning benchmarks, were often previewed in earlier DeepSeek publications, including the V3 paper from December and the V2 paper from May 2024, indicating a consistent research and development trajectory.
Training efficiency achieved through 8-bit floating point and error correction
A significant factor enabling DeepSeek's efficiency is its native training of V3 in an 8-bit floating-point (fp8) format, deviating from the more common 16-bit or 32-bit formats. While not a novel concept, this approach yields substantial memory savings without compromising performance. A critical enhancement supporting this is their fp8 accumulation fix, which periodically merges calculations back into a higher precision fp32 accumulator. This prevents the compounding of small numerical errors, ensuring greater accuracy and efficiency across thousands of GPUs, thereby cutting training costs while maintaining model quality. This optimization is particularly vital given hardware constraints and US export controls limiting GPU access in China, forcing DeepSeek to maximize compute and bandwidth from their existing hardware.
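To see why the accumulation fix matters, consider the toy sketch below. It is plain Python, not DeepSeek's actual kernels: "fp8" is mimicked by keeping only 3 significand bits (as in the E4M3 format), and the flush interval of 4 is an arbitrary illustrative choice. Summing many small values with every addition rounded to low precision stalls once the running total dwarfs each addend, whereas periodically flushing partial sums into a high-precision accumulator stays close to the true result.

```python
import math

def quantize(x, sig_bits=3):
    """Crudely mimic a low-precision float by keeping only `sig_bits`
    significand bits (fp8 E4M3 carries 3 mantissa bits)."""
    if x == 0:
        return 0.0
    e = math.floor(math.log2(abs(x)))
    step = 2.0 ** (e - sig_bits)
    return round(x / step) * step

def naive_sum(values):
    # Every addition is rounded back to "fp8": once the running total is
    # large enough, small addends are rounded away and the sum stalls.
    acc = 0.0
    for v in values:
        acc = quantize(acc + v)
    return acc

def sum_with_accumulation_fix(values, interval=4):
    # Keep a low-precision partial sum, but periodically flush it into a
    # high-precision (fp32-like) accumulator so rounding errors cannot compound.
    hi_acc, lo_acc = 0.0, 0.0
    for i, v in enumerate(values, start=1):
        lo_acc = quantize(lo_acc + v)
        if i % interval == 0:
            hi_acc += lo_acc
            lo_acc = 0.0
    return hi_acc + lo_acc

values = [0.01] * 1000                    # exact sum: 10.0
print(naive_sum(values))                  # stalls far below 10
print(sum_with_accumulation_fix(values))  # stays close to 10
```

The same principle, applied inside matrix-multiply kernels across thousands of GPUs, is what keeps fp8 training numerically stable.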
Maximizing GPU utilization and compute power
DeepSeek's innovations are geared towards overcoming GPU idleness, a common bottleneck where GPUs spend significant time waiting for data. Model flops utilization (MFU), the fraction of peak compute a training run actually uses, often hovers around 35% even at leading labs, meaning GPUs sit substantially underutilized. NVIDIA's strength lies in its integrated ecosystem of networking, software (CUDA), and developer tools, essentially creating a 'giant GPU' experience. DeepSeek counters this by optimizing for efficient distributed systems. Their strategy involves enhancing bandwidth and compute operations to keep GPUs busy, a key challenge for AI labs operating with limited hardware resources. This focus on efficient hardware utilization is central to their ability to train large, performant models cost-effectively.
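MFU is straightforward to estimate once you fix a flops-per-token rule of thumb. The sketch below uses the common approximation of ~6 flops per (active) parameter per token for a forward-plus-backward pass; the throughput and peak-flops numbers plugged in are illustrative assumptions, not DeepSeek's reported figures.

```python
def mfu(tokens_per_sec, active_params, peak_flops_per_sec):
    # Rule of thumb: a forward+backward pass costs ~6 flops per
    # (active) parameter per token.
    achieved_flops_per_sec = 6 * active_params * tokens_per_sec
    return achieved_flops_per_sec / peak_flops_per_sec

# Illustrative numbers only: 37B active parameters, 3,000 tokens/s
# per GPU, and ~2e15 fp8 flops/s of peak compute per GPU.
print(f"MFU = {mfu(3_000, 37e9, 2e15):.0%}")
```

With these made-up inputs the formula lands near the ~35% figure quoted above, which is the sense in which most of a GPU's rated compute goes unused.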
Mixture of Experts architecture and multi-head attention for scalability
DeepSeek V3 employs a Mixture of Experts (MoE) architecture, a design choice that significantly enhances computational efficiency. While V3 has 671 billion total parameters, only 37 billion are activated for any given token prediction. This contrasts sharply with dense models like Llama 3, which activates all 405 billion of its parameters for each token. V3 thus activates roughly 11 times fewer parameters per forward pass, yielding substantial savings in computation. Although MoE is not new, efficiently training such models has been challenging, and DeepSeek implemented novel techniques to stabilize performance and boost GPU utilization. Furthermore, to address KV cache storage limitations, a major bottleneck in large models, V3 utilizes Multi-Head Latent Attention (MLA). MLA compresses key and value matrices into a latent representation, reconstructing them only when necessary. This reduced the KV cache size by 93.3% in the V2 model and boosted generation throughput by 5.76 times.
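The core MoE idea can be shown with a toy top-k router: a gating function scores all experts for each token, but only the top few actually execute, so most parameters stay idle on any forward pass. Everything here (the expert count, the pseudo-random gating scores, the toy expert functions) is a hypothetical stand-in, not V3's actual architecture or router.

```python
import random

NUM_EXPERTS = 16   # toy value; V3 routes across many more experts
TOP_K = 2          # experts activated per token (toy value)

def router_scores(token_id):
    # Stand-in for a learned gating network: deterministic pseudo-scores.
    rng = random.Random(token_id)
    return [rng.random() for _ in range(NUM_EXPERTS)]

def moe_forward(token_id):
    scores = router_scores(token_id)
    # Only the TOP_K highest-scoring experts execute for this token.
    active = sorted(range(NUM_EXPERTS), key=lambda i: scores[i],
                    reverse=True)[:TOP_K]
    # Each "expert" here is a toy function standing in for a full FFN block.
    output = sum(scores[i] * (token_id + i) for i in active)
    return output, active

out, active = moe_forward(42)
print(f"active experts: {sorted(active)} -> {TOP_K}/{NUM_EXPERTS} "
      f"({TOP_K / NUM_EXPERTS:.0%}) of experts run for this token")
```

Scaling the same ratio up is what lets V3 carry 671B parameters while spending compute on only 37B per token.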
Multi-token prediction enhances training and inference
Another key innovation in DeepSeek V3 is its use of Multi-Token Prediction (MTP). Unlike traditional models that predict only the next token, MTP allows V3 to anticipate multiple future tokens at each step. This densifies training signals, providing more feedback per step for improved data efficiency and faster learning. It also enhances representation planning, enabling the model to pre-plan sequences for smoother, more coherent outputs during inference. MTP modules can be repurposed for speculative decoding, significantly speeding up generation by reducing sequential processing steps. These combined optimizations make V3 a highly capable base model.
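A minimal sketch of what MTP-style supervision looks like, assuming a toy character sequence: each position predicts several future tokens instead of only the next one, which is the "denser training signal" described above. This ignores how DeepSeek's MTP modules are actually wired into the network.

```python
def next_token_targets(tokens):
    # Standard next-token LM: one supervision target per position.
    return [(tokens[i], [tokens[i + 1]]) for i in range(len(tokens) - 1)]

def mtp_targets(tokens, depth=2):
    # MTP-style targets: each position also supervises `depth` future
    # tokens, providing more feedback per training step.
    return [(tokens[i], tokens[i + 1 : i + 1 + depth])
            for i in range(len(tokens) - depth)]

seq = list("deepseek")
print(next_token_targets(seq)[0])    # ('d', ['e'])
print(mtp_targets(seq, depth=2)[0])  # ('d', ['e', 'e'])
```

Because each position already carries guesses about tokens further ahead, the same machinery can serve as a draft model for speculative decoding at inference time.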
Reinforcement Learning for reasoning capabilities in R1
The release of DeepSeek R1, a reasoning model, generated significant excitement. While most Large Language Models (LLMs) benefit from prompting techniques like 'Chain of Thought,' reasoning models are specifically trained to deconstruct complex problems into sequential steps. OpenAI showcased this with their o1 model, and DeepSeek followed suit with R1. Both achieve impressive results using reinforcement learning (RL). Reasoning models specifically apply RL to enhance step-by-step thinking. DeepSeek implemented this by assembling problems with verifiable outputs, particularly in math and coding, and designing a pipeline for the model to work through them. Crucially, they did not provide external examples of how to think. Instead, they used simple rules to evaluate the accuracy and formatting of the final output. Through Group Relative Policy Optimization (GRPO), a technique DeepSeek published in February 2024, they observed reasoning emerge over thousands of RL steps, including skills like extended Chain of Thought and self-correction.
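The group-relative advantage at the heart of GRPO can be sketched in a few lines: sample a group of outputs for one prompt, score each with a rule-based reward, and normalize each reward against the group's own mean and standard deviation rather than a learned value/critic network. The reward values below are made up, and this shows only the advantage computation, not the full policy-gradient update.

```python
import statistics

def group_relative_advantages(rewards):
    """Normalize each sampled output's reward against its group's mean
    and standard deviation -- GRPO's substitute for a learned critic."""
    mean = statistics.mean(rewards)
    std = statistics.pstdev(rewards)
    if std == 0:
        # All outputs scored the same: no signal about which was better.
        return [0.0] * len(rewards)
    return [(r - mean) / std for r in rewards]

# e.g. four sampled answers to one math problem, scored by a simple
# rule-based checker: 1.0 if the final answer is correct, else 0.0.
rewards = [1.0, 0.0, 1.0, 0.0]
print(group_relative_advantages(rewards))  # [1.0, -1.0, 1.0, -1.0]
```

Correct answers get positive advantages and incorrect ones negative, so the policy is pushed toward whatever reasoning produced the group's better outputs, with no critic model to train or store.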
R1's early limitations and cost realities
While R1 demonstrated remarkable reasoning abilities through pure RL, its initial outputs suffered from poor readability, with random switching between English and Chinese. To address this, DeepSeek introduced a cold-start phase, fine-tuning on structured reasoning examples before RL. This eliminated language mixing and improved comprehensibility. Although R1 achieved performance comparable to OpenAI's o1 on certain benchmarks, the pace of innovation is extreme. Just two weeks later, OpenAI released o3-mini, which surpassed both R1 and o1. The excitement around R1 also involved misconceptions about its cost. The reported $5.5 million training cost for V3 was specifically for the final run and did not include R1's development, related R&D, or ongoing operational expenses, which would run into the hundreds of millions. However, the efficiency optimizations make this final training cost plausible and demonstrate the possibility of reproducible, cost-effective AI development, as evidenced by a $30 replication at UC Berkeley.
Accessibility and the future of AI development
The hype cycle surrounding DeepSeek's R1 can be attributed to several factors: its accessibility, being freely available, downloadable, and customizable; its near state-of-the-art performance at a fraction of the cost of proprietary models; and the rapid pace of AI innovation, where even cutting-edge models are quickly surpassed. DeepSeek has proven that significant room exists for new players in the AI frontier, particularly in optimizing GPU workloads, improving inference software tooling, and developing AI-generated kernels. This downward pressure on the cost of intelligence is highly beneficial for both consumer and B2B AI applications, making it an opportune time for AI startups.
Common Questions
What is DeepSeek R1?
DeepSeek R1 is an open-source reasoning AI model from DeepSeek that claims performance comparable to OpenAI's o1 on certain benchmarks, at a significantly lower cost. It's built upon DeepSeek V3 and uses advanced techniques to optimize reasoning capabilities.
Mentioned in this video
A university lab that successfully applied DeepSeek's key techniques to produce complex reasoning in a smaller model at a significantly reduced cost.
An AI research organization whose closed-weight models are contrasted with DeepSeek's open approach. AlphaGo is mentioned as an example of DeepMind's past success with reinforcement learning.
A leading AI research company. Its models GPT-4o and o1 are mentioned as benchmarks for DeepSeek's performance. Their approach to model weights and technical reports is contrasted with DeepSeek's.
The company behind the Llama model, which is cited as a precedent for DeepSeek's strategy of publishing research and releasing model weights.
An AI company whose models are mentioned as benchmarks. Claude 3.5 Sonnet is cited to be comparable in performance to DeepSeek V3.
A company whose market capitalization dropped significantly following the announcement of DeepSeek R1. NVIDIA's integrated hardware and software solutions for AI training are discussed as a key advantage.
An earlier model from DeepSeek, published in May 2024, which introduced innovations like Multi-Head Latent Attention (MLA) and efficient training techniques.
DeepSeek's general-purpose base model released in December, comparable to models like GPT-4o and Gemini 1.5. It incorporates various efficiency optimizations and innovations.
NVIDIA's parallel computing platform and programming model, mentioned as part of NVIDIA's integrated hardware and software solution for AI training.
An open-source reasoning AI model developed by Chinese AI company DeepSeek, claiming comparable performance to OpenAI's o1 at a lower cost. It is built upon DeepSeek V3 with algorithmic improvements for reasoning abilities.
An AI model from OpenAI, achieving state-of-the-art results in math, coding, and science benchmarks. DeepSeek R1 is compared to its performance.
A model from Meta, compared to DeepSeek V3 for its parameter activation strategy. Llama 3 activates its full 405 billion parameters, while V3 uses a Mixture of Experts architecture.
An AI model from OpenAI, mentioned as a benchmark that DeepSeek V3 achieves comparable performance to.
An AI model from OpenAI mentioned as a performance benchmark for DeepSeek's V3 model and R1 reasoning model.
A Google AI model mentioned as a performance benchmark comparable to DeepSeek V3.
A Google AI model mentioned as a performance benchmark for DeepSeek R1 on complex reasoning tasks.
A model family from Meta, used as a comparison for DeepSeek's approach to open-sourcing research and model weights.
A subsequent AI model released by OpenAI shortly after DeepSeek R1, which outperforms both R1 and o1 on key benchmarks.
An AI model from Anthropic, mentioned as a performance benchmark comparable to DeepSeek V3.
A specific model or paper from DeepSeek focused on mathematical capabilities, released in February 2024, contributing to the innovations in V3.
A Go-playing AI developed by DeepMind, used as an example of Western research labs successfully employing reinforcement learning for complex tasks.