Key Moments

OpenAI vs. Deepseek vs. Qwen: Comparing Open Source LLM Architectures

Y Combinator | Science & Technology | 13 min video | Aug 29, 2025
TL;DR

Open-source LLMs like GPT-OSS, Qwen 3, and DeepSeek V3.1 achieve similar top-line benchmark performance despite using vastly different architectural and training techniques, highlighting the opaque nature of data engineering as a key differentiator.

Key Insights

1. GPT-OSS utilizes a Mixture of Experts (MoE) architecture with 120B/20B parameters, activating the top four experts per token for efficient inference.

2. Qwen 3 employs a multi-stage pre-training approach on 36 trillion tokens, including a dedicated long-context stage and a four-step post-training pipeline.

3. DeepSeek V3.1, an evolution of V3, uses Multi-head Latent Attention (MLA) to compress keys and values, claiming greater memory savings and better modeling performance than Grouped Query Attention (GQA) in long-context models.

4. The context-window extension strategies differ significantly: GPT-OSS integrates YaRN scaling during pre-training for a native 131,000-token window, DeepSeek trains in phases to 128,000 tokens, and Qwen uses inference-time scaling for its 128,000-token capability.

5. Reinforcement learning (RL) plays a crucial role in post-training for all major models, with surprisingly small data requirements, such as Qwen's use of only 4,000 query-verifier pairs.

6. Despite similar benchmark results and common use of core LLM components, the papers offer few first-principles justifications for why specific techniques are superior, suggesting dataset engineering is a significant, opaque competitive advantage.

OpenAI's GPT-OSS introduces a Mixture of Experts architecture with long context capabilities

OpenAI's recent release of GPT-OSS marks its first open-weights model since GPT-2. Architecturally, it is a Mixture of Experts (MoE) model available in 120-billion and 20-billion parameter sizes. In MoE models, only a subset of parameters is activated when processing each token, enabling efficient inference; GPT-OSS activates the top four experts per token. It is a decoder-only transformer incorporating modern LLM features such as grouped query attention (GQA) for memory efficiency, SwiGLU activations, rotary positional embeddings (RoPE) for encoding positional information, and RMSNorm with pre-normalization for stable training. A standout feature is its 131,000-token context window, achieved by applying YaRN scaling directly during pre-training, so the model natively handles very long sequences. The training data is a text-only corpus in the trillions of tokens, focused on STEM, coding, and general knowledge, though specific details remain undisclosed. By default, GPT-OSS ships in a quantized format, making it deployable on consumer hardware; no unquantized version is provided. It has also undergone significant post-training for safety and alignment, and some in the community have experimented with removing these layers to explore its raw capabilities. This release positions GPT-OSS as a fully equipped, long-context model ready for immediate use in the open-source AI landscape.
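The "top four experts per token" routing described above can be sketched in a few lines. This is a minimal, illustrative NumPy version only: the names, dimensions, and random weights are hypothetical, real experts are full SwiGLU MLPs, and production routers add load-balancing losses and batched dispatch that are omitted here.

```python
import numpy as np

def moe_forward(x, router_w, expert_ws, top_k=4):
    """Route one token through the top-k experts of a toy MoE layer.

    x:         (d,)            token activation
    router_w:  (n_experts, d)  router (gating) weights
    expert_ws: (n_experts, d, d) one weight matrix per expert
    """
    logits = router_w @ x                          # score every expert
    top = np.argsort(logits)[-top_k:]              # indices of the top-k experts
    gates = np.exp(logits[top] - logits[top].max())
    gates /= gates.sum()                           # softmax over the selected experts only
    # Only the chosen experts' weights are touched; the rest stay inactive,
    # which is why active parameters per token are far below the total count.
    return sum(g * (expert_ws[i] @ x) for g, i in zip(gates, top))

rng = np.random.default_rng(0)
d, n_experts = 8, 16
x = rng.normal(size=d)
router_w = rng.normal(size=(n_experts, d))
expert_ws = rng.normal(size=(n_experts, d, d))
y = moe_forward(x, router_w, expert_ws)
print(y.shape)  # (8,)
```

Even with 16 experts defined, each token's forward pass multiplies against only 4 of the 16 expert matrices, which is the source of MoE's inference efficiency.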

Qwen 3 family offers diverse sizes and a sophisticated multi-stage training pipeline

Alibaba Cloud's Qwen 3 family, released in April 2025, presents a range of models including both dense and MoE architectures. The dense models vary in size from 0.6 billion to 32 billion parameters, while the MoE models come in two sizes, each featuring 128 experts with eight activated per token. Like GPT-OSS, the Qwen 3 dense models incorporate GQA, SwiGLU, RoPE, and RMSNorm. A key change in Qwen 3 is the use of QK norm, which rescales query and key vectors to keep attention scores stable at scale, replacing the static QKV bias used in previous versions. The models were pre-trained on 36 trillion tokens in three stages: a general stage with over 30 trillion tokens across 119 languages at a 4,096-token sequence length; a reasoning stage with about 5 trillion higher-quality tokens focused on STEM, reasoning, and coding; and a long-context stage where the context length was extended to 32,000 tokens using techniques such as ABF (adjusting RoPE's base frequency) and YaRN scaling. Post-training follows a four-step pipeline: a cold-start stage on challenging reasoning problems, a reasoning RL stage using GRPO on roughly 4,000 query-verifier pairs, a thinking-mode-fusion step to integrate reasoning and non-reasoning capabilities, and a general RL stage for broad instruction following and tool use. The family also uses strong-to-weak distillation to create smaller, capable versions. This complex pipeline results in impressive performance, especially given the models' sizes.
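The QK-norm idea mentioned above is easy to demonstrate: normalizing queries and keys before the dot product bounds the attention logits no matter how large the activations grow. This is a single-head toy sketch with made-up shapes, not Qwen's actual implementation (which applies a learned RMSNorm per attention head).

```python
import numpy as np

def rms_norm(v, eps=1e-6):
    """Rescale each vector to unit root-mean-square magnitude."""
    return v / np.sqrt(np.mean(v * v, axis=-1, keepdims=True) + eps)

def qk_norm_attention(q, k, v):
    """q, k, v: (seq, d_head). Normalizing q and k bounds every dot product
    by d_head, so attention logits stay in a stable range at any scale."""
    q, k = rms_norm(q), rms_norm(k)
    scores = (q @ k.T) / np.sqrt(q.shape[-1])
    scores -= scores.max(axis=-1, keepdims=True)   # numerically stable softmax
    weights = np.exp(scores)
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ v

rng = np.random.default_rng(1)
q = rng.normal(size=(5, 16)) * 1000   # deliberately huge activations
k = rng.normal(size=(5, 16)) * 1000
v = rng.normal(size=(5, 16))
out = qk_norm_attention(q, k, v)
print(out.shape)  # (5, 16)
```

Without the normalization, the 1000x-scaled q and k would produce logits large enough to saturate the softmax; with it, the attention distribution stays well-behaved, which is the stability property the Qwen 3 paper is after.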

DeepSeek V3.1 builds upon V3 with enhanced long-context and hybrid inference capabilities

Released in December 2024, DeepSeek's V3 model was a significant open-source LLM. It is a massive 671-billion-parameter MoE model, designed for both efficiency and capability, and it paved the way for the reasoning-focused R1 model. Key architectural and training advantages of V3 include its native 8-bit training, which drastically cuts costs. The recently released V3.1 extends the original V3 checkpoint with a two-phase long-context training approach and a hybrid thinking mode that allows switching between reasoning-heavy and lightweight inference. It also boasts improved tool use and stronger reasoning through advanced post-training. A notable technical difference in V3 compared to GPT-OSS and Qwen 3 is its use of MLA (Multi-head Latent Attention). MLA compresses keys and values into a smaller latent space before caching, then decompresses them during inference. The DeepSeek V2 paper indicated that MLA offers greater memory savings and better modeling performance than GQA, particularly for very long contexts. This focus on optimizing the KV cache and attention mechanism is a key differentiator for DeepSeek.
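The compress-then-decompress mechanics of MLA can be sketched for a single token and a single head. All shapes here are toy assumptions chosen to make the cache saving visible; the real MLA uses per-head up-projections and a decoupled RoPE path that this sketch ignores.

```python
import numpy as np

d_model, d_latent, d_head = 64, 8, 16
rng = np.random.default_rng(2)
W_down = rng.normal(size=(d_latent, d_model)) / np.sqrt(d_model)  # shared down-projection
W_uk = rng.normal(size=(d_head, d_latent))    # up-projection back to keys
W_uv = rng.normal(size=(d_head, d_latent))    # up-projection back to values

h = rng.normal(size=d_model)   # hidden state for one token
# During generation, only the small latent vector is cached per token...
c = W_down @ h                 # shape (d_latent,) -- this is all that goes in the KV cache
# ...and full-size keys/values are reconstructed from it when attending.
k, v = W_uk @ c, W_uv @ c

full_kv = 2 * d_head   # floats cached per token with plain per-head KV caching
mla_kv = d_latent      # floats cached per token with MLA
print(f"cache per token: {mla_kv} floats (MLA) vs {full_kv} floats (plain KV)")
```

Because the cache stores `c` rather than `k` and `v`, memory per token scales with the latent width instead of with heads times head dimension, which is why the savings compound at very long context lengths.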

Divergent approaches to extending context length

The strategies for extending context length vary significantly among these models. GPT-OSS is designed for long context natively, using YaRN scaling from pre-training to achieve its 131,000-token window. DeepSeek adopts a staged approach, fine-tuning its models first to 32,000 tokens and then further to 128,000 tokens. Qwen, on the other hand, fine-tunes to 32,000 tokens but then applies YaRN scaling at inference time to push the context to 128,000 tokens without additional retraining. In short, GPT-OSS is built for long context from the start, DeepSeek is trained into it incrementally, and Qwen leverages inference-time optimizations on a shorter-context base model. These different methods have implications for how the models handle and reason over extended sequences.
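The core idea behind Qwen's inference-time extension can be illustrated with RoPE frequency scaling. This sketch shows plain position interpolation, a simplified relative of YaRN (YaRN additionally blends the scaling per frequency band and adjusts attention temperature); the numbers are the 32k-to-128k extension described above.

```python
import numpy as np

def rope_angles(pos, d_head, base=10000.0, scale=1.0):
    """Rotation angles RoPE applies at position `pos`.

    With scale > 1 the rotations are stretched, so positions beyond the
    trained window map back into the angle range seen during training.
    """
    inv_freq = 1.0 / (base ** (np.arange(0, d_head, 2) / d_head))
    return pos * inv_freq / scale

trained_max, target_max = 32_000, 128_000
scale = target_max / trained_max            # 4x context extension

# With scaling, position 128,000 produces exactly the angles that
# position 32,000 produced during training -- no retraining required.
a_scaled = rope_angles(target_max, 64, scale=scale)
a_native = rope_angles(trained_max, 64)
print(np.allclose(a_scaled, a_native))  # True
```

The trade-off is resolution: four positions now share the angle range one position had before, which is why inference-time scaling can degrade fine-grained positional discrimination compared to training at the full length, as GPT-OSS and DeepSeek do.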

The empirical nature of LLM development and the role of reinforcement learning

A striking observation across these advanced LLM papers is their empirical nature. Labs describe combinations of tools and techniques that work well for them, but often lack first-principles justifications for why one method is inherently superior to another (e.g., why MLA is definitively better than GQA). This contrasts with fields like theoretical physics, which rely on deriving results from fundamental axioms. The papers also reveal a surprising similarity in top-line benchmark statistics and the use of common LLM components (attention, activations, embeddings) despite vastly different training methods. Furthermore, reinforcement learning is heavily utilized in post-training for reasoning and alignment across all major models. It's particularly noteworthy how some of these RL efforts achieve strong results with minimal data, such as Qwen's use of just 4,000 data pairs.

The opaque advantage of data set engineering

Despite the public release of model architectures and training details, a significant aspect of these LLMs' success appears to lie in the opacity of their data set engineering. The vast amount of behind-the-scenes work involved in curating and preparing these datasets is likely a substantial part of the competitive moat that allows companies to confidently release their models. It is exceedingly difficult for external parties to replicate the precise data mixtures and quality standards that contribute to the models' advanced capabilities. This makes the data engineering, rather than just the publicly visible algorithmic choices, a critical, yet hard-to-quantify, differentiator.

Size and parameter activation comparisons

When comparing model sizes, Qwen 3 stands out for offering both dense and MoE variants across a wide range: dense models from 0.6B to 32B parameters, and MoE models at 30B and 235B total parameters. Notably, Qwen's MoE models achieve dense-model performance with significantly fewer active parameters. DeepSeek V3 is a large MoE model with 671 billion total parameters, activating 37 billion per token. GPT-OSS sits in the middle, with MoE models of 117 billion (5.1B active) and 21 billion (3.6B active) parameters. These differences in total and active parameters shape efficiency and scalability.
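The figures above imply very different sparsity ratios. A quick calculation from the stated parameter counts:

```python
# (total parameters, active parameters per token), from the figures above
models = {
    "GPT-OSS-120B": (117e9, 5.1e9),
    "GPT-OSS-20B":  (21e9, 3.6e9),
    "DeepSeek-V3":  (671e9, 37e9),
}
ratios = {name: active / total for name, (total, active) in models.items()}
for name, r in ratios.items():
    print(f"{name}: {r:.1%} of parameters active per token")
```

The large GPT-OSS and DeepSeek V3 both activate roughly 4-6% of their weights per token, while the small GPT-OSS is much denser at about 17%, a reminder that total parameter count alone says little about per-token compute cost.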

Comparison of Open Source LLM Sizes and Parameter Activation

Data extracted from this episode

Model Family | Variant Type       | Total Parameters | Activated Parameters per Token | Context Window
GPT-OSS      | Mixture of Experts | 117B / 21B       | 5.1B / 3.6B                    | 131,000 tokens
Qwen 3       | Dense              | 0.6B - 32B       | All                            | 32,000+ tokens
Qwen 3       | Mixture of Experts | 30B / 235B       | Subset                         | 32,000+ tokens
DeepSeek V3  | Mixture of Experts | 671B             | 37B                            | 128,000 tokens

Common Questions

What is GPT-OSS?
GPT-OSS is OpenAI's first open-weights model since GPT-2 in 2019. It's a Mixture of Experts model with 120-billion and 20-billion parameter options and features a large context window achieved through YaRN scaling during pre-training.
