Key Moments

OpenAI vs. Deepseek vs. Qwen: Comparing Open Source LLM Architectures

Y Combinator | Science & Technology | 13 min video | Aug 29, 2025
TL;DR

Open-source LLMs like GPT-OSS, Qwen 3, and DeepSeek V3.1 achieve similar top-line benchmark performance despite using vastly different architectural and training techniques, highlighting the opaque nature of data engineering as a key differentiator.

Key Insights

1. GPT-OSS utilizes a Mixture of Experts (MoE) architecture with 120B/20B parameters, activating the top four experts per token for efficient inference.

2. Qwen 3 employs a multi-stage pre-training approach on 36 trillion tokens, including a dedicated long-context stage and a four-step post-training pipeline.

3. DeepSeek V3.1, an evolution of V3, uses Multi-head Latent Attention (MLA) to compress keys and values, claiming greater memory savings and better modeling performance than Grouped Query Attention (GQA) in long-context models.

4. The context-window extension strategies differ significantly: GPT-OSS integrates YaRN scaling during pre-training for a native 131,000-token window, DeepSeek trains in phases to 128,000 tokens, and Qwen uses inference-time scaling for its 128,000-token capability.

5. Reinforcement learning (RL) plays a crucial role in post-training for all major models, with surprisingly small data requirements, such as Qwen's use of only 4,000 query-verifier pairs.

6. Despite similar benchmark results and common use of core LLM components, the papers offer few first-principles justifications for why specific techniques are superior, suggesting dataset engineering is a significant, opaque competitive advantage.

OpenAI's GPT-OSS introduces a Mixture of Experts architecture with long context capabilities

OpenAI's recent release of GPT-OSS marks its first open-weights model since GPT-2. Architecturally, it is a Mixture of Experts (MoE) model available in 120-billion and 20-billion parameter sizes. In MoE models, only a subset of parameters is activated when processing each token, enabling efficient inference; GPT-OSS activates the top four experts per token. It is a decoder-only transformer incorporating modern LLM features such as grouped query attention (GQA) for memory efficiency, SwiGLU activations, rotary positional embeddings (RoPE) for encoding positional information, and RMSNorm with pre-normalization for stable training. A standout feature is its 131,000-token context window, achieved by applying YaRN scaling directly during pre-training, so the model natively handles very long sequences. The training data is a text-only corpus in the trillions of tokens, focused on STEM, coding, and general knowledge, though specific details remain undisclosed. By default, GPT-OSS ships in a quantized format, making it deployable on consumer hardware; no unquantized version is provided. It has also undergone significant post-training for safety and alignment, and some in the community have experimented with removing these layers to explore its raw capabilities. This release positions GPT-OSS as a fully equipped, long-context model ready for immediate use in the open-source AI landscape.
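The "top four experts per token" routing described above can be sketched in a few lines. This is a minimal, illustrative NumPy version only: the names, dimensions, and random weights are hypothetical, real experts are full SwiGLU MLPs, and production routers add load-balancing losses and batched dispatch that are omitted here.

```python
import numpy as np

def moe_forward(x, router_w, expert_ws, top_k=4):
    """Route one token through the top-k experts of a toy MoE layer.

    x:         (d,)            token activation
    router_w:  (n_experts, d)  router (gating) weights
    expert_ws: (n_experts, d, d) one weight matrix per expert
    """
    logits = router_w @ x                          # score every expert
    top = np.argsort(logits)[-top_k:]              # indices of the top-k experts
    gates = np.exp(logits[top] - logits[top].max())
    gates /= gates.sum()                           # softmax over the selected experts only
    # Only the chosen experts' weights are touched; the rest stay inactive,
    # which is why active parameters per token are far below the total count.
    return sum(g * (expert_ws[i] @ x) for g, i in zip(gates, top))

rng = np.random.default_rng(0)
d, n_experts = 8, 16
x = rng.normal(size=d)
router_w = rng.normal(size=(n_experts, d))
expert_ws = rng.normal(size=(n_experts, d, d))
y = moe_forward(x, router_w, expert_ws)
print(y.shape)  # (8,)
```

Even with 16 experts defined, each token's forward pass multiplies against only 4 of the 16 expert matrices, which is the source of MoE's inference efficiency.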

Qwen 3 family offers diverse sizes and a sophisticated multi-stage training pipeline

Alibaba Cloud's Qwen 3 family, released in April 2025, presents a range of models including both dense and MoE architectures. The dense models vary in size from 0.6 billion to 32 billion parameters, while the MoE models come in two sizes, each featuring 128 experts with eight activated per token. Like GPT-OSS, the Qwen 3 dense models incorporate GQA, SwiGLU, RoPE, and RMSNorm. A key change in Qwen 3 is the use of QK norm, which rescales query and key vectors to keep attention scores stable at scale, replacing the static QKV bias used in previous versions. The models were pre-trained on 36 trillion tokens in three stages: a general stage with over 30 trillion tokens across 119 languages at a 4,096-token sequence length; a reasoning stage with about 5 trillion higher-quality tokens focused on STEM, reasoning, and coding; and a long-context stage where the context length was extended to 32,000 tokens using techniques such as ABF (adjusting RoPE's base frequency) and YaRN scaling. Post-training follows a four-step pipeline: a cold-start stage on challenging reasoning problems, a reasoning RL stage using GRPO on roughly 4,000 query-verifier pairs, a thinking-mode-fusion step to integrate reasoning and non-reasoning capabilities, and a general RL stage for broad instruction following and tool use. The family also uses strong-to-weak distillation to create smaller, capable versions. This complex pipeline results in impressive performance, especially given the models' sizes.
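The QK-norm idea mentioned above is easy to demonstrate: normalizing queries and keys before the dot product bounds the attention logits no matter how large the activations grow. This is a single-head toy sketch with made-up shapes, not Qwen's actual implementation (which applies a learned RMSNorm per attention head).

```python
import numpy as np

def rms_norm(v, eps=1e-6):
    """Rescale each vector to unit root-mean-square magnitude."""
    return v / np.sqrt(np.mean(v * v, axis=-1, keepdims=True) + eps)

def qk_norm_attention(q, k, v):
    """q, k, v: (seq, d_head). Normalizing q and k bounds every dot product
    by d_head, so attention logits stay in a stable range at any scale."""
    q, k = rms_norm(q), rms_norm(k)
    scores = (q @ k.T) / np.sqrt(q.shape[-1])
    scores -= scores.max(axis=-1, keepdims=True)   # numerically stable softmax
    weights = np.exp(scores)
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ v

rng = np.random.default_rng(1)
q = rng.normal(size=(5, 16)) * 1000   # deliberately huge activations
k = rng.normal(size=(5, 16)) * 1000
v = rng.normal(size=(5, 16))
out = qk_norm_attention(q, k, v)
print(out.shape)  # (5, 16)
```

Without the normalization, the 1000x-scaled q and k would produce logits large enough to saturate the softmax; with it, the attention distribution stays well-behaved, which is the stability property the Qwen 3 paper is after.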

DeepSeek V3.1 builds upon V3 with enhanced long-context and hybrid inference capabilities

Released in December 2024, DeepSeek's V3 model was a significant open-source LLM. It is a massive 671-billion-parameter MoE model, designed for both efficiency and capability, and it paved the way for the reasoning-focused R1 model. Key architectural and training advantages of V3 include its native 8-bit training, which drastically cuts costs. The recently released V3.1 extends the original V3 checkpoint with a two-phase long-context training approach and a hybrid thinking mode that allows switching between reasoning-heavy and lightweight inference. It also boasts improved tool use and stronger reasoning through advanced post-training. A notable technical difference in V3 compared to GPT-OSS and Qwen 3 is its use of MLA (Multi-head Latent Attention). MLA compresses keys and values into a smaller latent space before caching, then decompresses them during inference. The DeepSeek V2 paper indicated that MLA offers greater memory savings and better modeling performance than GQA, particularly for very long contexts. This focus on optimizing the KV cache and attention mechanism is a key differentiator for DeepSeek.
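The compress-then-decompress mechanics of MLA can be sketched for a single token and a single head. All shapes here are toy assumptions chosen to make the cache saving visible; the real MLA uses per-head up-projections and a decoupled RoPE path that this sketch ignores.

```python
import numpy as np

d_model, d_latent, d_head = 64, 8, 16
rng = np.random.default_rng(2)
W_down = rng.normal(size=(d_latent, d_model)) / np.sqrt(d_model)  # shared down-projection
W_uk = rng.normal(size=(d_head, d_latent))    # up-projection back to keys
W_uv = rng.normal(size=(d_head, d_latent))    # up-projection back to values

h = rng.normal(size=d_model)   # hidden state for one token
# During generation, only the small latent vector is cached per token...
c = W_down @ h                 # shape (d_latent,) -- this is all that goes in the KV cache
# ...and full-size keys/values are reconstructed from it when attending.
k, v = W_uk @ c, W_uv @ c

full_kv = 2 * d_head   # floats cached per token with plain per-head KV caching
mla_kv = d_latent      # floats cached per token with MLA
print(f"cache per token: {mla_kv} floats (MLA) vs {full_kv} floats (plain KV)")
```

Because the cache stores `c` rather than `k` and `v`, memory per token scales with the latent width instead of with heads times head dimension, which is why the savings compound at very long context lengths.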

Divergent approaches to extending context length

The strategies for extending context length vary significantly among these models. GPT-OSS is designed for long context natively, using YaRN scaling from pre-training to achieve its 131,000-token window. DeepSeek adopts a staged approach, fine-tuning its models first to 32,000 tokens and then further to 128,000 tokens. Qwen, on the other hand, fine-tunes to 32,000 tokens but then applies YaRN scaling at inference time to push the context to 128,000 tokens without additional retraining. In short, GPT-OSS is built for long context from the start, DeepSeek is trained into it incrementally, and Qwen leverages inference-time optimizations on a shorter-context base model. These different methods have implications for how the models handle and reason over extended sequences.
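The core idea behind Qwen's inference-time extension can be illustrated with RoPE frequency scaling. This sketch shows plain position interpolation, a simplified relative of YaRN (YaRN additionally blends the scaling per frequency band and adjusts attention temperature); the numbers are the 32k-to-128k extension described above.

```python
import numpy as np

def rope_angles(pos, d_head, base=10000.0, scale=1.0):
    """Rotation angles RoPE applies at position `pos`.

    With scale > 1 the rotations are stretched, so positions beyond the
    trained window map back into the angle range seen during training.
    """
    inv_freq = 1.0 / (base ** (np.arange(0, d_head, 2) / d_head))
    return pos * inv_freq / scale

trained_max, target_max = 32_000, 128_000
scale = target_max / trained_max            # 4x context extension

# With scaling, position 128,000 produces exactly the angles that
# position 32,000 produced during training -- no retraining required.
a_scaled = rope_angles(target_max, 64, scale=scale)
a_native = rope_angles(trained_max, 64)
print(np.allclose(a_scaled, a_native))  # True
```

The trade-off is resolution: four positions now share the angle range one position had before, which is why inference-time scaling can degrade fine-grained positional discrimination compared to training at the full length, as GPT-OSS and DeepSeek do.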

The empirical nature of LLM development and the role of reinforcement learning

A striking observation across these advanced LLM papers is their empirical nature. Labs describe combinations of tools and techniques that work well for them, but often lack first-principles justifications for why one method is inherently superior to another (e.g., why MLA is definitively better than GQA). This contrasts with fields like theoretical physics, which rely on deriving results from fundamental axioms. The papers also reveal a surprising similarity in top-line benchmark statistics and the use of common LLM components (attention, activations, embeddings) despite vastly different training methods. Furthermore, reinforcement learning is heavily utilized in post-training for reasoning and alignment across all major models. It's particularly noteworthy how some of these RL efforts achieve strong results with minimal data, such as Qwen's use of just 4,000 data pairs.

The opaque advantage of data set engineering

Despite the public release of model architectures and training details, a significant aspect of these LLMs' success appears to lie in the opacity of their data set engineering. The vast amount of behind-the-scenes work involved in curating and preparing these datasets is likely a substantial part of the competitive moat that allows companies to confidently release their models. It is exceedingly difficult for external parties to replicate the precise data mixtures and quality standards that contribute to the models' advanced capabilities. This makes the data engineering, rather than just the publicly visible algorithmic choices, a critical, yet hard-to-quantify, differentiator.

Size and parameter activation comparisons

When comparing model sizes, Qwen 3 stands out for offering both dense and MoE variants across a wide range: dense models from 0.6B to 32B parameters, and MoE models at 30B and 235B total parameters. Notably, Qwen's MoE models achieve dense-model performance with significantly fewer active parameters. DeepSeek V3 is a large MoE model with 671 billion total parameters, activating 37 billion per token. GPT-OSS sits in the middle, with MoE models of 117 billion (5.1B active) and 21 billion (3.6B active) parameters. These differences in total and active parameters shape efficiency and scalability.
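The figures above imply very different sparsity ratios. A quick calculation from the stated parameter counts:

```python
# (total parameters, active parameters per token), from the figures above
models = {
    "GPT-OSS-120B": (117e9, 5.1e9),
    "GPT-OSS-20B":  (21e9, 3.6e9),
    "DeepSeek-V3":  (671e9, 37e9),
}
ratios = {name: active / total for name, (total, active) in models.items()}
for name, r in ratios.items():
    print(f"{name}: {r:.1%} of parameters active per token")
```

The large GPT-OSS and DeepSeek V3 both activate roughly 4-6% of their weights per token, while the small GPT-OSS is much denser at about 17%, a reminder that total parameter count alone says little about per-token compute cost.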

Comparison of Open Source LLM Sizes and Parameter Activation

Data extracted from this episode

Model Family | Variant Type       | Total Parameters | Activated Parameters per Token | Context Window
GPT-OSS      | Mixture of Experts | 117B / 21B       | 5.1B / 3.6B                    | 131,000 tokens
Qwen 3       | Dense              | 0.6B - 32B       | All                            | 32,000+ tokens
Qwen 3       | Mixture of Experts | 30B / 235B       | Subset                         | 32,000+ tokens
DeepSeek V3  | Mixture of Experts | 671B             | 37B                            | 128,000 tokens

Common Questions

What is GPT-OSS?
GPT-OSS is OpenAI's first open-weights model since GPT-2 in 2019. It's a Mixture of Experts model with 120-billion and 20-billion parameter options and features a large context window achieved through YaRN scaling during pre-training.
