Key Moments

Stanford CS25: Transformers United V6 | Overview of Transformers

Stanford Online
Education · 6 min read · 77 min video
Apr 22, 2026 | 83 views
TL;DR

Transformers, the architecture behind modern AI, are now pervasive but face limitations; future research focuses on alternative architectures like SSMs and learning 'world models' over next-token prediction.

Key Insights

1

Transformers supplanted RNNs through self-attention, which processes all tokens in parallel and accesses the entire context directly, enabling GPU-efficient training and modeling of long-range dependencies.

2

Large language models (LLMs) are scaled-up transformers, pre-trained on vast internet text data with a next-token prediction objective, leading to emergent abilities like reasoning and few-shot learning.

3

Human children learn language from significantly less data (10-100 million words by age 13) than LLMs, with data quality, structure, and interaction richness being more crucial than quantity at small scales.

4

Retrieval-Augmented Generation (RAG) shows diminishing returns for larger models; small models benefit significantly more, requiring approximately 4 pre-training tokens per parameter before RAG becomes effectively usable.

5

Hallucination in AI is defined as a 'world modeling error,' occurring when a model's internal learned world model contradicts an external reference world model (e.g., source document, real world).

6

Emerging alternative architectures to transformers include State Space Models (SSMs), like Mamba, which offer linear time scaling and better efficiency for long sequences compared to the quadratic complexity of transformers.

Evolution of machine learning towards transformers

Early machine learning (pre-2012) relied on hand-engineered features fed into shallow models. This evolved into supervised deep learning, where models learned directly from raw data, bypassing manual feature extraction. The next significant shift was to self-supervised learning, where models learned by reconstructing corrupted data, enabling the acquisition of general representations applicable to various tasks. In Natural Language Processing (NLP), this progression moved from simple sentiment analysis to next-token prediction, leveraging vast amounts of raw text. Similarly, computer vision saw advancements with techniques like masked autoencoders, which involve masking parts of an image and training the model to reconstruct them. The development of word embeddings (e.g., Word2Vec, GloVe) bridged the gap between human language and machine computation, representing words as dense vectors. However, static embeddings struggled with polysemy (words with multiple meanings); contextual embeddings emerged to address this by considering word context. Recurrent Neural Networks (RNNs) then processed sequences step-by-step with a memory state, but struggled with long-range dependencies, a problem partially addressed by Long Short-Term Memory (LSTM) networks.
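The sequential bottleneck of RNNs described above can be sketched in a few lines. This is a toy scalar recurrence with made-up weights, purely illustrative and not code from the lecture: each step folds the current input into a single hidden state, so information from early tokens must survive many squashing updates to influence later ones.

```python
import math

def rnn_step(h, x, w_h=0.5, w_x=1.0):
    # One recurrence step: the new hidden state mixes the old state
    # with the current input through a tanh nonlinearity.
    # Weights w_h and w_x are arbitrary placeholders.
    return math.tanh(w_h * h + w_x * x)

h = 0.0
for x in [1.0, -0.5, 0.25]:  # process the sequence one token at a time
    h = rnn_step(h, x)
```

Because each `h` depends on the previous one, the loop cannot be parallelized across time steps, which is precisely the limitation transformers remove.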

The Transformer architecture and its advantages

Transformers revolutionized sequence modeling with the self-attention mechanism, which allows the model to weigh the importance of different tokens in a sequence. This is achieved through query, key, and value matrices that learn relationships between tokens. Positional encodings were introduced to retain information about token order, as self-attention itself is order-agnostic. Multi-headed attention further enhances representation by using multiple sets of attention matrices in parallel. Key advantages of transformers over RNNs include their inherent parallelism, enabling faster processing on GPUs, and their superior ability to handle long contexts, scaling to millions of tokens. Unlike RNNs, which are limited by the information stored in their hidden state, transformers can access all tokens in the context simultaneously, facilitating better understanding of long-range dependencies. This architecture has become foundational across various domains, including large language models (LLMs), computer vision, speech processing, biology, and robotics.
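The query/key/value mechanism above reduces to a short computation. The sketch below implements scaled dot-product attention over plain Python lists (a real model would use tensor libraries and learned projection matrices, which are omitted here): each query scores all keys, the scores become softmax weights, and the output is the weighted mix of values.

```python
import math

def softmax(xs):
    # Numerically stable softmax over a list of scores.
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    s = sum(exps)
    return [e / s for e in exps]

def attention(Q, K, V):
    # Scaled dot-product attention: for each query row, compute
    # similarity to every key, normalize to weights, and return
    # the weighted combination of value rows.
    d = len(K[0])
    out = []
    for q in Q:
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d)
                  for k in K]
        weights = softmax(scores)
        out.append([sum(w * v[j] for w, v in zip(weights, V))
                    for j in range(len(V[0]))])
    return out
```

Note that every query attends to every key, which is why the cost grows quadratically with sequence length, the complexity issue raised later in this summary.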

Pre-training: The role of data and scale

Pre-training is the initial phase of training a model on a large, diverse dataset to instill general knowledge and capabilities. For LLMs, this typically involves training on vast amounts of internet text with the objective of next-token prediction. The quality, structure, and strategic utilization of data are paramount, not just its sheer volume. Research on 'BabyLM' and 'BabySkill' highlights that human children learn from vastly less data but benefit from richer, more structured, and interactive language environments, suggesting data quality significantly impacts learning effectiveness, especially at smaller scales. Studies on bilingual and multilingual models demonstrate that adding a second language does not necessarily impair performance in the first, and the exposure structure (e.g., code-switching, sentence-level mixing) is surprisingly irrelevant, with data scale being more influential than model scale at certain sizes. The concept of curricula, where models are trained on progressively more complex data, is also explored as a way to improve learning efficiency. Overall, data-centric approaches, focusing on composition, quality, and structure, are crucial for effective language modeling.
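Next-token prediction, the objective named above, can be illustrated at its absolute simplest with a character-level bigram counter. This toy stands in for an LLM's learned distribution, it is not how transformers are trained, but it shows the same objective: given the current token, predict the most likely next one from observed data.

```python
from collections import Counter, defaultdict

def train_bigram(text):
    # Count which character follows which: the simplest possible
    # "next-token prediction" model, fit by counting.
    counts = defaultdict(Counter)
    for a, b in zip(text, text[1:]):
        counts[a][b] += 1
    return counts

def predict_next(counts, ch):
    # Greedy decoding: return the most frequent observed continuation.
    return counts[ch].most_common(1)[0][0]

model = train_bigram("the theory then")
```

An LLM replaces the count table with a neural network conditioned on the whole context, but the supervision signal, predict the next token, is the same.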

Retrieval-Augmented Generation (RAG) and Curriculum Learning

Retrieval-Augmented Generation (RAG) enhances LLMs by incorporating external documents retrieved by a separate system. Research indicates that RAG provides significant performance gains, particularly for smaller models. However, these benefits saturate for larger models that have already memorized a substantial amount of information. A critical finding is the 'crossover point' where a minimum amount of pre-training (around 4 billion tokens for a 1 billion parameter model) is necessary for models to effectively leverage RAG. Curriculum learning, combined with model growth (Curriculum Guided Layer Scaling - CGLS), shows promise in improving learning efficiency. CGLS involves starting with a smaller model and simpler data, gradually increasing model size and data complexity. This approach demonstrated better performance than training a large model from scratch on all data or using a curriculum without model scaling, particularly for reasoning tasks, up to the 1 billion parameter scale. These findings suggest that balancing parametric learning with external memory through RAG and strategically scaling models alongside data curricula are key to optimizing performance.
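The RAG pipeline described above has two moving parts: a retriever that ranks external documents against the query, and a prompt builder that prepends the retrieved text before generation. The sketch below uses naive word-overlap scoring as the retriever; production systems use dense embeddings, but the flow is the same. Function names here are illustrative, not from the lecture.

```python
def retrieve(query, docs, k=1):
    # Rank documents by naive word-overlap with the query and
    # return the top-k. A stand-in for an embedding-based retriever.
    q = set(query.lower().split())
    ranked = sorted(docs, key=lambda d: -len(q & set(d.lower().split())))
    return ranked[:k]

def build_prompt(query, docs):
    # Prepend retrieved context so the model can ground its answer
    # in external memory rather than its parameters alone.
    context = "\n".join(retrieve(query, docs))
    return f"Context:\n{context}\n\nQuestion: {query}"
```

The crossover finding above then says: below roughly 4 pre-training tokens per parameter, the model cannot yet make use of the context this prompt supplies.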

Post-training adaptation and decision-making

Post-training focuses on adapting pre-trained models to specific tasks and users. Techniques include fine-tuning and prompt-based methods. Chain-of-Thought (CoT) prompting encourages models to reason step-by-step, revealing deeper reasoning capabilities, and can be extended with methods like Tree of Thoughts or by integrating external tools. Reinforcement learning with human feedback (RLHF) and its variants like Direct Preference Optimization (DPO) train models based on human preferences, though they can be prone to reward hacking. Process supervision offers intermediate rewards for multi-step tasks, improving reasoning accuracy. AI agents, systems that perceive, decide, and act, can self-improve through reflection on their actions and outputs, learning from past mistakes, and utilizing memory stores. They can also integrate external tools like APIs or databases to enhance decision-making and problem-solving. The alignment problem remains critical, ensuring models behave as intended and safely, avoiding shortcuts and unintended behaviors.
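Of the preference-tuning methods named above, DPO has a particularly compact objective, worth seeing concretely. The sketch below computes the standard DPO loss for one (chosen, rejected) response pair from summed log-probabilities under the policy and a frozen reference model; the numeric inputs are placeholders, not values from the lecture.

```python
import math

def dpo_loss(logp_w, logp_l, ref_logp_w, ref_logp_l, beta=0.1):
    # Direct Preference Optimization loss for one preference pair:
    # -log(sigmoid(beta * (policy margin - reference margin))).
    # logp_w / logp_l: policy log-probs of the chosen / rejected response;
    # ref_* are the same quantities under the frozen reference model.
    margin = beta * ((logp_w - ref_logp_w) - (logp_l - ref_logp_l))
    return -math.log(1.0 / (1.0 + math.exp(-margin)))
```

The loss shrinks as the policy assigns relatively more probability to the preferred response than the reference does, which is how human preferences shape the model without an explicit reward model.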

Applications beyond language and emerging challenges

Transformers have expanded beyond NLP into computer vision (Vision Transformers - ViTs), which process images by dividing them into patches, and neuroscience, where they analyze complex fMRI data by leveraging network priors for better interpretability and disease insight. CLIP aligns text and image representations, enabling cross-modal understanding. Despite broad success, current models face limitations. Hallucination, defined as 'world modeling errors' where the model's internal representation contradicts external truth, remains a significant challenge, impacting trust and reliability, especially in high-stakes domains. Issues like limited memory, computational complexity that scales quadratically with sequence length, and a lack of true world understanding persist. These limitations highlight the need for further advancements beyond next-token prediction.
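The patch-based processing that ViTs use can be shown directly. The sketch below splits a toy H×W "image" (nested lists standing in for pixel arrays) into non-overlapping patches and flattens each into a token vector, which is the step before a ViT linearly embeds the patches and feeds them to a standard transformer.

```python
def patchify(image, patch):
    # Split an H x W image into non-overlapping patch x patch blocks,
    # each flattened into one token vector, as a Vision Transformer
    # does before the patch-embedding layer.
    h, w = len(image), len(image[0])
    tokens = []
    for i in range(0, h, patch):
        for j in range(0, w, patch):
            tokens.append([image[r][c]
                           for r in range(i, i + patch)
                           for c in range(j, j + patch)])
    return tokens
```

After this step, an image is just a sequence of tokens, which is why the same attention machinery applies unchanged.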

The future: World models, State Space Models, and AGI

The pursuit of Artificial General Intelligence (AGI) drives research into new paradigms. 'World models,' such as JEPA (Joint Embedding Predictive Architecture), aim to learn structured representations of environments and predict future states, moving beyond simple token prediction and enabling more grounded planning and reasoning. State Space Models (SSMs), exemplified by Mamba, offer a more efficient alternative to transformers, featuring linear time scaling with sequence length and improved performance on long-context tasks by maintaining a compressed internal state. While SSMs present trade-offs in flexibility, they represent a promising direction. Key challenges to AGI include developing more robust long-term memory, achieving computational efficiency, enhancing model interpretability, ensuring robust alignment with human values, and overcoming limitations in current scaling laws. Research in continual learning, model editing, scalable oversight, and constitutional AI are ongoing efforts to address these complex issues and build more capable, reliable, and adaptable AI systems.
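The linear-time property claimed for SSMs comes from their recurrence: a fixed-size state is updated once per input, with no all-pairs attention. The sketch below is a scalar linear state-space scan with placeholder coefficients, far simpler than Mamba's selective, input-dependent parameters, but it shows where the O(n) cost comes from.

```python
def ssm_scan(xs, a=0.9, b=1.0, c=1.0):
    # Scalar linear state-space recurrence: h_t = a*h_{t-1} + b*x_t,
    # y_t = c*h_t. The state has constant size and each input triggers
    # exactly one update, so runtime is linear in sequence length.
    # Coefficients a, b, c are illustrative placeholders.
    h, ys = 0.0, []
    for x in xs:
        h = a * h + b * x   # state update
        ys.append(c * h)    # readout
    return ys
```

Contrast this with attention, where every token interacts with every other token: the compressed state trades some flexibility (the model must decide what to keep) for efficiency on long sequences.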

Common Questions

What does Stanford's CS25 course cover?

The CS25 course, 'Transformers United', focuses on the architecture and applications of Transformers, a key component in modern AI and machine learning systems. It aims to disseminate knowledge from experts in the field.

Topics

Mentioned in this video

Software & Apps
Word2Vec

A typical method for creating word embeddings, representing words as dense vectors in a high-dimensional space.

RNNs

Recurrent Neural Networks, sequence models that process input step-by-step while maintaining a hidden state, but suffer from long-range dependency issues.

MongoDB

Sponsor of the CS25 course, partnering with Modal AI House to provide students with opportunities to connect with AI leaders.

Mamba

A selective state space model architecture that models sequences using continuous state updates, offering linear time scaling.

AlphaFold

A DeepMind model that predicts protein 3D structure from amino acid sequences.

Python

A programming language whose interpreter can execute code generated as intermediate reasoning steps, an example of tool use.

fMRI

Functional Magnetic Resonance Imaging, used in neuroscience to capture changes in blood oxygenation across the brain.

AlphaGo

A DeepMind model that mastered the board game Go.

JEPA

A Joint Embedding Predictive Architecture proposed by Yann LeCun that moves beyond next-token prediction to learn how the world works by predicting latent representations.

Claude

Mentioned as one of the large language models used today, similar to GPT-5.

GPT-5

Mentioned as a large language model that requires millions of dollars to train, contrasting with the potential of smaller models.

RoBERTa

Mentioned as a benchmark for scaling laws in language models, pre-trained on approximately 20-30 billion tokens.

GPT-4

Mentioned as an example of a large language model that is computationally expensive to train.

CLIP

A model that aligns text and image representations by encoding them into vectors and updating models through paired data.

FastText

A typical method for creating word embeddings, representing words as dense vectors in a high-dimensional space.

LSTMs

Long Short-Term Memory networks, a gated variant of RNNs designed to better preserve long-range dependencies and mitigate forgetting problems.

Halo World

A benchmark being developed to evaluate hallucination in models, based on a unified definition of hallucination as incorrect world modeling.

Slido

A platform mentioned for asking questions during the Zoom lecture.
