Key Moments
Stanford CS230 | Autumn 2025 | Lecture 8: Agents, Prompts, and RAG
Large Language Models (LLMs) lack real-world context, are hard to control, and have limited memory; techniques like prompt engineering, RAG, and agents are crucial to overcome these limitations and build more capable AI systems.
Key Insights
While base LLMs have broad knowledge, they lack domain-specific knowledge and current information and can be difficult to control, as exemplified by Microsoft's Tay bot, which became racist within 16 hours.
Prompt engineering techniques like chain-of-thought (CoT) encourage step-by-step reasoning, while few-shot prompting aligns LLMs with specific tasks by providing examples within the prompt itself.
Retrieval Augmented Generation (RAG) integrates external knowledge sources by embedding documents into a vector database, allowing LLMs to access and cite up-to-date information without retraining, addressing limitations like knowledge cutoffs and hallucinations.
Agentic AI workflows move beyond single-step tasks to handle multi-step, autonomous processes by combining prompts, tools (APIs), and memory management (working and archival memory) to achieve complex goals.
Evaluating agentic systems involves both quantitative metrics (e.g., success rate of address updates) and qualitative analysis (e.g., user satisfaction, politeness), often using LLM judges to assess subjective aspects.
Future AI advancements are expected from architectural search beyond transformers, multimodality (combining text, image, audio, video), and the harmonious integration of various learning methods (supervised, unsupervised, RL) inspired by human learning.
Limitations of base LLMs and the need for augmentation
Standalone Large Language Models (LLMs) like GPT-3.5 Turbo or GPT-4 possess vast general knowledge but suffer from critical limitations. They often lack specific domain expertise, struggle with up-to-date information because of training data cutoffs, and can be difficult to control, which can lead to controversial or undesirable outputs, as seen with Microsoft's Tay bot, which became racist within 16 hours of deployment. LLMs can also underperform on narrow, specialized tasks and produce inconsistent styles, and their limited context windows restrict how much data they can process for applications like knowledge management. These issues call for techniques that augment LLMs, moving beyond basic prompting to more sophisticated methods.
Prompt engineering: Guiding LLM behavior
Prompt engineering is the first line of defense for maximizing LLM performance. The basic principle is to be specific about the desired output, including its audience and format. Few-shot prompting places examples directly in the prompt to align the model with a specific task or tone (e.g., classifying review sentiment), and it is quicker to experiment with than fine-tuning. Chain-of-thought (CoT) prompting encourages the LLM to break a problem into steps, which improves reasoning. Prompt templates standardize and personalize prompts. The 'centaur' and 'cyborg' approaches describe two styles of working with AI: delegating large tasks versus iterative, back-and-forth collaboration. A study of BCG consultants showed that training in prompt engineering significantly improved performance, highlighting its practical importance.
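The few-shot, template, and chain-of-thought ideas above can be sketched in a few lines. This is a minimal illustration, not the lecture's own code: the example reviews, the `audience` parameter, and the function names are all hypothetical, and the resulting string would still need to be sent to whichever model you use.

```python
from string import Template

# Hypothetical labelled examples; in practice these come from your own data.
FEW_SHOT_EXAMPLES = [
    ("The battery died after two days.", "negative"),
    ("Setup took five minutes and it just works.", "positive"),
]

# A reusable prompt template: specific about audience and output format.
PROMPT_TEMPLATE = Template(
    "You are classifying product reviews for a $audience.\n"
    "Answer with exactly one word: positive or negative.\n\n"
    "$examples\n\n"
    "Review: $review\n"
    "Sentiment:"
)


def build_few_shot_prompt(review: str, audience: str = "support team") -> str:
    """Fill the template with the few-shot examples and the new review."""
    examples = "\n\n".join(
        f"Review: {text}\nSentiment: {label}" for text, label in FEW_SHOT_EXAMPLES
    )
    return PROMPT_TEMPLATE.substitute(
        audience=audience, examples=examples, review=review
    )


def build_cot_prompt(question: str) -> str:
    """Append a chain-of-thought instruction asking for step-by-step reasoning."""
    return (
        f"{question}\n\n"
        "Think through the problem step by step, then give the final answer "
        "on a last line that starts with 'Answer:'."
    )
```

Because the examples live in the prompt rather than in model weights, swapping them out is a one-line change, which is what makes few-shot prompting faster to iterate on than fine-tuning.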
Chaining prompts for complex workflows
Chaining involves breaking down a complex task into a sequence of simpler prompts, allowing for better debugging and improved control. For example, a customer service response can be generated in stages: first extracting key issues from a customer review, then drafting an outline for a response, and finally writing the full professional reply. This modular approach allows engineers to identify and fix issues at each step, which is more manageable than debugging a single, monolithic prompt. Chaining can add latency, but it makes complex AI workflows easier to control and refine and their outcomes more predictable and robust. Chained workflows can be tested manually, or the testing can be automated with platforms and LLM judges.
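The three-stage customer service example can be sketched as follows. The `LLM` type is an assumed interface (prompt in, completion out) rather than any particular vendor's API, and `fake_llm` is a stand-in so the sketch runs offline.

```python
from typing import Callable

# Assumed interface: any function that takes a prompt and returns a completion.
LLM = Callable[[str], str]


def run_chain(review: str, llm: LLM) -> dict[str, str]:
    """Run three prompts in sequence and keep every intermediate output."""
    issues = llm(f"List the key issues raised in this customer review:\n{review}")
    outline = llm(
        f"Draft a short outline for a reply that addresses these issues:\n{issues}"
    )
    reply = llm(f"Write a professional reply that follows this outline:\n{outline}")
    # Returning all three stages lets you inspect where a bad reply went wrong.
    return {"issues": issues, "outline": outline, "reply": reply}


def fake_llm(prompt: str) -> str:
    """Stand-in model for offline runs; it echoes the first line of the prompt."""
    return f"[model output for: {prompt.splitlines()[0]}]"
```

Calling `run_chain("The package arrived late and damaged.", fake_llm)` returns a dictionary with one entry per stage. Each stage costs one model call, which is where the added latency comes from.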
Retrieval Augmented Generation (RAG) for factual grounding
RAG addresses LLM limitations by integrating external knowledge sources, such as documents or databases. The process involves embedding documents into a vector database, then using the user's query (also embedded) to retrieve the most relevant passages. These retrieved passages are then added to the LLM's prompt, grounding its response in factual, up-to-date information. This significantly reduces hallucinations and allows for sourcing, which is critical in fields like medicine or law. Techniques like chunking and hypothetical document embeddings (HyDE) improve RAG's effectiveness for handling very large documents or queries that don't perfectly match document phrasing, making LLM outputs more reliable and verifiable.
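The embed, retrieve, and augment steps can be sketched with the standard library alone. Two loud assumptions: `embed` here is a toy word-count vector standing in for a learned embedding model, and a plain Python list stands in for the vector database. All function names and the document ids are hypothetical.

```python
import math
import re
from collections import Counter


def chunk(text: str, max_words: int = 40) -> list[str]:
    """Split a long document into fixed-size word windows."""
    words = text.split()
    return [" ".join(words[i:i + max_words]) for i in range(0, len(words), max_words)]


def embed(text: str) -> Counter:
    """Toy embedding: lowercase word counts (a real system uses a learned model)."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(
        sum(v * v for v in b.values())
    )
    return dot / norm if norm else 0.0


def build_index(docs: dict[str, str]) -> list[tuple[str, str, Counter]]:
    """Embed every chunk once, keeping the source id for later citation."""
    return [(doc_id, c, embed(c)) for doc_id, text in docs.items() for c in chunk(text)]


def retrieve(query: str, index: list, k: int = 2) -> list[tuple[str, str]]:
    """Return the k passages whose embeddings are closest to the query's."""
    q = embed(query)
    ranked = sorted(index, key=lambda entry: cosine(q, entry[2]), reverse=True)
    return [(doc_id, passage) for doc_id, passage, _ in ranked[:k]]


def build_rag_prompt(query: str, index: list, k: int = 2) -> str:
    """Put the retrieved passages, tagged with source ids, into the prompt."""
    context = "\n".join(f"[{doc_id}] {p}" for doc_id, p in retrieve(query, index, k))
    return (
        "Answer using only the sources below and cite them by id.\n\n"
        f"Sources:\n{context}\n\nQuestion: {query}\nAnswer:"
    )
```

Under the same assumptions, HyDE would change only the retrieval input: instead of embedding the raw query, you would first ask the model to draft a hypothetical answer document and embed that draft.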
Agentic AI workflows: Autonomous multi-step tasks
Agentic AI workflows, a term coined by Andrew Ng, extend LLMs from performing single tasks to handling complex, multi-step processes autonomously. The architecture combines prompts, memory systems for context (working and archival), and tools such as APIs that agents call to achieve their goals. An example is a travel booking agent that can search for flights, book hotels, and create an itinerary, interacting with users and external services. Agents can operate with varying degrees of autonomy, from hardcoded steps to fully autonomous decision-making, potentially even writing code. Model-driven communication protocols like MCP (Model Context Protocol) offer an efficient way for agents to interact with services.
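A minimal sketch of the loop behind such an agent, using the travel example: a policy picks the next tool, the tool runs, and the result is appended to working memory. Everything here is hypothetical. The two tools are stubs rather than real booking APIs, `scripted_policy` stands in for the LLM's decision step, and archival memory is left out.

```python
from typing import Callable

Tool = Callable[..., str]


def search_flights(origin: str, destination: str) -> str:
    """Stub tool; a real agent would call a flight-search API here."""
    return f"Found flight {origin}->{destination}"


def book_hotel(city: str) -> str:
    """Stub tool; a real agent would call a hotel-booking API here."""
    return f"Booked hotel in {city}"


TOOLS: dict[str, Tool] = {"search_flights": search_flights, "book_hotel": book_hotel}


def run_agent(goal: str, decide: Callable[[list[str]], dict], max_steps: int = 5) -> list[str]:
    """Loop: ask the policy for an action, run the tool, record the result."""
    working_memory = [f"goal: {goal}"]
    for _ in range(max_steps):  # hard cap so a confused policy cannot loop forever
        action = decide(working_memory)
        if action["tool"] == "finish":
            break
        result = TOOLS[action["tool"]](**action["args"])
        working_memory.append(f"{action['tool']} -> {result}")
    return working_memory


def scripted_policy(memory: list[str]) -> dict:
    """Stand-in for the LLM: chooses the next step from how much has been done."""
    plan = [
        {"tool": "search_flights", "args": {"origin": "SFO", "destination": "JFK"}},
        {"tool": "book_hotel", "args": {"city": "New York"}},
        {"tool": "finish", "args": {}},
    ]
    return plan[min(len(memory) - 1, len(plan) - 1)]
```

Replacing `scripted_policy` with a model call that reads the working memory and returns the same action dictionary is what moves this from hardcoded steps toward autonomous decision-making.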
Evaluating and improving agentic systems
Assessing the performance of agentic AI systems is crucial. This involves a mix of quantitative and qualitative evaluations. Quantitative metrics include success rates for tasks (e.g., address updates), latency, and cost. Qualitative assessments, often involving human review or LLM judges, are vital for subjective aspects like politeness, tone, and overall user satisfaction. Error analysis, distinguishing between objective errors (e.g., incorrect order ID lookup) and subjective preferences (e.g., direct vs. indirect flights), helps pinpoint areas for improvement. LLM traces are essential for debugging complex agentic workflows by providing visibility into the sequence of prompts and tool calls.
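Both kinds of evaluation can be sketched over a list of logged runs. The `RunRecord` fields, the 1-to-5 politeness scale, and the `judge` callable are assumptions made for illustration; the judge would normally be a second LLM call, and here any function from prompt to text will do.

```python
from dataclasses import dataclass
from statistics import mean
from typing import Callable


@dataclass
class RunRecord:
    """One logged agent run (hypothetical schema)."""
    task: str
    succeeded: bool
    latency_s: float
    reply: str


def quantitative_report(runs: list[RunRecord]) -> dict[str, float]:
    """Objective metrics: task success rate and mean latency."""
    if not runs:
        raise ValueError("need at least one run to evaluate")
    return {
        "success_rate": sum(r.succeeded for r in runs) / len(runs),
        "mean_latency_s": mean(r.latency_s for r in runs),
    }


def judge_prompt(reply: str) -> str:
    """Prompt asking an LLM judge to score a subjective quality."""
    return (
        "Rate the politeness of this reply from 1 (rude) to 5 (very polite). "
        "Respond with a single digit.\n\n"
        f"Reply: {reply}"
    )


def qualitative_score(runs: list[RunRecord], judge: Callable[[str], str]) -> float:
    """Average the judge's 1-5 scores across all replies."""
    return mean(int(judge(judge_prompt(r.reply)).strip()) for r in runs)
```

Keeping the objective and subjective numbers separate mirrors the error-analysis split described above: a wrong order ID shows up in `success_rate`, while tone problems show up only in the judge score.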
Multi-agent systems and future trends
Multi-agent systems involve multiple specialized agents working collaboratively, often in parallel, to achieve complex goals, which can improve efficiency and reusability. For example, a smart home system might have agents for climate control, security, and energy management, orchestrated by a central agent. Future trends in AI include architectural search beyond current transformer models to reduce compute reliance, advancements in multimodality where systems learn from text, images, audio, and video synergistically to improve overall understanding, and the harmonious integration of various learning paradigms. Because AI develops so quickly and the half-life of specialized skills is short, breadth of understanding and the ability to dive deep into a specific technique quickly are key.
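The smart home example can be sketched as an orchestrator that fans a request out to specialist agents in parallel and collects their answers. The three agents below are hypothetical stubs that return canned strings; in a real system each would wrap its own prompts, tools, and memory.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable


def climate_agent(request: str) -> str:
    """Stub specialist for climate control."""
    return f"climate: adjusted thermostat for '{request}'"


def security_agent(request: str) -> str:
    """Stub specialist for security."""
    return f"security: checked locks for '{request}'"


def energy_agent(request: str) -> str:
    """Stub specialist for energy management."""
    return f"energy: updated schedule for '{request}'"


SPECIALISTS: dict[str, Callable[[str], str]] = {
    "climate": climate_agent,
    "security": security_agent,
    "energy": energy_agent,
}


def orchestrate(request: str) -> dict[str, str]:
    """Central agent: run every specialist concurrently and gather the results."""
    with ThreadPoolExecutor() as pool:
        futures = {
            name: pool.submit(agent, request) for name, agent in SPECIALISTS.items()
        }
        return {name: future.result() for name, future in futures.items()}
```

Threads suit this sketch because real agents spend most of their time waiting on model and API calls. Each specialist stays a plain function, so it can be reused by a different orchestrator unchanged.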
Common Questions
Base LLMs often lack domain-specific knowledge, struggle with current information (cutoff dates), can output inconsistent styles, and have limited context windows, making them prone to knowledge gaps and hallucinations without external augmentation. They also struggle to provide sources for their information.
Mentioned in this video
The Stanford Deep Learning course where this lecture is given.
Generative Adversarial Networks, mentioned in the context of distribution shifts between training data and real-world data.
An AI benchmark problem designed to test an LLM's ability to retrieve a specific fact from a very large text corpus.
A popular prompting method where the model is instructed to think step-by-step to improve performance and control.
A technique to improve RAGs where a user query is used to generate a fake, hallucinated document, which is then embedded to find closer vector matches in the database.
A type of machine learning where an agent learns to make decisions by performing actions in an environment to maximize a reward, which has a specific definition in RL separate from agentic workflows.
A pre-trained large language model (LLM) mentioned as an example of a base model with limitations.
A pre-trained large language model (LLM) from OpenAI, used as an example of an improving base model.
An LLM from Elon Musk's xAI, mentioned in the context of political bias debates and as an alternative LLM for evaluation.
Upcoming LLM from OpenAI, mentioned as the next iteration of foundation models expected to improve performance, possibly by packaging other models.
A platform used by the instructor's team to automate part of prompt testing, enabling running prompts on different LLMs and using LLM judges.
An open-source LLM by Meta, mentioned as an alternative model to compare against GPT-4 and Grok for politeness evaluations.
OpenAI's chatbot interface, mentioned as a potential user of hidden system prompts for user interaction.
Former US President, whose 'covfefe' tweet is an example of an LLM not being up-to-date with new words/trends.
Mentioned in a debate with Sam Altman about the political bias of their respective LLMs, Grok and OpenAI models.
Co-founder of OpenAI, mentioned in a debate with Elon Musk about the political bias of their LLMs.
Coined the term 'agentic AI workflows' to bring clarity to the diverse interpretations of 'agents' within the industry.
OpenAI co-founder, who raised the question about whether LLMs are plateauing in their performance improvements.
Provided an example of Slack fine-tuning gone wrong, where a model trained on company Slack messages started acting too human.
Author of the 'Foundation' series, whose work is referenced to illustrate how individuals can have a tremendous impact on the future through their decisions, similar to the discovery of transformers in AI.
The company that created the Tay Twitter bot, which quickly became racist, highlighting the difficulty in controlling LLMs.
The company developing LLMs like GPT-4, mentioned in the context of control issues with their models and prompt templates.
The company that coined the term 'MCP' (Model Context Protocol) to describe a system that simplifies LLM communication with endpoints.
Online platform where prompt repositories, such as "awesome prompt template repo," can be found for free, offering examples of effective prompts.
Co-authored a study with HBS and UPenn on consultant performance with AI access and prompt engineering training.
Involved in a study on BCG consultants and AI usage, contributing insights on prompt engineering's impact.
Consultants from BCG were part of a study evaluating AI's impact on human performance with different levels of prompt engineering training.