Stanford CS230 | Autumn 2025 | Lecture 8: Agents, Prompts, and RAG

Stanford Online
Education | 5 min read | 110 min video
Nov 21, 2025 | 431,325 views

TL;DR

Large Language Models (LLMs) lack real-world context, are hard to control, and have limited memory; techniques like prompt engineering, RAG, and agents are crucial to overcome these limitations and build more capable AI systems.

Key Insights

1. While base LLMs have broad knowledge, they lack domain-specific knowledge and current information, and they can be difficult to control, as exemplified by Microsoft's Tay bot becoming racist within 16 hours.

2. Prompt engineering techniques like chain-of-thought (CoT) encourage step-by-step reasoning, while few-shot prompting aligns LLMs with specific tasks by providing examples within the prompt itself.

3. Retrieval Augmented Generation (RAG) integrates external knowledge sources by embedding documents into a vector database, allowing LLMs to access and cite up-to-date information without retraining, addressing limitations like knowledge cutoffs and hallucinations.

4. Agentic AI workflows move beyond single-step tasks to handle multi-step, autonomous processes by combining prompts, tools (APIs), and memory management (working and archival memory) to achieve complex goals.

5. Evaluating agentic systems involves both quantitative metrics (e.g., success rate of address updates) and qualitative analysis (e.g., user satisfaction, politeness), often using LLM judges to assess subjective aspects.

6. Future AI advancements are expected from architectural search beyond transformers, multimodality (combining text, image, audio, video), and the harmonious integration of various learning methods (supervised, unsupervised, RL) inspired by human learning.

Limitations of base LLMs and the need for augmentation

Standalone Large Language Models (LLMs) like GPT-3.5 Turbo or GPT-4 possess vast general knowledge but suffer from critical limitations. They often lack specific domain expertise, struggle with up-to-date information due to training data cutoffs, and can be difficult to control, leading to potentially controversial or undesirable outputs, as seen with Microsoft's Tay bot which became racist within 16 hours of deployment. Furthermore, LLMs can underperform on narrow, specialized tasks, may produce inconsistent styles, and are constrained by limited context windows, hindering their ability to process large amounts of data necessary for applications like knowledge management. These issues necessitate techniques to augment LLMs, moving beyond basic prompting to more sophisticated methods.

Prompt engineering: Guiding LLM behavior

Prompt engineering is the first technique to reach for when improving LLM performance. The basic principle is to be specific about the desired output, including its audience and format. Advanced techniques include few-shot prompting, where examples are placed directly in the prompt to align the model with a specific task or tone (e.g., classifying review sentiment); this is quicker to experiment with than fine-tuning. Chain-of-thought (CoT) prompting encourages the LLM to break a problem into steps, which improves reasoning. Prompt templates standardize and personalize prompts. The 'centaur' and 'cyborg' approaches describe two styles of working with AI: delegating large tasks versus iterative, back-and-forth collaboration. A study of consultants discussed in the lecture showed that training in prompt engineering significantly improved performance, highlighting its practical importance.
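The few-shot and chain-of-thought ideas above can be sketched as a small prompt template. This is an illustrative sketch only: the function name `build_few_shot_prompt`, the instruction wording, and the example reviews are assumptions, not material from the lecture.

```python
def build_few_shot_prompt(examples, review, chain_of_thought=False):
    """Assemble a sentiment-classification prompt from labeled examples.

    examples: list of (review_text, label) pairs shown to the model.
    review: the new review the model should classify.
    chain_of_thought: if True, ask the model to reason step by step first.
    """
    lines = ["Classify the sentiment of each review as positive or negative."]
    if chain_of_thought:
        # Hypothetical CoT instruction; the exact wording is an assumption.
        lines.append("Think step by step before giving the final label.")
    lines.append("")
    for text, label in examples:
        lines.append(f"Review: {text}")
        lines.append(f"Sentiment: {label}")
        lines.append("")
    # The prompt ends on an open slot so the model completes the label.
    lines.append(f"Review: {review}")
    lines.append("Sentiment:")
    return "\n".join(lines)


if __name__ == "__main__":
    shots = [("Loved it", "positive"), ("Broke in a day", "negative")]
    print(build_few_shot_prompt(shots, "Works fine", chain_of_thought=True))
```

Keeping the template in one function makes it easy to swap examples or toggle the step-by-step instruction while experimenting, without touching the calling code.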

Chaining prompts for complex workflows

Chaining involves breaking down a complex task into a sequence of simpler prompts, allowing for better debugging and improved control. For example, a customer service response can be generated in stages: first extracting key issues from a customer review, then drafting an outline for a response, and finally writing the full professional reply. This modular approach allows engineers to identify and fix issues at each step, which is more manageable than debugging a single, monolithic prompt. While chaining can introduce latency, it offers significant advantages in controlling and refining complex AI workflows, facilitating more predictable and robust outcomes. Testing these chained workflows can be done manually or automated using platforms and LLM judges.
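The three-stage customer service example above might look like the following sketch. The `llm` argument is any function that maps a prompt string to a completion string; `fake_llm` is a hypothetical stand-in so the chain can run without a model, and the prompt wording is an assumption.

```python
from typing import Callable, Dict


def answer_review(review: str, llm: Callable[[str], str]) -> Dict[str, str]:
    """Run a three-step prompt chain and keep every intermediate output."""
    # Step 1: extract the key issues from the customer review.
    issues = llm(f"List the key issues raised in this customer review:\n{review}")
    # Step 2: draft an outline for the response, using only step 1's output.
    outline = llm(f"Draft an outline for a reply that addresses these issues:\n{issues}")
    # Step 3: write the full professional reply from the outline.
    reply = llm(f"Write a full professional reply following this outline:\n{outline}")
    # Returning all stages lets an engineer inspect where a bad reply went wrong.
    return {"issues": issues, "outline": outline, "reply": reply}


def fake_llm(prompt: str) -> str:
    """Hypothetical stand-in for a model call: echoes the instruction line."""
    return f"<output of: {prompt.splitlines()[0]}>"


if __name__ == "__main__":
    for stage, text in answer_review("My order arrived late.", fake_llm).items():
        print(stage, "->", text)
```

Each stage costs one model call, which is where the added latency comes from; in exchange, every intermediate output can be checked on its own.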

Retrieval Augmented Generation (RAG) for factual grounding

RAG addresses LLM limitations by integrating external knowledge sources, such as documents or databases. The process involves embedding documents into a vector database, then using the user's query (also embedded) to retrieve the most relevant passages. These retrieved passages are then added to the LLM's prompt, grounding its response in factual, up-to-date information. This significantly reduces hallucinations and allows for sourcing, which is critical in fields like medicine or law. Techniques like chunking and hypothetical document embeddings (HyDE) improve RAG's effectiveness for handling very large documents or queries that don't perfectly match document phrasing, making LLM outputs more reliable and verifiable.
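A minimal sketch of the retrieve-then-prompt loop described above. To stay self-contained it swaps the learned embedding model and vector database for a toy bag-of-words vector and an in-memory list; the function names and the prompt wording are assumptions for illustration.

```python
import math
from collections import Counter


def embed(text):
    """Toy embedding: word counts. A real system would use a learned model."""
    return Counter(text.lower().split())


def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[token] * b[token] for token in a)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(
        sum(v * v for v in b.values())
    )
    return dot / norm if norm else 0.0


def retrieve(query, chunks, k=2):
    """Return the k chunks most similar to the query."""
    q = embed(query)
    return sorted(chunks, key=lambda c: cosine(q, embed(c)), reverse=True)[:k]


def build_rag_prompt(query, chunks, k=2):
    """Put the retrieved passages into the prompt, numbered for citation."""
    passages = retrieve(query, chunks, k)
    sources = "\n".join(f"[{i + 1}] {p}" for i, p in enumerate(passages))
    return (
        "Answer using only the sources below and cite them by number.\n\n"
        f"Sources:\n{sources}\n\nQuestion: {query}\nAnswer:"
    )


if __name__ == "__main__":
    docs = [
        "the refund policy allows returns within 30 days",
        "shipping takes five business days",
        "our office is closed on public holidays",
    ]
    print(build_rag_prompt("what is the refund policy", docs, k=1))
```

Numbering the passages in the prompt is one simple way to let the model point back to its sources, which is what makes the answer verifiable.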

Agentic AI workflows: Autonomous multi-step tasks

Agentic AI workflows extend LLMs from performing single tasks to handling complex, multi-step processes autonomously. The term was coined by Andrew Ng. An agent combines prompts, tools (APIs), and memory (working and archival) to pursue a goal. An example is a travel booking agent that searches for flights, books hotels, and creates an itinerary while interacting with users and external services. Agents can operate with varying degrees of autonomy, from hardcoded steps to fully autonomous decision-making, potentially even writing code. Model-driven communication protocols such as MCP (Model Context Protocol) offer an efficient way for agents to interact with services.
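The travel booking example can be sketched as a loop in which a policy picks the next tool and the result is appended to working memory. Everything here is hypothetical: the tool functions return canned strings, and `scripted_policy` stands in for the LLM that would normally choose the next action. Archival memory is left out to keep the sketch short.

```python
def search_flights(destination):
    """Hypothetical tool: a real agent would call a flight-search API here."""
    return f"flight to {destination} found"


def book_hotel(destination):
    """Hypothetical tool: a real agent would call a hotel-booking API here."""
    return f"hotel in {destination} booked"


TOOLS = {"search_flights": search_flights, "book_hotel": book_hotel}


def scripted_policy(goal, working_memory):
    """Stand-in for the LLM: choose the first tool not yet used, then finish."""
    used = {step["tool"] for step in working_memory}
    for name in ("search_flights", "book_hotel"):
        if name not in used:
            return {"tool": name, "arg": goal}
    return {"tool": "finish", "arg": None}


def run_agent(goal, policy, max_steps=5):
    """Loop: ask the policy for an action, run the tool, record the result."""
    working_memory = []  # short-term record of steps taken in this run
    for _ in range(max_steps):  # hard cap so a bad policy cannot loop forever
        action = policy(goal, working_memory)
        if action["tool"] == "finish":
            break
        result = TOOLS[action["tool"]](action["arg"])
        working_memory.append({"tool": action["tool"], "result": result})
    return working_memory


if __name__ == "__main__":
    for step in run_agent("Paris", scripted_policy):
        print(step)
```

Replacing `scripted_policy` with a function that prompts a model moves the same loop from the hardcoded end of the autonomy range toward the autonomous end.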

Evaluating and improving agentic systems

Assessing the performance of agentic AI systems is crucial. This involves a mix of quantitative and qualitative evaluations. Quantitative metrics include success rates for tasks (e.g., address updates), latency, and cost. Qualitative assessments, often involving human review or LLM judges, are vital for subjective aspects like politeness, tone, and overall user satisfaction. Error analysis, distinguishing between objective errors (e.g., incorrect order ID lookup) and subjective preferences (e.g., direct vs. indirect flights), helps pinpoint areas for improvement. LLM traces are essential for debugging complex agentic workflows by providing visibility into the sequence of prompts and tool calls.
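A small sketch of how the quantitative and qualitative measurements above might be combined in one report. The run records, field names, and `keyword_judge` are assumptions; in practice the judge would be an LLM prompted with a rubric, not a keyword check.

```python
from statistics import mean


def keyword_judge(reply):
    """Hypothetical stand-in for an LLM judge: 1.0 if the reply sounds polite."""
    polite_markers = ("please", "thank", "sorry")
    return 1.0 if any(word in reply.lower() for word in polite_markers) else 0.0


def summarize_runs(runs, judge):
    """Aggregate objective metrics and judge scores over logged agent runs."""
    return {
        # Quantitative: did the task (e.g. an address update) succeed?
        "success_rate": mean(1.0 if r["success"] else 0.0 for r in runs),
        # Quantitative: average wall-clock latency in seconds.
        "avg_latency_s": mean(r["latency_s"] for r in runs),
        # Qualitative: average politeness score assigned by the judge.
        "avg_politeness": mean(judge(r["reply"]) for r in runs),
    }


if __name__ == "__main__":
    logged = [
        {"success": True, "latency_s": 1.0, "reply": "Thank you, your address is updated."},
        {"success": False, "latency_s": 3.0, "reply": "Done."},
    ]
    print(summarize_runs(logged, keyword_judge))
```

Passing the judge in as a function keeps the report code unchanged when the keyword check is swapped for a model-based judge or a human rating.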

Multi-agent systems and future trends

Multi-agent systems involve multiple specialized agents working collaboratively, often in parallel, to achieve complex goals, which can improve efficiency and reusability. For example, a smart home system might have agents for climate control, security, and energy management, orchestrated by a central agent. Future trends in AI include architectural search beyond current transformer models to reduce compute reliance, advancements in multimodality where systems learn from text, images, audio, and video synergistically to improve overall understanding, and the harmonious integration of various learning paradigms. The rapid pace of AI development means a focus on breadth of understanding and the ability to quickly dive deep into specific techniques is key, as the half-life of specialized skills is low.
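The smart home example mentioned above can be sketched as a central orchestrator that runs specialized agents in parallel. The agents here are plain rule-based functions with made-up thresholds and state fields, purely as an assumption for illustration; in a real system each could wrap its own prompts, tools, and memory.

```python
from concurrent.futures import ThreadPoolExecutor


def climate_agent(state):
    """Hypothetical climate agent: the 22 C threshold is an arbitrary choice."""
    return "lower heating" if state["temperature_c"] > 22 else "hold temperature"


def security_agent(state):
    """Hypothetical security agent: lock the doors if they are open."""
    return "no action" if state["doors_locked"] else "lock doors"


def energy_agent(state):
    """Hypothetical energy agent: postpone heavy loads when power is expensive."""
    return "defer dishwasher" if state["grid_price_high"] else "no action"


AGENTS = {"climate": climate_agent, "security": security_agent, "energy": energy_agent}


def orchestrate(state):
    """Central agent: fan the shared state out to every specialist in parallel."""
    with ThreadPoolExecutor() as pool:
        futures = {name: pool.submit(agent, state) for name, agent in AGENTS.items()}
        # Collect each specialist's decision under its own name.
        return {name: future.result() for name, future in futures.items()}


if __name__ == "__main__":
    home = {"temperature_c": 24, "doors_locked": False, "grid_price_high": False}
    print(orchestrate(home))
```

Because each specialist only reads the shared state and returns a decision, agents can be added, removed, or reused in another system without changing the orchestrator.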

Common Questions

Base LLMs often lack domain-specific knowledge, struggle with current information (cutoff dates), can output inconsistent styles, and have limited context windows, making them prone to knowledge gaps and hallucinations without external augmentation. They also struggle to provide sources for their information.
