What is the difference between centaurs and cyborgs in AI interaction?

Centaurs are individuals who delegate large, distinct tasks to AI, then review the complete output, much like a project manager. Cyborgs integrate AI into their workflow in a tightly blended, back-and-forth manner, rapidly iterating with the model on smaller tasks.

How can 'few-shot' prompting improve LLM performance for specific tasks?

Few-shot prompting provides the LLM with a few examples of desired input-output pairs directly within the prompt. This helps align the model's understanding to the specific task and desired tone or format, especially for subjective classifications where a zero-shot approach might be ambiguous.
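As a rough illustration of the idea, the sketch below assembles a few-shot prompt for review sentiment classification. The function name build_few_shot_prompt, the "Review:" and "Sentiment:" labels, and the example reviews are all invented for this sketch; they are not taken from the lecture.

```python
from typing import List, Tuple


def build_few_shot_prompt(task: str, examples: List[Tuple[str, str]], query: str) -> str:
    """Assemble a few-shot prompt: task description, labeled examples, then the new input."""
    lines = [task, ""]
    for text, label in examples:
        lines.append(f"Review: {text}")
        lines.append(f"Sentiment: {label}")
        lines.append("")
    # Leave the final label blank so the model completes it.
    lines.append(f"Review: {query}")
    lines.append("Sentiment:")
    return "\n".join(lines)


# Hypothetical examples, invented for illustration.
EXAMPLES = [
    ("The battery died after two days.", "negative"),
    ("Works fine, nothing special.", "neutral"),
    ("Best purchase I have made this year!", "positive"),
]

prompt = build_few_shot_prompt(
    "Classify the sentiment of each product review as positive, neutral, or negative.",
    EXAMPLES,
    "Shipping was slow but the product is great.",
)
print(prompt)
```

The resulting string would be sent to whichever model is in use; the examples show the model where the boundary between labels lies, which a zero-shot instruction alone may leave ambiguous.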

Why is 'chaining prompts' considered more effective than a single complex prompt?

Chaining prompts involves breaking a complex task into multiple smaller, sequential prompts. This modular approach allows for easier debugging by isolating issues to specific steps, better control over the workflow, and more straightforward optimization as each sub-prompt can be evaluated and refined independently.

What is Retrieval Augmented Generation (RAG) and why is it important?

RAG integrates external knowledge sources (like databases or documents) with an LLM. It allows the LLM to retrieve relevant information from these sources and use it as context for its response, making answers more accurate, up-to-date, grounded in factual data, and capable of providing sources, while also addressing context window limitations.

What is the key difference between a vanilla RAG and improved RAG techniques like Hypothetical Document Embeddings (HyDE) or chunking?

A vanilla RAG embeds full documents and retrieves them based on query similarity. Improved techniques like chunking involve breaking large documents into smaller, searchable segments (e.g., chapters) to provide more precise context. HyDE, on the other hand, generates a hypothetical document from the user's query to create an embedding that is more likely to match the dense document embeddings, addressing the mismatch between short queries and long documents.
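Chunking can be sketched in a few lines. The version below splits on fixed-size word windows with a small overlap, which is an assumption made for this sketch: the text above mentions chapter-level segments, and real systems often split on headings, paragraphs, or token counts instead. The name chunk_text and its parameters are hypothetical.

```python
from typing import List


def chunk_text(text: str, chunk_size: int = 50, overlap: int = 10) -> List[str]:
    """Split text into chunks of up to chunk_size words; neighbours share overlap words."""
    if chunk_size <= 0 or not 0 <= overlap < chunk_size:
        raise ValueError("need chunk_size > 0 and 0 <= overlap < chunk_size")
    words = text.split()
    step = chunk_size - overlap
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + chunk_size]))
        if start + chunk_size >= len(words):
            break  # this chunk already reaches the end of the text
    return chunks


# A synthetic 120-word document: w0 w1 ... w119
document = " ".join(f"w{i}" for i in range(120))
chunks = chunk_text(document, chunk_size=50, overlap=10)
print(len(chunks), "chunks")
```

Each chunk would then be embedded and stored separately, so retrieval can return the specific segment relevant to a query rather than the whole document.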

What is the 'paradigm shift' in software engineering due to agentic AI?

The shift involves moving from traditional deterministic software dealing with structured data to fuzzy, dynamic agentic AI software that handles free-form text and images with dynamic interpretation. This requires a shift in mindset, treating software like a manager delegating tasks to specialized 'agents' rather than fixed engineering boxes, and embracing a higher tolerance for throwing away code due to faster experimentation cycles.

What are the core components of an AI agent?

An AI agent typically consists of optimized prompts, a context management system (memory, both working and archival) that retains user history and preferences, and a suite of tools (APIs or Model Context Protocol (MCP) servers) for interacting with external systems such as flight search, databases, or payment processing.

How does a multi-agent workflow differ from a single agent with multiple steps?

While a single agent can perform multi-step tasks, a multi-agent workflow leverages multiple, specialized agents that can run tasks in parallel. This often improves efficiency through parallelization and allows agents to be reused across different teams or contexts within an organization, making debugging specialized components easier.

What are some key trends shaping the future of AI?

Future trends include finding new architectural designs beyond transformers to overcome current plateauing in LLM performance, leveraging multimodality (integrating text, images, audio, video) for holistic model improvement, combining various learning methods (supervised, unsupervised, self-supervised, reinforcement learning), and exploring both human-centric and non-human-centric approaches to AI development. The rapid velocity of change also necessitates continuous learning.

Key Moments

Stanford CS230 | Autumn 2025 | Lecture 8: Agents, Prompts, and RAG

Stanford Online

Education | 5 min read | 110 min video

Nov 21, 2025|453,849 views|9,048|183

TL;DR

Large Language Models (LLMs) lack real-world context, are hard to control, and have limited memory; techniques like prompt engineering, RAG, and agents are crucial to overcome these limitations and build more capable AI systems.

Key Insights

While base LLMs have broad knowledge, they lack domain-specific knowledge and current information, and they can be difficult to control, as exemplified by Microsoft's Tay bot, which became racist within 16 hours.

Prompt engineering techniques like chain-of-thought (CoT) encourage step-by-step reasoning, while few-shot prompting aligns LLMs with specific tasks by providing examples within the prompt itself.

Retrieval Augmented Generation (RAG) integrates external knowledge sources by embedding documents into a vector database, allowing LLMs to access and cite up-to-date information without retraining, addressing limitations like knowledge cutoffs and hallucinations.

Agentic AI workflows move beyond single-step tasks to handle multi-step, autonomous processes by combining prompts, tools (APIs), and memory management (working and archival memory) to achieve complex goals.

Evaluating agentic systems involves both quantitative metrics (e.g., success rate of address updates) and qualitative analysis (e.g., user satisfaction, politeness), often using LLM judges to assess subjective aspects.

Future AI advancements are expected from architectural search beyond transformers, multimodality (combining text, image, audio, video), and the harmonious integration of various learning methods (supervised, unsupervised, RL) inspired by human learning.

Limitations of base LLMs and the need for augmentation

Standalone Large Language Models (LLMs) like GPT-3.5 Turbo or GPT-4 possess vast general knowledge but suffer from critical limitations. They often lack specific domain expertise, struggle with up-to-date information due to training data cutoffs, and can be difficult to control, leading to potentially controversial or undesirable outputs, as seen with Microsoft's Tay bot, which became racist within 16 hours of deployment. Furthermore, LLMs can underperform on narrow, specialized tasks, may produce inconsistent styles, and are constrained by limited context windows, which hinders their ability to process the large amounts of data needed for applications like knowledge management. These issues call for techniques that augment LLMs, moving beyond basic prompting to more sophisticated methods.

Prompt engineering: Guiding LLM behavior

Prompt engineering is the first line of defense for maximizing LLM performance. Basic principles involve being specific about the desired output, including audience and format. Advanced techniques include few-shot prompting, where examples are provided directly in the prompt to align the model with a specific task or tone (e.g., classifying review sentiment), which is quicker for experimentation than fine-tuning. Chain-of-thought (CoT) prompting encourages the LLM to break down a problem into steps, improving reasoning. Prompt templates can standardize and personalize prompts, while the 'centaur' and 'cyborg' approaches describe different styles of interacting with AI: delegating large tasks versus iterative, back-and-forth collaboration. A study of BCG consultants, involving Harvard Business School and the Wharton School, showed that training in prompt engineering significantly improved performance, highlighting its practical importance.
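A prompt template that fixes audience and format and adds a chain-of-thought instruction might look like the sketch below. The template wording, the name render_prompt, and the sample values are assumptions made for illustration; only Python's standard string.Template is used.

```python
from string import Template

# Hypothetical template: audience and format are slots, the step-by-step
# instruction is the chain-of-thought part.
COT_TEMPLATE = Template(
    "You are writing for $audience.\n"
    "Task: $task\n"
    "Think through the problem step by step, then give the final answer "
    "on its own line starting with 'Answer:'.\n"
    "Format: $output_format"
)


def render_prompt(audience: str, task: str, output_format: str) -> str:
    """Fill the template; substitute() raises KeyError if a slot is left unfilled."""
    return COT_TEMPLATE.substitute(
        audience=audience, task=task, output_format=output_format
    )


cot_prompt = render_prompt(
    audience="a non-technical manager",
    task="Explain why the monthly report job failed twice this week.",
    output_format="three short bullet points",
)
print(cot_prompt)
```

Keeping the fixed instructions in one template means every request states audience and format the same way, while the slots allow per-user personalization.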

Chaining prompts for complex workflows

Chaining involves breaking down a complex task into a sequence of simpler prompts, allowing for better debugging and improved control. For example, a customer service response can be generated in stages: first extracting key issues from a customer review, then drafting an outline for a response, and finally writing the full professional reply. This modular approach allows engineers to identify and fix issues at each step, which is more manageable than debugging a single, monolithic prompt. While chaining can introduce latency, it offers significant advantages in controlling and refining complex AI workflows, facilitating more predictable and robust outcomes. Testing these chained workflows can be done manually or automated using platforms and LLM judges.
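The three-stage customer service example can be sketched as a chain where each prompt consumes the previous output. The model call is passed in as a plain function so the sketch runs without any API; fake_llm is a stand-in invented here, and the prompt wording is an assumption.

```python
from typing import Callable, Dict

LLM = Callable[[str], str]  # any function that maps a prompt to a completion


def run_chain(review: str, llm: LLM) -> Dict[str, str]:
    """Run three prompts in sequence, keeping each intermediate output for inspection."""
    issues = llm(f"List the key issues raised in this customer review:\n{review}")
    outline = llm(f"Draft a short outline for a reply that addresses these issues:\n{issues}")
    reply = llm(f"Write a professional reply that follows this outline:\n{outline}")
    return {"issues": issues, "outline": outline, "reply": reply}


def fake_llm(prompt: str) -> str:
    """Stand-in for a real model call: echoes which step it was asked to do."""
    first_line = prompt.splitlines()[0]
    return f"[output for: {first_line}]"


result = run_chain("The blender arrived cracked and support never answered.", fake_llm)
for step, output in result.items():
    print(step, "->", output)
```

Because every intermediate result is returned, a bad final reply can be traced to the step that produced it, and each sub-prompt can be evaluated or rewritten on its own.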

Retrieval Augmented Generation (RAG) for factual grounding

RAG addresses LLM limitations by integrating external knowledge sources, such as documents or databases. The process involves embedding documents into a vector database, then using the user's query (also embedded) to retrieve the most relevant passages. These retrieved passages are then added to the LLM's prompt, grounding its response in factual, up-to-date information. This significantly reduces hallucinations and allows for sourcing, which is critical in fields like medicine or law. Techniques like chunking and hypothetical document embeddings (HyDE) improve RAG's effectiveness for handling very large documents or queries that don't perfectly match document phrasing, making LLM outputs more reliable and verifiable.
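The retrieve-then-prompt loop can be sketched with a toy embedding. Here a bag-of-words count vector and cosine similarity stand in for a learned embedding model and a vector database, which is a deliberate simplification; the passages, function names, and prompt wording are invented for the sketch.

```python
import math
import re
from collections import Counter
from typing import List, Tuple


def embed(text: str) -> Counter:
    """Toy embedding: word counts. A real system would use a learned embedding model."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine(a: Counter, b: Counter) -> float:
    dot = sum(a[t] * b[t] for t in a)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


def retrieve(query: str, passages: List[str], k: int = 2) -> List[Tuple[int, str]]:
    """Return the k passages most similar to the query, with their original indices."""
    q = embed(query)
    ranked = sorted(enumerate(passages), key=lambda p: cosine(q, embed(p[1])), reverse=True)
    return ranked[:k]


def build_rag_prompt(query: str, passages: List[str], k: int = 2) -> str:
    """Put the retrieved passages into the prompt so the answer can cite them."""
    context = "\n".join(f"[{i}] {text}" for i, text in retrieve(query, passages, k))
    return (
        "Answer using only the sources below and cite them by number.\n"
        f"Sources:\n{context}\n"
        f"Question: {query}"
    )


# Hypothetical knowledge base, invented for illustration.
PASSAGES = [
    "Refunds are issued within 14 days of a return request.",
    "Our office is closed on public holidays.",
    "Return requests must include the original order number.",
]
rag_prompt = build_rag_prompt("How long do refunds take after a return request?", PASSAGES)
print(rag_prompt)
```

Numbering the sources in the prompt is one simple way to let the model point back to the passage it relied on.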

Agentic AI workflows: Autonomous multi-step tasks

Agentic AI workflows, a term coined by Andrew Ng, extend LLMs from performing single tasks to handling complex, multi-step processes autonomously. They combine prompts, memory systems for context (working and archival), and tools such as APIs to achieve goals. An example is a travel booking agent that can search for flights, book hotels, and create an itinerary, interacting with users and external services. Agents can operate with varying degrees of autonomy, from hardcoded steps to fully autonomous decision-making, potentially even writing code. Protocols such as MCP (Model Context Protocol) give agents a standard way to interact with external services.
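A minimal agent loop for the travel example might look like the sketch below. The tools are plain functions standing in for external APIs, and scripted_planner stands in for the LLM that would normally choose the next action; both are assumptions made so the sketch runs offline. Only working memory is modeled, not archival memory.

```python
from typing import Callable, Dict, List, Tuple


# Hypothetical tools: in a real agent these would call external APIs.
def search_flights(destination: str) -> str:
    return f"flight to {destination} found"


def book_hotel(destination: str) -> str:
    return f"hotel in {destination} booked"


TOOLS: Dict[str, Callable[[str], str]] = {
    "search_flights": search_flights,
    "book_hotel": book_hotel,
}

Action = Tuple[str, str]  # (tool name or "finish", argument)
Planner = Callable[[str, List[str]], Action]


def scripted_planner(goal: str, working_memory: List[str]) -> Action:
    """Stand-in for the LLM: picks the next action from what has been done so far."""
    if not any("flight" in m for m in working_memory):
        return ("search_flights", goal)
    if not any("hotel" in m for m in working_memory):
        return ("book_hotel", goal)
    return ("finish", "itinerary ready")


def run_agent(goal: str, planner: Planner, max_steps: int = 5) -> List[str]:
    """Loop: ask the planner for an action, run the tool, record the observation."""
    working_memory: List[str] = []
    for _ in range(max_steps):  # hard cap so a confused planner cannot loop forever
        tool, arg = planner(goal, working_memory)
        if tool == "finish":
            working_memory.append(arg)
            break
        working_memory.append(f"{tool} -> {TOOLS[tool](arg)}")
    return working_memory


trace = run_agent("Lisbon", scripted_planner)
for entry in trace:
    print(entry)
```

Replacing scripted_planner with a model call moves the loop along the autonomy spectrum described above, from hardcoded steps toward model-driven decisions.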

Evaluating and improving agentic systems

Assessing the performance of agentic AI systems is crucial. This involves a mix of quantitative and qualitative evaluations. Quantitative metrics include success rates for tasks (e.g., address updates), latency, and cost. Qualitative assessments, often involving human review or LLM judges, are vital for subjective aspects like politeness, tone, and overall user satisfaction. Error analysis, distinguishing between objective errors (e.g., incorrect order ID lookup) and subjective preferences (e.g., direct vs. indirect flights), helps pinpoint areas for improvement. LLM traces are essential for debugging complex agentic workflows by providing visibility into the sequence of prompts and tool calls.
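The mix of quantitative and qualitative evaluation can be sketched as follows. The Run record, the sample data, and the keyword-based judge_politeness are all invented for illustration; in practice the judge would be a model prompted with a rubric rather than a keyword count.

```python
from dataclasses import dataclass
from statistics import mean
from typing import Callable, Dict, List


@dataclass
class Run:
    expected_address: str
    stored_address: str
    reply: str
    latency_s: float


def success_rate(runs: List[Run]) -> float:
    """Quantitative metric: fraction of runs where the stored address is correct."""
    return sum(r.expected_address == r.stored_address for r in runs) / len(runs)


def judge_politeness(reply: str) -> int:
    """Stand-in for an LLM judge. Scores 1 to 5 from a crude keyword count."""
    markers = ("please", "thank", "sorry")
    hits = sum(marker in reply.lower() for marker in markers)
    return min(5, 1 + 2 * hits)


def evaluate(runs: List[Run], judge: Callable[[str], int]) -> Dict[str, float]:
    return {
        "success_rate": success_rate(runs),
        "mean_latency_s": mean([r.latency_s for r in runs]),
        "mean_politeness": mean([judge(r.reply) for r in runs]),
    }


# Hypothetical logged runs, invented for illustration.
RUNS = [
    Run("12 Oak St", "12 Oak St", "Thank you, your address is updated.", 1.2),
    Run("9 Elm Rd", "9 Elm Road", "Done.", 0.8),
    Run("4 Pine Ave", "4 Pine Ave", "Sorry for the delay, and thank you for waiting.", 2.0),
    Run("7 Lake Dr", "7 Lake Dr", "Address changed.", 1.0),
]
results = evaluate(RUNS, judge_politeness)
print(results)
```

Keeping the objective check (did the address match?) separate from the judged score makes it easier to tell objective errors from subjective preferences during error analysis.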

Multi-agent systems and future trends

Multi-agent systems involve multiple specialized agents working collaboratively, often in parallel, to achieve complex goals, which can improve efficiency and reusability. For example, a smart home system might have agents for climate control, security, and energy management, orchestrated by a central agent. Future trends in AI include architectural search beyond current transformer models to reduce compute reliance, advancements in multimodality where systems learn from text, images, audio, and video synergistically to improve overall understanding, and the harmonious integration of various learning paradigms. The rapid pace of AI development means a focus on breadth of understanding and the ability to quickly dive deep into specific techniques is key, as the half-life of specialized skills is low.
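The smart home example can be sketched as a central orchestrator that fans one request out to specialized agents in parallel. Each agent here is a plain function returning a canned string, standing in for a full agent with its own prompts and tools; the names and replies are invented for the sketch.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, Dict


# Hypothetical specialized agents, each a stand-in for a full agent.
def climate_agent(request: str) -> str:
    return "climate: set thermostat to eco mode"


def security_agent(request: str) -> str:
    return "security: arm door and window sensors"


def energy_agent(request: str) -> str:
    return "energy: switch off standby devices"


AGENTS: Dict[str, Callable[[str], str]] = {
    "climate": climate_agent,
    "security": security_agent,
    "energy": energy_agent,
}


def orchestrate(request: str, agents: Dict[str, Callable[[str], str]]) -> Dict[str, str]:
    """Send the same request to every agent in parallel and collect the replies by name."""
    with ThreadPoolExecutor(max_workers=len(agents)) as pool:
        futures = {name: pool.submit(agent, request) for name, agent in agents.items()}
        return {name: future.result() for name, future in futures.items()}


report = orchestrate("Everyone is leaving the house for a week.", AGENTS)
for name, reply in report.items():
    print(name, "->", reply)
```

Threads suit this sketch because real agents would spend most of their time waiting on model and API calls; each agent function can also be reused by other orchestrators unchanged.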

Common Questions

Base LLMs often lack domain-specific knowledge, struggle with current information (cutoff dates), can output inconsistent styles, and have limited context windows, making them prone to knowledge gaps and hallucinations without external augmentation. They also struggle to provide sources for their information.

Topics

AI & Machine Learning, Programming & Software, Business & Entrepreneurship, Prompt Engineering, Multi-Agent Systems, AI Architecture, Retrieval Augmented Generation (RAG), LLM Applications, LLM Optimization, Fine-tuning Models, Agentic AI Workflows, AI Evaluation (Evals)

Mentioned in this video

Concepts

CS230

The Stanford Deep Learning course where this lecture is given.

GANs

Generative Adversarial Networks, mentioned in the context of distribution shifts between training data and real-world data.

Needle in a Haystack

An AI benchmark problem designed to test an LLM's ability to retrieve a specific fact from a very large text corpus.

Chain of Thought

A popular prompting method where the model is instructed to think step-by-step to improve performance and control.

Hypothetical Document Embeddings

A technique to improve RAGs where a user query is used to generate a fake, hallucinated document, which is then embedded to find closer vector matches in the database.
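A toy comparison of direct search and HyDE is sketched below. Word-count vectors stand in for learned embeddings, and fake_generate stands in for the LLM that writes the hypothetical document (it ignores its input and returns a fixed passage); the documents and the query are invented for illustration.

```python
import math
import re
from collections import Counter
from typing import Callable, List


def embed(text: str) -> Counter:
    """Toy embedding: word counts. A real system would use a learned embedding model."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine(a: Counter, b: Counter) -> float:
    dot = sum(a[t] * b[t] for t in a)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


def search(query_vector: Counter, documents: List[str]) -> int:
    """Return the index of the document most similar to the query vector."""
    scores = [cosine(query_vector, embed(d)) for d in documents]
    return scores.index(max(scores))


def hyde_search(query: str, documents: List[str], generate: Callable[[str], str]) -> int:
    """Embed a generated hypothetical answer instead of the raw query."""
    hypothetical = generate(query)  # may contain made-up facts; only its wording matters
    return search(embed(hypothetical), documents)


def fake_generate(query: str) -> str:
    """Stand-in for an LLM writing a plausible answer passage (ignores the query here)."""
    return ("To reset your password, open account settings, choose security, "
            "and request a reset link by email.")


DOCS = [
    "Billing: invoices are sent by email on the first day of each month.",
    "Security: in account settings you can request a password reset link by email.",
]
QUERY = "Locked out of my profile"

print("direct:", search(embed(QUERY), DOCS))
print("hyde:  ", hyde_search(QUERY, DOCS, fake_generate))
```

In this toy setup the short query shares almost no wording with the relevant document, while the hypothetical passage does, which is the mismatch HyDE is meant to address.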

Reinforcement Learning

A type of machine learning where an agent learns to make decisions by performing actions in an environment to maximize a reward. The word 'agent' has a specific definition in RL that is separate from its use in agentic workflows.

Software & Apps

GPT-3.5 Turbo

A pre-trained large language model (LLM) mentioned as an example of a base model with limitations.

GPT-4o

A pre-trained large language model (LLM) from OpenAI, used as an example of an improving base model.

Grok

An LLM from Elon Musk's xAI, mentioned in the context of political bias debates and as an alternative LLM for evaluation.

GPT-5

Upcoming LLM from OpenAI, mentioned as the next iteration of foundation models expected to improve performance, possibly by packaging other models.

Promptfoo

A platform used by the instructor's team to automate part of prompt testing, enabling running prompts on different LLMs and using LLM judges.

Llama

An open-source LLM by Meta, mentioned as an alternative model to compare against GPT-4 and Grok for politeness evaluations.

ChatGPT

OpenAI's chatbot interface, mentioned as a potential user of hidden system prompts for user interaction.

People

Donald Trump

Former US President, whose 'covfefe' tweet is an example of an LLM not being up-to-date with new words/trends.

Elon Musk

Mentioned in a debate with Sam Altman about the political bias of their respective LLMs, Grok and OpenAI models.

Sam Altman

Co-founder of OpenAI, mentioned in a debate with Elon Musk about the political bias of their LLMs.

Andrew Ng

Coined the term 'agentic AI workflows' to bring clarity to the diverse interpretations of 'agents' within the industry.

Ilya Sutskever

OpenAI co-founder, who raised the question about whether LLMs are plateauing in their performance improvements.

Ross Lazerowitz

Provided an example of Slack fine-tuning gone wrong, where a model trained on company Slack messages started acting too human.

Isaac Asimov

Author of the 'Foundation' series, whose work is referenced to illustrate how individuals can have a tremendous impact on the future through their decisions, similar to the discovery of transformers in AI.

Companies

Microsoft

The company that created the Tay Twitter bot, which quickly became racist, highlighting the difficulty in controlling LLMs.

OpenAI

The company developing LLMs like GPT-4, mentioned in the context of control issues with their models and prompt templates.

Anthropic

The company that coined the term 'MCP' (Model Context Protocol) to describe a system that simplifies LLM communication with endpoints.

GitHub

Online platform where prompt repositories, such as "awesome prompt template repo," can be found for free, offering examples of effective prompts.

Organizations

Harvard Business School

Co-authored a study with UPenn (Wharton) on consultant performance with AI access and prompt engineering training.

Wharton School

Involved in a study on BCG consultants and AI usage, contributing insights on prompt engineering's impact.

BCG

Consultants from BCG were part of a study evaluating AI's impact on human performance with different levels of prompt engineering training.

Books

Foundation series

A science fiction series by Isaac Asimov, mentioned to illustrate the concept of individuals having a major impact on the future, analogized to the discovery of critical AI architectures like transformers.
