Stanford AA228 Decision Making Under Uncertainty | Autumn 2025 | Offline Belief State Planning

Stanford Online | Feb 25, 2026

Key Moments

TL;DR

Offline POMDP solutions: QMDP, the Fast Informed Bound, PBVI, and belief expansion for scalable planning.

Key Insights

1

Exact POMDP solutions are intractable; offline approximate methods (QMDP, FIB, PBVI) offer scalable alternatives.

2

QMDP reduces POMDPs to a weighted interpolation of state-action values (alpha vectors) using the belief over states, yielding an upper bound.

3

Fast Informed Bound tightens the QMDP upper bound by partially accounting for the observation model, at higher computational cost.

4

Point-Based Value Iteration (PBVI) and Randomized PBVI provide lower bounds by focusing value updates on a finite set of belief points and their associated alpha vectors.

5

Belief point selection (random and exploratory) is crucial to cover reachable and informative regions of belief space efficiently.

6

Limitations: QMDP often ignores information-seeking actions, and bounds depend on initialization; PBVI and FIB trade accuracy for computation.

INTRODUCTION TO OFFLINE POMDP SOLUTIONS

The lecture starts by acknowledging the intractability of exact POMDP solutions for realistic problems. To deploy decision making under uncertainty, we compute policies offline and execute them online, avoiding expensive real-time computation. The instructor uses familiar examples like aircraft collision avoidance and the crying baby to illustrate the shift from fully observable MDPs to partially observable settings, where the state is hidden and we must rely on belief states. The talk contrasts exact methods (which explode with horizon and state space) with practical offline approximations that provide usable policies with measurable guarantees.

QMDP AND ALPHA VECTORS

QMDP is introduced as a pragmatic way to leverage MDP methods in a POMDP setting. The idea is to pretend our problem is fully observable, solve an MDP to obtain a state-action value function (Q-values) or alpha vectors, and then compute a decision by weighting these Q-values by the current belief over states. The action with the highest weighted value is chosen. This yields a policy that is simple to compute offline and then fixed during execution. Alpha vectors are used to represent, for each action, the expected utility per possible true state; decisions come from belief-weighted sums rather than true state values.
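The QMDP recipe described above can be sketched in a few lines of NumPy. This is a minimal illustration, not the lecture's reference implementation; the tensor shapes (`T` as an action-by-state-by-next-state transition tensor, `R` as a state-by-action reward matrix) are assumptions chosen for readability.

```python
import numpy as np

def qmdp_alpha_vectors(T, R, gamma, n_iters=100):
    """Value-iterate on the fully observable MDP to get one alpha vector
    per action: Q[a][s] is the state-action value Q(s, a).
    T: (A, S, S) transition tensor T[a, s, s'] = T(s' | s, a)
    R: (S, A) reward matrix."""
    A, S, _ = T.shape
    Q = np.zeros((A, S))
    for _ in range(n_iters):
        V = Q.max(axis=0)           # best value per state
        Q = R.T + gamma * T @ V     # Q(s,a) = R(s,a) + gamma * sum_s' T(s'|s,a) V(s')
    return Q

def qmdp_action(Q, belief):
    """Pick the action whose alpha vector has the highest belief-weighted value."""
    return int(np.argmax(Q @ belief))
```

For example, with two states and two actions, a belief concentrated on the "bad" state can select a different action than a uniform belief, even though the alpha vectors themselves were computed fully offline.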

UPPER BOUNDS: QMDP AND FAST INFORMED BOUND

A key property of QMDP is that it provides an upper bound on the true POMDP value function, since it assumes perfect knowledge of the next state. To tighten this bound, the Fast Informed Bound (FIB) incorporates the observation model, summing over possible observations the agent could receive next. FIB is more computationally intensive than QMDP but often yields a significantly tighter bound. The practical upshot is we obtain computable upper bounds that help gauge how close our offline policies might be to the best possible policy under uncertainty.
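As a rough sketch of how FIB's extra sum over observations enters the update, the standard FIB backup replaces QMDP's per-state max with a max taken inside an observation sum: alpha_a(s) = R(s,a) + gamma * sum_o max_a' sum_s' O(o|a,s') T(s'|s,a) alpha_a'(s'). The code below assumes the same `T`/`R` layout as above plus an `O[a, s', o]` observation tensor; it is an illustration, not the lecture's implementation.

```python
import numpy as np

def fib_alpha_vectors(T, R, O, gamma, n_iters=100):
    """Fast Informed Bound: like QMDP, but the max over next actions sits
    inside a sum over observations, yielding a tighter upper bound.
    T: (A, S, S), R: (S, A), O: (A, S', n_obs) with O[a, s', o] = O(o | a, s')."""
    A, S, _ = T.shape
    n_obs = O.shape[2]
    alpha = np.zeros((A, S))
    for _ in range(n_iters):
        new = np.empty_like(alpha)
        for a in range(A):
            acc = np.zeros(S)
            for o in range(n_obs):
                # weighted[s, s'] = T(s'|s,a) * O(o|a,s')
                weighted = T[a] * O[a, :, o]
                # for each s, best next action given this observation
                acc += (weighted @ alpha.T).max(axis=1)
            new[a] = R[:, a] + gamma * acc
        alpha = new
    return alpha
```

A quick sanity check: with a fully informative observation model (observations reveal the next state exactly), the FIB update collapses to the QMDP update, so the two bounds coincide.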

LOWER BOUNDS: PBVI AND RANDOMIZED PBVI

To complement upper bounds, the lecture covers lower-bounding approaches. Point-Based Value Iteration (PBVI) maintains a finite set of alpha vectors, each associated with a belief point, and iteratively performs backups to improve the value estimate at those points. Randomized PBVI (RPBVI) reduces computational cost by updating a random subset of belief points per iteration, trading some tightness for speed. Initialization typically uses a lower bound (the best-action, worst-state value) so the method improves conservatively toward the true value function.
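The core of PBVI is the point-based backup at a single belief: project each alpha vector through every action-observation pair, pick the projection that is best at that belief, and assemble a new alpha vector. A hedged sketch under the same `T`/`R`/`O` shape assumptions as before (not the lecture's code):

```python
import numpy as np

def pbvi_backup(b, Gamma, T, R, O, gamma):
    """One point-based backup at belief b.
    Gamma: (K, S) current alpha vectors; returns the new alpha vector for b.
    T: (A, S, S), R: (S, A), O: (A, S', n_obs)."""
    A, S, _ = T.shape
    n_obs = O.shape[2]
    best_alpha, best_val = None, -np.inf
    for a in range(A):
        alpha_a = R[:, a].astype(float).copy()
        for o in range(n_obs):
            # proj[k, s] = sum_s' O(o|a,s') T(s'|s,a) Gamma[k, s']
            proj = Gamma @ (T[a] * O[a, :, o]).T
            k_best = np.argmax(proj @ b)   # alpha vector maximizing value at b
            alpha_a += gamma * proj[k_best]
        val = alpha_a @ b
        if val > best_val:
            best_alpha, best_val = alpha_a, val
    return best_alpha
```

Repeating this backup over a fixed belief set monotonically improves the lower bound at those points; with fully informative observations and corner beliefs it reduces to ordinary value iteration.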

BELIEF POINT SELECTION: RANDOM AND EXPLORATORY EXPANSION

Since PBVI and RPBVI rely on a finite set of belief points, selecting informative and reachable beliefs is crucial. Random belief expansion grows the set by simulating random actions from current beliefs, ensuring new points are reachable. Exploratory belief expansion aims to spread points across the belief space by choosing successors that maximize distance from the current set, promoting coverage of diverse scenarios. These strategies balance tractability with the desire to approximate the value function where it matters most for real-world decisions.
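Both expansion strategies can be sketched in one routine: simulate a step from each belief, Bayes-update, and either keep a random successor or the one farthest (here in L1 distance) from the existing set. Shapes and the distance choice are assumptions for illustration, not the lecture's specific implementation.

```python
import numpy as np

def belief_update(b, a, o, T, O):
    """Bayesian belief update after taking action a and observing o."""
    bp = O[a, :, o] * (T[a].T @ b)   # O(o|a,s') * sum_s T(s'|s,a) b(s)
    total = bp.sum()
    return bp / total if total > 0 else b

def expand_beliefs(B, T, O, rng, exploratory=True):
    """Grow the belief set. From each belief, simulate one step per action
    and keep either the successor farthest from the current set (exploratory)
    or the successor of a single random action (random expansion)."""
    A, S, _ = T.shape
    new = []
    for b in B:
        candidates = []
        for a in (range(A) if exploratory else [rng.integers(A)]):
            s = rng.choice(S, p=b)                   # sample a true state
            sp = rng.choice(S, p=T[a, s])            # sample a next state
            o = rng.choice(O.shape[2], p=O[a, sp])   # sample an observation
            candidates.append(belief_update(b, a, o, T, O))
        # keep the candidate farthest from any belief already in the set
        dist = lambda c: min(np.abs(c - bb).sum() for bb in B + new)
        new.append(max(candidates, key=dist))
    return B + new
```

Because every new point is generated by simulating the dynamics, the expanded set stays within the reachable region of belief space, which is exactly the property the lecture emphasizes.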

TAKEAWAYS, LIMITATIONS, AND CONTEXT FOR ONLINE METHODS

The session ties offline methods to practical outcomes: QMDP and FIB yield upper bounds; PBVI and RPBVI yield lower bounds; together they frame what a controller can safely guarantee. The methods were discussed in the context of safety-critical systems (e.g., ACAS X collision avoidance) and larger design pipelines where verification is vital. The main caveats include QMDP's potential neglect of information-gathering actions and the dependence of bounds on initialization and belief point selection. The next lecture promises exploration of online methods that exploit these bounds for real-time adaptation.

Descriptive Cheat Sheet: Practical PBVI/QMDP Toolkit

Practical takeaways from this episode

Do This

Use QMDP to get a quick, computable upper bound on the value function when you want a fast baseline.
Use Fast Informed Bound (FIB) to obtain a tighter upper bound by incorporating the observation model.
Use PBVI or randomized PBVI to obtain lower bounds for offline policy approximation.
Choose belief points that are reachable from the initial belief; spread them to cover the belief space better.
Perform backups at beliefs to progressively improve the alpha-vectors that represent the policy.
Be mindful of information-gathering limitations: QMDP can overestimate if future observations are not properly modeled.

Avoid This

Don't rely on QMDP when information gathering is crucial; it often does not incentivize exploratory actions.
Don't assume convergence guarantees from a single upper-bound method; use both upper and lower bounds to gauge gap.
Don't omit pruning of dominated alpha-vectors in PBVI/randomized PBVI; it helps keep computation tractable.
Don't overfit belief-point sets to a small region of the belief space; ensure some spread for better generalization.
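On the pruning point above: exact pruning of dominated alpha vectors requires solving a linear program per vector, but a cheap pointwise-dominance pass already removes many of them. A minimal sketch (an illustration, not a specific library routine):

```python
import numpy as np

def prune_dominated(Gamma):
    """Drop alpha vectors that are pointwise dominated by another vector.
    Pointwise dominance is a cheap first pass; exact pruning uses an LP."""
    keep = []
    for i, a in enumerate(Gamma):
        dominated = any(
            j != i and np.all(Gamma[j] >= a) and np.any(Gamma[j] > a)
            for j in range(len(Gamma))
        )
        if not dominated:
            keep.append(a)
    return np.array(keep)
```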

Common Questions

How does QMDP compute an approximate policy?

QMDP treats the POMDP as if the environment were an MDP, computing traditional Q-values for each state-action pair and then weighting those values by the current belief over states. The action with the highest weighted Q-value is chosen. This yields a simple, offline, approximate policy and provides an upper bound on the true POMDP value, because it assumes all state uncertainty vanishes after the next step; a side effect is that it places no value on information-gathering actions.
