Stanford AA228 Decision Making Under Uncertainty | Autumn 2025 | Offline Belief State Planning

Stanford Online | Feb 25, 2026

Key Moments

TL;DR

Offline POMDP solutions: QMDP, the Fast Informed Bound, PBVI, and belief expansion for scalable planning.

Key Insights

1

Exact POMDP solutions are intractable; offline approximate methods (QMDP, FIB, PBVI) offer scalable alternatives.

2

QMDP reduces POMDPs to a weighted interpolation of state-action values (alpha vectors) using the belief over states, yielding an upper bound.

3

Fast Informed Bound tightens the QMDP upper bound by partially accounting for the observation model, at higher computational cost.

4

Point-Based Value Iteration (PBVI) and Randomized PBVI provide lower bounds by focusing value updates on a finite set of belief points and their associated alpha vectors.

5

Belief point selection (random and exploratory) is crucial to cover reachable and informative regions of belief space efficiently.

6

Limitations: QMDP often ignores information-seeking actions, and bounds depend on initialization; PBVI and FIB trade accuracy for computation.

INTRODUCTION TO OFFLINE POMDP SOLUTIONS

The lecture starts by acknowledging the intractability of exact POMDP solutions for realistic problems. To deploy decision making under uncertainty, we compute policies offline and execute them online, avoiding expensive real-time computation. The instructor uses familiar examples like aircraft collision avoidance and the crying baby to illustrate the shift from fully observable MDPs to partially observable settings, where the state is hidden and we must rely on belief states. The talk contrasts exact methods (which explode with horizon and state space) with practical offline approximations that provide usable policies with measurable guarantees.

QMDP AND ALPHA VECTORS

QMDP is introduced as a pragmatic way to leverage MDP methods in a POMDP setting. The idea is to pretend our problem is fully observable, solve an MDP to obtain a state-action value function (Q-values) or alpha vectors, and then compute a decision by weighting these Q-values by the current belief over states. The action with the highest weighted value is chosen. This yields a policy that is simple to compute offline and then fixed during execution. Alpha vectors are used to represent, for each action, the expected utility per possible true state; decisions come from belief-weighted sums rather than true state values.
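The QMDP recipe described above can be sketched in a few lines of NumPy. This is a minimal illustration, not the lecture's reference implementation; the tensor shapes (`T` as an action-by-state-by-next-state transition tensor, `R` as a state-by-action reward matrix) are assumptions chosen for readability.

```python
import numpy as np

def qmdp_alpha_vectors(T, R, gamma, n_iters=100):
    """Value-iterate on the fully observable MDP to get one alpha vector
    per action: Q[a][s] is the state-action value Q(s, a).
    T: (A, S, S) transition tensor T[a, s, s'] = T(s' | s, a)
    R: (S, A) reward matrix."""
    A, S, _ = T.shape
    Q = np.zeros((A, S))
    for _ in range(n_iters):
        V = Q.max(axis=0)           # best value per state
        Q = R.T + gamma * T @ V     # Q(s,a) = R(s,a) + gamma * sum_s' T(s'|s,a) V(s')
    return Q

def qmdp_action(Q, belief):
    """Pick the action whose alpha vector has the highest belief-weighted value."""
    return int(np.argmax(Q @ belief))
```

For example, with two states and two actions, a belief concentrated on the "bad" state can select a different action than a uniform belief, even though the alpha vectors themselves were computed fully offline.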

UPPER BOUNDS: QMDP AND FAST INFORMED BOUND

A key property of QMDP is that it provides an upper bound on the true POMDP value function, since it assumes perfect knowledge of the next state. To tighten this bound, the Fast Informed Bound (FIB) incorporates the observation model, summing over possible observations the agent could receive next. FIB is more computationally intensive than QMDP but often yields a significantly tighter bound. The practical upshot is we obtain computable upper bounds that help gauge how close our offline policies might be to the best possible policy under uncertainty.
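As a rough sketch of how FIB's extra sum over observations enters the update, the standard FIB backup replaces QMDP's per-state max with a max taken inside an observation sum: alpha_a(s) = R(s,a) + gamma * sum_o max_a' sum_s' O(o|a,s') T(s'|s,a) alpha_a'(s'). The code below assumes the same `T`/`R` layout as above plus an `O[a, s', o]` observation tensor; it is an illustration, not the lecture's implementation.

```python
import numpy as np

def fib_alpha_vectors(T, R, O, gamma, n_iters=100):
    """Fast Informed Bound: like QMDP, but the max over next actions sits
    inside a sum over observations, yielding a tighter upper bound.
    T: (A, S, S), R: (S, A), O: (A, S', n_obs) with O[a, s', o] = O(o | a, s')."""
    A, S, _ = T.shape
    n_obs = O.shape[2]
    alpha = np.zeros((A, S))
    for _ in range(n_iters):
        new = np.empty_like(alpha)
        for a in range(A):
            acc = np.zeros(S)
            for o in range(n_obs):
                # weighted[s, s'] = T(s'|s,a) * O(o|a,s')
                weighted = T[a] * O[a, :, o]
                # for each s, best next action given this observation
                acc += (weighted @ alpha.T).max(axis=1)
            new[a] = R[:, a] + gamma * acc
        alpha = new
    return alpha
```

A quick sanity check: with a fully informative observation model (observations reveal the next state exactly), the FIB update collapses to the QMDP update, so the two bounds coincide.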

LOWER BOUNDS: PBVI AND RANDOMIZED PBVI

To complement upper bounds, the lecture covers lower-bounding approaches. Point-Based Value Iteration (PBVI) maintains a finite set of alpha vectors, each associated with a belief point, and iteratively performs backups to improve the value estimate at those points. Randomized PBVI (RPBVI) reduces computational cost by updating a random subset of belief points per iteration, trading some tightness for speed. Initialization typically uses a lower bound (the best-action, worst-state value) so the method improves conservatively toward the true value function.
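The core of PBVI is the point-based backup at a single belief: project each alpha vector through every action-observation pair, pick the projection that is best at that belief, and assemble a new alpha vector. A hedged sketch under the same `T`/`R`/`O` shape assumptions as before (not the lecture's code):

```python
import numpy as np

def pbvi_backup(b, Gamma, T, R, O, gamma):
    """One point-based backup at belief b.
    Gamma: (K, S) current alpha vectors; returns the new alpha vector for b.
    T: (A, S, S), R: (S, A), O: (A, S', n_obs)."""
    A, S, _ = T.shape
    n_obs = O.shape[2]
    best_alpha, best_val = None, -np.inf
    for a in range(A):
        alpha_a = R[:, a].astype(float).copy()
        for o in range(n_obs):
            # proj[k, s] = sum_s' O(o|a,s') T(s'|s,a) Gamma[k, s']
            proj = Gamma @ (T[a] * O[a, :, o]).T
            k_best = np.argmax(proj @ b)   # alpha vector maximizing value at b
            alpha_a += gamma * proj[k_best]
        val = alpha_a @ b
        if val > best_val:
            best_alpha, best_val = alpha_a, val
    return best_alpha
```

Repeating this backup over a fixed belief set monotonically improves the lower bound at those points; with fully informative observations and corner beliefs it reduces to ordinary value iteration.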

BELIEF POINT SELECTION: RANDOM AND EXPLORATORY EXPANSION

Since PBVI and RPBVI rely on a finite set of belief points, selecting informative and reachable beliefs is crucial. Random belief expansion grows the set by simulating random actions from current beliefs, ensuring new points are reachable. Exploratory belief expansion aims to spread points across the belief space by choosing successors that maximize distance from the current set, promoting coverage of diverse scenarios. These strategies balance tractability with the desire to approximate the value function where it matters most for real-world decisions.
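Both expansion strategies can be sketched in one routine: simulate a step from each belief, Bayes-update, and either keep a random successor or the one farthest (here in L1 distance) from the existing set. Shapes and the distance choice are assumptions for illustration, not the lecture's specific implementation.

```python
import numpy as np

def belief_update(b, a, o, T, O):
    """Bayesian belief update after taking action a and observing o."""
    bp = O[a, :, o] * (T[a].T @ b)   # O(o|a,s') * sum_s T(s'|s,a) b(s)
    total = bp.sum()
    return bp / total if total > 0 else b

def expand_beliefs(B, T, O, rng, exploratory=True):
    """Grow the belief set. From each belief, simulate one step per action
    and keep either the successor farthest from the current set (exploratory)
    or the successor of a single random action (random expansion)."""
    A, S, _ = T.shape
    new = []
    for b in B:
        candidates = []
        for a in (range(A) if exploratory else [rng.integers(A)]):
            s = rng.choice(S, p=b)                   # sample a true state
            sp = rng.choice(S, p=T[a, s])            # sample a next state
            o = rng.choice(O.shape[2], p=O[a, sp])   # sample an observation
            candidates.append(belief_update(b, a, o, T, O))
        # keep the candidate farthest from any belief already in the set
        dist = lambda c: min(np.abs(c - bb).sum() for bb in B + new)
        new.append(max(candidates, key=dist))
    return B + new
```

Because every new point is generated by simulating the dynamics, the expanded set stays within the reachable region of belief space, which is exactly the property the lecture emphasizes.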

TAKEAWAYS, LIMITATIONS, AND CONTEXT FOR ONLINE METHODS

The session ties offline methods to practical outcomes: QMDP and FIB yield upper bounds; PBVI and RPBVI yield lower bounds; together they frame what a controller can safely guarantee. The methods were discussed in the context of safety-critical systems (e.g., ACAS X collision avoidance) and larger design pipelines where verification is vital. The main caveats include QMDP's potential neglect of information-gathering actions and the dependence of bounds on initialization and belief point selection. The next lecture promises exploration of online methods that exploit these bounds for real-time adaptation.

Descriptive Cheat Sheet: Practical PBVI/QMDP Toolkit

Practical takeaways from this episode

Do This

Use QMDP to get a quick, computable upper bound on the value function when you want a fast baseline.
Use Fast Informed Bound (FIB) to obtain a tighter upper bound by incorporating the observation model.
Use PBVI or randomized PBVI to obtain lower bounds for offline policy approximation.
Choose belief points that are reachable from the initial belief; spread them to cover the belief space better.
Perform backups at beliefs to progressively improve the alpha-vectors that represent the policy.
Be mindful of information-gathering limitations: QMDP can overestimate if future observations are not properly modeled.

Avoid This

Don't rely on QMDP when information gathering is crucial; it often does not incentivize exploratory actions.
Don't assume convergence guarantees from a single upper-bound method; use both upper and lower bounds to gauge gap.
Don't omit pruning of dominated alpha-vectors in PBVI/randomized PBVI; it helps keep computation tractable.
Don't overfit belief-point sets to a small region of the belief space; ensure some spread for better generalization.
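On the pruning point above: exact pruning of dominated alpha vectors requires solving a linear program per vector, but a cheap pointwise-dominance pass already removes many of them. A minimal sketch (an illustration, not a specific library routine):

```python
import numpy as np

def prune_dominated(Gamma):
    """Drop alpha vectors that are pointwise dominated by another vector.
    Pointwise dominance is a cheap first pass; exact pruning uses an LP."""
    keep = []
    for i, a in enumerate(Gamma):
        dominated = any(
            j != i and np.all(Gamma[j] >= a) and np.any(Gamma[j] > a)
            for j in range(len(Gamma))
        )
        if not dominated:
            keep.append(a)
    return np.array(keep)
```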

Common Questions

How does QMDP compute an approximate policy?

QMDP treats the POMDP as if the environment were an MDP, computing traditional Q-values for each state-action pair and then weighting those values by the current belief over states. The action with the highest weighted Q-value is chosen. This yields a simple, offline, approximate policy and provides an upper bound on the true POMDP value, because it assumes all state uncertainty vanishes after the next step; a side effect is that it places no value on information-gathering actions.
