Stanford AA228 Decision Making Under Uncertainty | Autumn 2025 | Offline Belief State Planning
Key Moments
Offline POMDPs: QMDP, fast bound, PBVI, and belief expansion for scalable planning.
Key Insights
Exact POMDP solutions are intractable; offline approximate methods (QMDP, FIB, PBVI) offer scalable alternatives.
QMDP reduces a POMDP to its underlying MDP, selecting actions via a belief-weighted combination of state-action values (alpha vectors); this yields an upper bound on the optimal value.
Fast Informed Bound tightens the QMDP upper bound by partially accounting for the observation model, at higher computational cost.
Point-Based Value Iteration (PBVI) and Randomized PBVI provide lower bounds by focusing updates on a finite set of reachable belief points.
Belief point selection (random and exploratory) is crucial to cover reachable and informative regions of belief space efficiently.
Limitations: QMDP often ignores information-seeking actions, and bounds depend on initialization; PBVI and FIB trade accuracy for computation.
INTRODUCTION TO OFFLINE POMDP SOLUTIONS
The lecture starts by acknowledging the intractability of exact POMDP solutions for realistic problems. To deploy decision making under uncertainty, we compute policies offline and execute them online, avoiding expensive real-time computation. The instructor uses familiar examples like aircraft collision avoidance and the crying baby to illustrate the shift from fully observable MDPs to partially observable settings, where the state is hidden and we must rely on belief states. The talk contrasts exact methods (which explode with horizon and state space) with practical offline approximations that provide usable policies with measurable guarantees.
QMDP AND ALPHA VECTORS
QMDP is introduced as a pragmatic way to leverage MDP methods in a POMDP setting. The idea is to pretend our problem is fully observable, solve an MDP to obtain a state-action value function (Q-values) or alpha vectors, and then compute a decision by weighting these Q-values by the current belief over states. The action with the highest weighted value is chosen. This yields a policy that is simple to compute offline and then fixed during execution. Alpha vectors are used to represent, for each action, the expected utility per possible true state; decisions come from belief-weighted sums rather than true state values.
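The belief-weighted action selection described above can be sketched in a few lines. This is a minimal illustration, not the lecture's code: the tabular transition tensor `T`, reward matrix `R`, and iteration count are assumptions for the sake of a runnable example.

```python
import numpy as np

def qmdp_alpha_vectors(T, R, gamma, n_iters=100):
    """Compute QMDP alpha vectors via value iteration on the underlying MDP.

    T: transitions, shape (A, S, S) with T[a, s, s'] = P(s' | s, a)
    R: rewards, shape (A, S)
    Returns Q of shape (A, S): one alpha vector per action.
    """
    A, S, _ = T.shape
    Q = np.zeros((A, S))
    for _ in range(n_iters):
        V = Q.max(axis=0)          # state values: best action per state
        Q = R + gamma * (T @ V)    # Bellman backup for each action
    return Q

def qmdp_action(Q, belief):
    """Pick the action whose alpha vector has the highest belief-weighted value."""
    return int(np.argmax(Q @ belief))
```

The offline work is entirely in `qmdp_alpha_vectors`; online execution is just the cheap dot product in `qmdp_action`.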
UPPER BOUNDS: QMDP AND FAST INFORMED BOUND
A key property of QMDP is that it provides an upper bound on the true POMDP value function, since it assumes perfect knowledge of the next state. To tighten this bound, the Fast Informed Bound (FIB) incorporates the observation model, summing over possible observations the agent could receive next. FIB is more computationally intensive than QMDP but often yields a significantly tighter bound. The practical upshot is we obtain computable upper bounds that help gauge how close our offline policies might be to the best possible policy under uncertainty.
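A sketch of the FIB backup may make the extra cost concrete: unlike QMDP, each backup sums over observations and takes the maximizing alpha vector inside that sum. The observation tensor `O` and the loop structure below are illustrative assumptions.

```python
import numpy as np

def fib_alpha_vectors(T, R, O, gamma, n_iters=100):
    """Fast Informed Bound backup (a sketch).

    T: (A, S, S) transitions; R: (A, S) rewards;
    O: (A, S, Z) with O[a, s', o] = P(o | a, s').
    Q[a,s] = R[a,s] + gamma * sum_o max_a' sum_s' O[a,s',o] T[a,s,s'] Q[a',s']
    """
    A, S, _ = T.shape
    Z = O.shape[2]
    Q = np.zeros((A, S))
    for _ in range(n_iters):
        Q_new = np.empty_like(Q)
        for a in range(A):
            acc = np.zeros(S)
            for o in range(Z):
                # M[s, s'] = P(s', o | s, a): transition weighted by observation
                M = T[a] * O[a, :, o][None, :]
                # best continuation alpha vector per state, given this (a, o)
                acc += (M @ Q.T).max(axis=1)
            Q_new[a] = R[a] + gamma * acc
        Q = Q_new
    return Q
```

Because the max over continuation vectors sits inside the observation sum (rather than inside the state sum, as in QMDP), FIB's bound is never looser than QMDP's.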
LOWER BOUNDS: PBVI AND RANDOMIZED PBVI
To complement upper bounds, the lecture covers lower-bounding approaches. Point-Based Value Iteration (PBVI) maintains a finite set of alpha vectors, each associated with a belief point, and iteratively performs backups to improve the value-function estimate at those points. Randomized PBVI (RPBVI) reduces computational cost by updating a random subset of belief points per iteration, trading some tightness for speed. Initialization typically uses a conservative lower bound (the best action applied from the worst state), so the method improves monotonically toward the true value from below.
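A single point-based backup at one belief can be sketched as follows; the data layout (`Gamma` as a matrix of alpha vectors, tensors `T`, `R`, `O`) is an assumption chosen for readability, not the lecture's exact formulation.

```python
import numpy as np

def pbvi_backup(b, Gamma, T, R, O, gamma):
    """One point-based backup at belief b (a sketch).

    b: belief, shape (S,); Gamma: current alpha vectors, shape (K, S).
    Returns the improved alpha vector (S,) for this belief point.
    """
    A, S, _ = T.shape
    Z = O.shape[2]
    best_alpha, best_val = None, -np.inf
    for a in range(A):
        alpha_a = R[a].astype(float).copy()
        for o in range(Z):
            # Back-project every alpha vector through (a, o):
            # beta[k, s] = sum_s' T[a,s,s'] O[a,s',o] Gamma[k,s']
            M = T[a] * O[a, :, o][None, :]
            beta = Gamma @ M.T
            # keep only the alpha vector that is best at this belief
            k = int(np.argmax(beta @ b))
            alpha_a += gamma * beta[k]
        val = alpha_a @ b
        if val > best_val:
            best_val, best_alpha = val, alpha_a
    return best_alpha
```

Running this backup over every belief in the set (or, for RPBVI, a random subset) and collecting the resulting vectors gives the next Gamma.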
BELIEF POINT SELECTION: RANDOM AND EXPLORATORY EXPANSION
Since PBVI and RPBVI rely on a finite set of belief points, selecting informative and reachable beliefs is crucial. Random belief expansion grows the set by simulating random actions from current beliefs, ensuring new points are reachable. Exploratory belief expansion aims to spread points across the belief space by choosing successor beliefs that maximize distance from the current set, promoting coverage of diverse scenarios. These strategies balance tractability with the desire to approximate the value function where it matters most for real-world decisions.
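The exploratory expansion idea can be sketched as below: simulate one step per action, Bayes-update the belief, and keep the candidate farthest (in L1 distance) from the existing set. The sampling scheme and distance metric are common choices but are assumptions here, not the lecture's exact procedure.

```python
import numpy as np

def belief_update(b, a, o, T, O):
    """Bayesian belief update: b'(s') ∝ O[a,s',o] * sum_s T[a,s,s'] b(s)."""
    bp = O[a, :, o] * (b @ T[a])
    total = bp.sum()
    return bp / total if total > 0 else b

def exploratory_expansion(B, T, O, rng):
    """Grow the belief set B (a sketch): for each belief, simulate one step
    under every action and keep the successor farthest from the current set."""
    A, S, _ = T.shape
    new_points = []
    for b in B:
        candidates = []
        for a in range(A):
            s = rng.choice(S, p=b)                   # sample a true state
            sp = rng.choice(S, p=T[a, s])            # sample a successor state
            o = rng.choice(O.shape[2], p=O[a, sp])   # sample an observation
            candidates.append(belief_update(b, a, o, T, O))
        # distance of each candidate to its nearest existing belief
        dists = [min(np.abs(c - bb).sum() for bb in B) for c in candidates]
        new_points.append(candidates[int(np.argmax(dists))])
    return B + new_points
```

Because every candidate is generated by simulating the model forward, each added point is reachable by construction, while the max-min distance rule pushes the set toward unexplored regions.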
TAKEAWAYS, LIMITATIONS, AND CONTEXT FOR ONLINE METHODS
The session ties offline methods to practical outcomes: QMDP and FIB yield upper bounds; PBVI and RPBVI yield lower bounds; together they frame what a controller can safely guarantee. The methods were discussed in the context of safety-critical systems (e.g., ACAS X collision avoidance) and larger design pipelines where verification is vital. The main caveats include QMDP’s potential neglect of information-gathering actions and the dependence of bounds on initialization and belief point selection. The next lecture promises exploration of online methods that exploit these bounds for real-time adaptation.
Common Questions
How does QMDP choose actions, and why is it an upper bound?
QMDP treats the POMDP as if the environment were an MDP by computing the traditional Q-values for each state-action pair and then weighting those values by the current belief over states. The action with the highest weighted Q-value is chosen. This yields a simple, offline, approximate policy and provides an upper bound on the true POMDP value because it assumes all state uncertainty vanishes after the next step, which also leads it to neglect information-gathering actions.
Mentioned in this video
●Speaker of the lecture; a graduate student/postdoc in Michael's lab.
●Point-Based Value Iteration (PBVI): offline method maintaining a set of alpha vectors associated with belief points to approximate the POMDP value function.
●QMDP: an offline POMDP method that selects actions by weighting MDP Q-values by the current belief.
●ACAS X: collision avoidance system discussed in the course; uses the QMDP method.
●Randomized PBVI: a variant of PBVI that updates a subset of belief points per iteration to reduce computation.
●An algorithm for smarter belief selection to improve PBVI-like offline planning; mentioned as related work.
●Fast Informed Bound (FIB): upper-bound technique that incorporates the observation model to tighten the bound beyond QMDP.