This paper is available on arxiv under CC 4.0 license.
Authors:
(1) Oded Blumenthal, Software and Information Systems Engineering, Ben Gurion University, Israel;
(2) Guy Shani, Software and Information Systems Engineering, Ben Gurion University, Israel.
Table of Links
- Introduction
- Background
- Related Work
- POMCP for Stochastic Contingent Planning
- Domain Independent Heuristics for POMCP
- Empirical Evaluation
- Conclusion and References
3 Related Work
Augmenting MCTS methods with heuristics in the context of fully observable MDPs was previously suggested. [14] describe an MCTS tree based approach that uses planning based heuristics. The PROST planner [13] uses a heuristic for estimating the value of states. The DP-UCT approach [26] use a planning heuristic based on deep learning for the rollout phase. We are not aware of previous attempts to adapt these approaches to POMDPs.
There were several extensions suggested for POMCP [16]. For example [17] considers dynamic environments, and [11] consider non-linear dynamics. All these methods rely on rollouts, and can hence leverage the rollout strategies that we suggest here.
POMCPOW [30] extends POMCP to the challenge of solving POMDPs with continuous state, action, and observation spaces. It constructs the search tree incrementally to explore additional regions of the observation and action spaces. It also requires a rollout policy to evaluate the utility of leaf nodes. Our rollout strategies relay on classical planning approaches which are discrete, and it is hence unlikely that our methods can be extended to continuous domains.
DESPOT [29, 32, 18] is an online POMDP solver based on tree search, similar to POMCP. DESPOT uses a different strategy than the UCB rule for constructing the tree, designed to avoid the overly greedy nature of POMCP exploration strategy. DESPOT also requires a so-called default policy to evaluate the utility of a leaf in the tree, and the authors stress the importance of a strong default policy to improve the convergence. Thus, our methods can be directly applied to DESPOT as well. We chose here to focus on POMCP rather than DESPOT, because POMCP is a simpler method, which allows us to better focus on the importance of the heuristic function, independent of the effect of the various augmentations that DESPOT adds on top of POMCP.
[22] suggest a method called PSG to evaluate the proximity of states to the goal. They suggest using PSG in several places within POMCP including rollouts. PSG assumes that states are defined in a factored manner using state features, and computes a function from features to the goal using subgoals. In essence, their approach can be considered as a type of heuristic, which is highly related to the concept of landmarks in classical planning [19]. Representing the POMDP as a stochastic contingent planning problem, as we do, allows us to use any heuristic developed in the planning community, and can hence be considered to be an extension of PSG.
Similarly, [31] also find it difficult to provide good rollout strategies to compute the value of POMCP leaves. Focusing on a robotics motion planning domain, they suggest SVE, state value estimator, that attempts to evaluate the utility of a state directly.
[15] also focus on the need to use heuristics for guiding search in POMDPs. They focus on RTDPBEL [4], an algorithm that runs forward trajectories in belief space to produce a policy. They show that using a heuristic can significantly improve RTDP-BEL. They use domain specific heuristics, and as such, our domain independent approach can also be applied to their methods.