Rollout Heuristics: Domain Independent Heuristics for POMCPby@heuristicsearch

Rollout Heuristics: Domain Independent Heuristics for POMCP

tldt arrow

Too Long; Didn't Read

In this paper, we model POMDPs as stochastic contingent planning problems. This allows us to leverage domain-independent heuristics.
featured image - Rollout Heuristics: Domain Independent Heuristics for POMCP
Aiding in the focused exploration of potential solutions. HackerNoon profile picture

This paper is available on arxiv under CC 4.0 license.


(1) Oded Blumenthal, Software and Information Systems Engineering, Ben Gurion University, Israel;

(2) Guy Shani, Software and Information Systems Engineering, Ben Gurion University, Israel.

5 Domain Independent Heuristics for POMCP

We now describe the main contribution of this paper — two domain independent rollout heuristics that leverage methods developed in the automated planning community, using the structure specified in the stochastic contingent planning problem.

5.1 Delete Relaxation Heuristics

Delete relaxation heuristics are built upon the notion that if actions have only positive effects, then the number of actions that can be executed before the state becomes fixed is finite, and in many cases, small. Also, as actions cannot destroy the precondition of other actions, one can execute actions in parallel. Algorithm 2 portrays a delete relaxation heuristic.

Delete relaxation heuristics create a layered graph, interleaving action and fact layers. The first layer, which is a fact layer, contains all the facts that hold in the state for which the heuristic is computed (line 2). The second layer, which is an action layer, contains all the actions whose preconditions hold given the facts in the first layer (line 5). The next layer, which is again a fact layer, contains all the positive effects of the actions in the previous layer, as well as all facts from the previous layer (line 6), and so forth. We stop developing the graph once no new facts can be obtained (line 8).

5.2 Heuristics in Belief Space

A major disadvantage of the above heuristics is that they focus on a single state. When the agent is aware of the true state of the system, observations have no value. Hence, the above heuristics, as well as any heuristic that is based on a single state, do not provide an estimate for the value of information, which is a key advantage of POMDPs. We hence suggest now a heuristic that is computed over a set B of possible states (Algorithm 3).

We compute again the delete relaxation graph, with a few modifications. We compute for each state in B a separate fact layer. An action can be applied only if its preconditions are satisfied in the fact layers of all agents (line 7). This is equivalent to the requirement in contingent planning where an action is applicable only if it is applicable in all states in the current belief, where B is served as an approximation of the true belief state.

Second, our method leverages the deterministic observations, that allow us to filter out states that are inconsistent with the received observation (lines 8-14). When a sensing action can be applied, all states that do not agree with the value of s on the observation are discarded from B (lines 10-14). That is, we remove the fact layers corresponding to these states, and no longer consider them when computing which actions can be applied.

We stop when both no states were discarded, and no new facts were obtained (line 17). This process must take into account sensing actions to remove states that are incompatible with s, which would allow, at the next iteration, that action preconditions would be satisfied for less states, and hence additional actions can be executed.