This paper is available on arxiv under CC 4.0 license.
Authors:
(1) Oded Blumenthal, Software and Information Systems Engineering, Ben Gurion University, Israel;
(2) Guy Shani, Software and Information Systems Engineering, Ben Gurion University, Israel.
Partially observable Markov decision processes (POMDPs) are a useful model for decision-making under partial observability and stochastic actions. Partially Observable Monte-Carlo Planning (POMCP) is an online algorithm for deciding on the next action to perform, using a Monte-Carlo tree search approach based on the UCT (UCB applied to trees) algorithm for fully observable Markov decision processes. POMCP develops an action-observation tree and, at the leaves, uses a rollout policy to provide a value estimate for the leaf. As such, POMCP is highly dependent on the rollout policy to compute good estimates, and hence identify good actions. Thus, many practitioners who use POMCP are required to create strong, domain-specific heuristics.
In this paper, we model POMDPs as stochastic contingent planning problems. This allows us to leverage domain-independent heuristics that were developed in the planning community. We suggest two heuristics: the first is based on the well-known hadd heuristic from classical planning, and the second is computed in belief space, taking the value of information into account.
Many autonomous agents operate in environments where actions have stochastic effects, and important information that is required for achieving the goal is hidden from the agent. Agents in such environments typically execute actions and sense some observations that result from these actions. Based on the accumulated observations, the agents can better estimate their current state and decide on the next action to execute. Such environments are often modeled as partially observable Markov decision processes (POMDPs) [28].
POMDP models allow us to reason about the hidden state of the system, typically using a belief state – a distribution over the possible environment states. The belief state can be updated given the executed action and the received observation. One can compute a policy, a mapping from beliefs to actions, that dictates which action should be executed given the current belief. Many algorithms have been suggested for computing such policies [25].
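To make this concrete, the following is a minimal sketch of the standard Bayesian belief update for a discrete POMDP. The dictionary-based belief and the transition and observation functions `T` and `O` are assumed interfaces chosen for illustration, not part of the paper's formalism.

```python
# Minimal sketch of a discrete POMDP belief update.  A belief is assumed to be
# a dict mapping states to probabilities; T(s, a, s') and O(s', a, o) are
# assumed callables returning transition and observation probabilities.

def update_belief(belief, action, observation, T, O, states):
    """Bayesian update: b'(s') is proportional to O(o | s', a) * sum_s T(s' | s, a) * b(s)."""
    new_belief = {}
    for s_next in states:
        # Probability of reaching s_next from the current belief under `action`.
        reach = sum(T(s, action, s_next) * p for s, p in belief.items())
        new_belief[s_next] = O(s_next, action, observation) * reach
    # Normalize; a zero sum means the observation is impossible under this belief.
    total = sum(new_belief.values())
    if total == 0:
        raise ValueError("Observation has zero probability under current belief")
    return {s: p / total for s, p in new_belief.items()}
```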
However, in larger environments, it often becomes difficult to maintain a belief state, let alone compute a policy for all possible belief states. In such cases, one can use an online re-planning approach, where after every action is executed, the agent computes which action to execute next. Such online approaches replace the lengthy single policy computation, done offline before the agent begins to act, with a sequence of shorter computations performed online, after each action [21].
POMCP [27] is such an online replanning approach, extending the UCT algorithm for fully observable Markov decision processes (MDPs) to POMDPs. POMCP operates by constructing a search tree online, interleaving decision and observation nodes. The root of the tree is a decision node. Each decision node has an outgoing edge for every possible action, ending at an observation node. The outgoing edges from an observation node denote the possible observations that can result from the incoming action. The agent computes a value for each node in the tree, and then, at the root node, chooses the action associated with the edge leading to the highest-value child node.
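For illustration, the sketch below shows how a decision node could select an action using the UCB1 rule that underlies UCT and POMCP. The node classes and the exploration constant `c` are hypothetical names for this sketch, not the authors' implementation.

```python
import math

class ObservationNode:
    """Edge for one action; its children are indexed by the resulting observation."""
    def __init__(self):
        self.visits = 0
        self.value = 0.0          # running average of simulated returns through this edge
        self.children = {}        # observation -> DecisionNode

class DecisionNode:
    """Node where the agent chooses an action; one child edge per action."""
    def __init__(self, actions):
        self.visits = 0
        self.children = {a: ObservationNode() for a in actions}

    def select_action(self, c=1.0):
        """UCB1 selection: prefer high estimated value plus an exploration bonus."""
        def ucb(child):
            if child.visits == 0:
                return float("inf")   # try untried actions first
            bonus = c * math.sqrt(math.log(self.visits + 1) / child.visits)
            return child.value + bonus
        return max(self.children, key=lambda a: ucb(self.children[a]))
```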
To evaluate the value of leaf nodes, POMCP executes a random walk in belief space, known as a rollout, where the agent selects actions from some rollout policy to construct a trajectory in belief space and obtain an estimate of the quality of the leaf node. Clearly, this evaluation is highly dependent on the ability of the rollout policy to reach the agent's goals. In complex problems, reaching the goal may require a lengthy sequence of actions [16], and until the goal is reached, no meaningful rewards are obtained. Indeed, practitioners who use POMCP often implement complex domain-specific heuristics for the rollout policy.
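A rollout can be sketched as follows, assuming a generative simulator with a `step(state, action)` interface that returns the next state, an observation, a reward, and a termination flag; these names, the discount `gamma`, and the depth bound are assumptions for illustration. The random default policy corresponds to the baseline that the heuristics in this paper aim to improve upon.

```python
import random

def rollout(state, simulator, actions, policy=None, gamma=0.95, max_depth=50):
    """Estimate a leaf's value by simulating one trajectory from `state`.

    `policy` maps a state to an action; if None, actions are chosen uniformly
    at random.
    """
    total, discount = 0.0, 1.0
    for _ in range(max_depth):
        action = policy(state) if policy else random.choice(actions)
        state, _observation, reward, done = simulator.step(state, action)
        total += discount * reward
        discount *= gamma
        if done:
            break
    return total
```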
In this paper we focus on suggesting domain-independent heuristics for rollout policies. We leverage work in automated planning, using heuristics defined for classical and contingent planning problems [7, 8, 6]. We thus represent POMDP problems in a structured manner, using boolean facts to capture the state of the environment. This allows both a compact representation of large problems, compared with standard flat representations that do not scale, and the ability to use classical planning heuristics.
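As a small, hypothetical illustration of such a factored encoding (the fact names are invented for this example), a state is simply the set of boolean facts that currently hold, rather than a single index into an enumeration of all possible world states.

```python
# Hypothetical factored state for a small robot-navigation-style domain:
# the state is the set of facts that are currently true; all other facts are false.
state = frozenset({
    "at(robot, room1)",
    "holding(robot, key)",
    "locked(door1)",
})

# A flat representation would instead enumerate every combination of facts as a
# separate state, which quickly becomes infeasible for large domains.
```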
We begin by suggesting the well-known hadd heuristic for choosing rollout actions [2]. This heuristic searches forward in a delete-relaxation setting until the goal has been reached. The value of an action is then determined by the number of steps, following the action, that are required to reach the goal in the delete-relaxation space.
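A minimal sketch of an hadd-style computation over a delete-relaxed, factored problem is given below. The `Action` objects with `preconditions` and `add_effects` fact sets are an assumed encoding rather than the paper's exact formalism.

```python
# Sketch of an additive (hadd-style) heuristic under the delete relaxation:
# action costs are uniform, and delete effects are ignored.

def h_add(state_facts, goal_facts, actions):
    INF = float("inf")
    cost = {f: 0.0 for f in state_facts}   # facts already true cost nothing

    def fact_cost(f):
        return cost.get(f, INF)

    changed = True
    while changed:                          # fixed point over relaxed reachability
        changed = False
        for a in actions:
            pre_cost = sum(fact_cost(p) for p in a.preconditions)
            if pre_cost == INF:
                continue                    # action not yet reachable in the relaxation
            for f in a.add_effects:
                new_cost = 1.0 + pre_cost
                if new_cost < fact_cost(f):
                    cost[f] = new_cost
                    changed = True
    return sum(fact_cost(g) for g in goal_facts)   # INF if the goal is relaxed-unreachable
```

Under this sketch, a rollout policy can rank the actions applicable in the current state by the heuristic value of their successor states and prefer the lowest one.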
Next, we observe that any state-based rollout policy is inherently limited in its ability to evaluate the missing information required for reaching the goal, and hence to provide an estimate of the value of information [12] of an action. We therefore suggest a multi-state rollout policy, where actions are executed on a set of states jointly, and observations are used to eliminate states that are incompatible with the observed value. We show that this heuristic is much more informed in domains that require complex information-gathering strategies.
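One step of such a multi-state rollout might look as follows. The `apply` and `observe` helpers, and the convention that the first state in the set plays the role of the environment's true state, are assumptions made purely for this sketch.

```python
def multi_state_rollout_step(states, action, apply, observe):
    """Apply `action` to a set of sampled states jointly, then use the sensed
    value to discard states that are incompatible with it.

    `apply(state, action)` returns the successor state and `observe(state, action)`
    returns the sensed value in that state (both assumed helpers).
    """
    next_states = [apply(s, action) for s in states]
    sensed = observe(next_states[0], action)   # observation taken from the "true" state
    # Keep only the states consistent with the sensed value; shrinking this set
    # is what captures the value of information of a sensing action.
    return [s for s in next_states if observe(s, action) == sensed]
```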
For an empirical evaluation, we extend domains from the contingent planning community with stochastic effects. We evaluate our heuristics, comparing them to random rollouts, and show that they lead to significantly better behavior.