Optimize Planning Heuristics to Rank, not to Estimate Cost-to-Goal: Related Work

by @escholar

Too Long; Didn't Read

This work revisits the necessary and sufficient conditions of strictly optimally efficient heuristics for forward search algorithms.

This paper is available on arXiv under the CC 4.0 DEED license.

Authors:

(1) Leah Chrestien, Czech Technical University in Prague;

(2) Tomáš Pevný, Czech Technical University in Prague, and Gen Digital, Inc.;

(3) Stefan Edelkamp, Czech Technical University in Prague;

(4) Antonín Komenda, Czech Technical University in Prague.

The closest work to ours is [36, 38], which adjusts the heuristic value by a policy estimated by a neural network. The theory therein has the same motivation, to minimize the number of expanded states, though it is more complicated than the theory presented here. For efficiency, [38] needs to modify GBFS such that search nodes store the policy and the probability of reaching the node from the start. Moreover, in some variants (used in the experimental comparison below), the neural network has to estimate both the heuristic value and the policy.
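To make this mechanism concrete, below is a minimal sketch of a policy-guided GBFS variant in which each node carries the probability of its path under the learned policy. The particular priority (heuristic value divided by the path probability) is our illustrative assumption, not necessarily the exact function used in [36, 38].

```python
import heapq

def policy_guided_gbfs(start, goal_test, successors, h, policy):
    # Sketch of a GBFS variant whose nodes store the probability of the
    # path that reached them under the learned policy. `successors`
    # yields (action, child) pairs, `policy` returns the probability of
    # an action in a state, and states must be hashable.
    counter = 0  # tie-breaker so heapq never compares states directly
    open_list = [(h(start), counter, start, 1.0)]
    closed = set()
    while open_list:
        _, _, state, path_prob = heapq.heappop(open_list)
        if goal_test(state):
            return state
        if state in closed:
            continue
        closed.add(state)
        for action, child in successors(state):
            child_prob = path_prob * policy(state, action)
            counter += 1
            # Illustrative priority: low heuristic and high path
            # probability are both preferred.
            priority = h(child) / max(child_prob, 1e-12)
            heapq.heappush(open_list, (priority, counter, child, child_prob))
    return None  # no goal reachable
```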


Inspired by statistical surveys of heuristic functions in [57, 58], [18] proposes to optimize the rank of states so that it preserves the order defined by the true cost-to-goal. The formulation is therefore valid only for GBFS and not for A* search. Ties are not resolved, which means that strict optimal efficiency cannot be guaranteed. Similarly, heuristic values cannot be optimized for domains where all actions have zero cost. Lastly and importantly, the need to know the true cost-to-goal for all states in the training set greatly increases the cost of its construction, as discussed above. From Bellman's equation, [49] arrives at the requirement that the smallest heuristic value among child nodes has to be smaller than that of the parent node. The resulting loss is regularized with an L1 term forcing the heuristic values to be close to the cost-to-goal.
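As an illustration, a loss in the spirit of [49] could be sketched as follows; the hinge margin and the regularization weight are assumptions made for the example.

```python
import torch

def bellman_rank_loss(h_parent, h_children, cost_to_goal,
                      margin=1.0, reg_weight=0.1):
    # Hinge term: the smallest heuristic value among the children
    # should lie below the parent's value, as [49] derives from
    # Bellman's equation.
    rank_term = torch.relu(h_children.min() - h_parent + margin)
    # L1 regularization keeping the heuristic close to the true
    # cost-to-goal.
    reg_term = torch.abs(h_parent - cost_to_goal)
    return rank_term + reg_weight * reg_term
```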


In [5], A* is viewed as a Markov decision process with the Q-function equal to the number of steps A* needs to reach the solution. This detaches the heuristic values from the cost-to-goal, but it does not solve the problems with dead-ends and ties, since the optimization is done by minimizing an L2 loss.
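A minimal sketch of this idea, assuming the number of remaining A* iterations has been logged from completed searches as a training target, might look like this.

```python
import torch

def search_steps_l2_loss(h_pred, steps_to_solution):
    # Regress the heuristic onto the number of remaining A* iterations
    # (the Q-function of [5]) instead of the cost-to-goal.
    return torch.mean((h_pred - steps_to_solution.float()) ** 2)
```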


The symmetry of the L2 (and L1) loss is discussed in [51], as it does not promote admissibility of the heuristic. The authors suggest an asymmetric L1 loss with different weights on the left and right parts, but this does not completely solve the problems of estimating the cost-to-goal (for example, ties and dead-ends). Parameters of potential heuristics [41] are optimized by linear programming for each problem instance to meet admissibility constraints while helping the search. In contrast, this work focuses on the efficiency of the search over many problem instances but cannot guarantee admissibility in general.
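A sketch of such an asymmetric L1 loss is shown below; the particular weights, which penalize overestimation more heavily, are illustrative.

```python
import torch

def asymmetric_l1_loss(h_pred, cost_to_goal, w_over=2.0, w_under=1.0):
    # Penalize overestimation (which breaks admissibility) more heavily
    # than underestimation.
    diff = h_pred - cost_to_goal
    return torch.mean(torch.where(diff > 0, w_over * diff, -w_under * diff))
```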


Combining neural networks with discrete search algorithms such that the search becomes an inseparable part of the architecture is proposed in [55] and [60]. The gradient with respect to the parameters has to be propagated through the search algorithm, which usually requires an approximation and the execution of the search algorithm during every inference.
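One common technique for propagating gradients through a discrete solver is perturbation-based blackbox differentiation; whether [55] or [60] uses exactly this scheme is our assumption, and the sketch below is only illustrative. Note that it requires an extra solver call, consistent with the observation that the search must be executed repeatedly.

```python
def blackbox_solver_grad(solver, w, grad_output, lam=10.0):
    # Perturbation-based surrogate gradient through a discrete solver:
    # solve once at the current continuous inputs and once at inputs
    # perturbed in the direction of the incoming gradient, then take a
    # finite-difference-style estimate. `w` and `grad_output` are
    # assumed to be array-like (e.g., numpy arrays).
    y = solver(w)
    y_perturbed = solver(w + lam * grad_output)
    return -(y - y_perturbed) / lam
```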


A large body of literature [2, 15, 20, 46] improves Monte Carlo Tree Search (MCTS), which has the advantage of suppressing noise in the estimates of solution costs. The methods proposed there are exclusive to MCTS and do not apply to variants of the forward search algorithm. Similarly, many works [5, 8, 16, 19, 45, 53, 61] investigate architectures of neural networks implementing the heuristic function h(s, θ), ideally for arbitrary planning problems. We use [8, 45] to implement h(s, θ), but we remind our readers that our interest lies in the loss function minimized when optimizing θ.


Finally, [1] proposes to construct a training set, similarly to pattern databases [12], by random backward movements from the goal state, and to adapt A* for multi-core processors by expanding k > 1 states in each iteration. These improvements are independent of those proposed here and can be combined.
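A minimal sketch of such a backward-walk training-set construction, with hypothetical function names, could look as follows; the number of backward steps serves only as an upper bound on the true cost-to-goal.

```python
import random

def backward_walk_dataset(goal_state, inverse_successors, walk_length, n_walks):
    # Hypothetical helper: perform random backward walks from the goal,
    # labelling each visited state with the number of backward steps,
    # which upper-bounds its cost-to-goal (unit-cost actions assumed).
    dataset = []
    for _ in range(n_walks):
        state = goal_state
        for step in range(1, walk_length + 1):
            state = random.choice(inverse_successors(state))
            dataset.append((state, step))
    return dataset
```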