Navigation with Large Language Models: LLM Heuristics for Goal-Directed Exploration

Written by escholar | Published 2024/04/18
Tech Story Tags: large-language-models | semantic-scene-understanding | navigation-with-llm | llm-heuristics | goal-directed-exploration | scoring-subgoals | polling-llms | language-frontier-guide

TL;DR: In this paper we study how the “semantic guesswork” produced by language models can be utilized as a guiding heuristic for planning algorithms.

This paper is available on arXiv under the CC 4.0 DEED license.

Authors:

(1) Dhruv Shah, UC Berkeley (equal contribution);

(2) Michael Equi, UC Berkeley (equal contribution);

(3) Blazej Osinski, University of Warsaw;

(4) Fei Xia, Google DeepMind;

(5) Brian Ichter, Google DeepMind;

(6) Sergey Levine, UC Berkeley and Google DeepMind.

5 LLM Heuristics for Goal-Directed Exploration

Given the LLM scoring pipeline outlined in the previous section, our key insight is that we can incorporate these scores in a search-based planning pipeline to heuristically guide the search process. We instantiate LFG using frontier-based exploration (FBE) and LLM scores generated via polling.
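
To make the polling idea concrete, the sketch below samples an LLM several times with a “most likely” and a “least likely” prompt and turns the vote counts into normalized positive and negative scores for each textual subgoal. This is an illustrative assumption rather than the paper’s Algorithm 1: the llm_choose callable, the prompt wording, and the number of samples are all placeholders.

```python
from collections import Counter
from typing import Callable, Dict, List

def poll_llm_scores(
    subgoals: List[str],                          # textual subgoal candidates s_i
    query: str,                                   # goal object category q
    llm_choose: Callable[[str, List[str]], str],  # hypothetical helper: returns one of the subgoals
    n_samples: int = 10,
) -> Dict[str, Dict[str, float]]:
    """Return polled positive/negative scores in [0, 1] for each subgoal (sketch)."""
    pos_prompt = f"Which area is MOST likely to contain a {query}?"
    neg_prompt = f"Which area is LEAST likely to contain a {query}?"

    pos_votes: Counter = Counter()
    neg_votes: Counter = Counter()
    for _ in range(n_samples):
        # Sample the LLM repeatedly (e.g., with nonzero temperature) and count votes.
        pos_votes[llm_choose(pos_prompt, subgoals)] += 1
        neg_votes[llm_choose(neg_prompt, subgoals)] += 1

    return {
        s: {"positive": pos_votes[s] / n_samples,
            "negative": neg_votes[s] / n_samples}
        for s in subgoals
    }
```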

FBE: This method maintains a “map” of the seen parts of the environment, which may be geometric [33] or topological [34], and a frontier separating it from the unexplored parts. By navigating to the nearest point of the frontier, the robot explores new areas of the environment until it finds the goal object or completes exploration without finding it. A standard FBE implementation is presented in Algorithm 2 in black text. The robot maintains either a 2D metric map of its surroundings, or a topological map whose nodes are comprised of the robot’s visual observations and whose edges represent paths taken in the environment. Additionally, we store semantic labels corresponding to objects detected in the robot’s observations, which are used to ground the observations in text.
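
As a concrete illustration of the geometric variant, the sketch below extracts frontier cells from a 2D occupancy grid, treating a frontier cell as a free cell adjacent to at least one unknown cell. The cell encoding is an assumption for illustration; the paper’s implementation (and its topological counterpart) may differ.

```python
import numpy as np

# Assumed cell encoding for the 2D metric map: 0 = free, 1 = occupied, -1 = unknown.
FREE, OCCUPIED, UNKNOWN = 0, 1, -1

def extract_frontier_cells(grid: np.ndarray) -> list:
    """Frontier cells are free cells with at least one unknown 4-neighbor."""
    frontiers = []
    rows, cols = grid.shape
    for r in range(rows):
        for c in range(cols):
            if grid[r, c] != FREE:
                continue
            neighbors = [(r - 1, c), (r + 1, c), (r, c - 1), (r, c + 1)]
            if any(0 <= nr < rows and 0 <= nc < cols and grid[nr, nc] == UNKNOWN
                   for nr, nc in neighbors):
                frontiers.append((r, c))
    return frontiers
```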

At a fixed re-planning rate, the high-level planner computes the frontier points fi (Line 10), picks the frontier point closest to the current location, i.e., the one maximizing the distance score (Line 16), and then navigates to this frontier (Line 21). At any point in this process, if the agent’s semantic detector finds an object of the same category as the query q, the agent navigates directly to that object and the trajectory ends.
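
A minimal version of this vanilla FBE selection step, assuming the “distance score” is simply the negative Euclidean distance to the robot (so the nearest frontier point maximizes it), could look like:

```python
import numpy as np

def distance_score(frontier_point, robot_position):
    # Higher is better: negative Euclidean distance, so the nearest frontier
    # point maximizes the score (a simplifying assumption about the distance score).
    return -float(np.linalg.norm(np.asarray(frontier_point) - np.asarray(robot_position)))

def pick_nearest_frontier(frontier_points, robot_position):
    # Vanilla FBE: greedily navigate to the closest frontier point.
    return max(frontier_points, key=lambda f: distance_score(f, robot_position))
```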

Incorporating LLM scores: Our method, LFG, extends FBE by using an additional search heuristic obtained by polling LLMs for semantic “scores”. The modifications to FBE are marked in purple in Algorithm 2. After enumerating the frontiers, LFG uses the semantic labels from a VLM [35] to ground subgoal images at each frontier fi (Line 11). These images are converted into textual strings, and form the subgoal candidates si that can be scored using Algorithm 1. We associate each frontier point fi with the nearest object cluster ci (Line 17), and compute LLM scores for each point as follows:

where p is the current position of the agent, and wp, wn are hyperparameters (see Appendix A.1). We then choose the frontier with the highest score to be the next subgoal (Line 21), navigate to it using a local controller, and repeat the planning process. Algorithm 2 outlines the general recipe for integrating LLM scores as a planning heuristic. Please see Appendix A for specific instantiations of this system with geometric and topological maps, and more details about the referenced subroutines.
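
As a rough sketch (the exact formula used by LFG may differ; see Appendix A.1 for the hyperparameters), one plausible way to combine these terms rewards the polled positive LLM score weighted by wp, penalizes the polled negative score weighted by wn, and subtracts the distance from the agent’s position p to the frontier point:

```python
import numpy as np

def lfg_frontier_score(frontier_point, robot_position, llm_scores, w_p, w_n):
    """Illustrative heuristic for one frontier point (an assumed combination,
    not the paper's exact formula).

    llm_scores: dict with polled 'positive' and 'negative' scores in [0, 1]
    for the object cluster c_i associated with this frontier point.
    """
    dist = float(np.linalg.norm(np.asarray(frontier_point) - np.asarray(robot_position)))
    return w_p * llm_scores["positive"] - w_n * llm_scores["negative"] - dist

def pick_best_frontier(frontiers, robot_position, llm_scores_by_frontier, w_p, w_n):
    # Choose the highest-scoring frontier point as the next subgoal.
    return max(frontiers,
               key=lambda f: lfg_frontier_score(f, robot_position,
                                                llm_scores_by_frontier[f], w_p, w_n))
```

In a combination of this shape, the LLM terms bias the greedy frontier choice toward semantically promising regions without overriding it when the distance gap is large, consistent with using the language model’s guesses as a heuristic rather than a hard constraint.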

