Navigation with Large Language Models: Abstract & Introduction

Written by heuristicsearch | Published 2024/04/18
Tech Story Tags: large-language-models | semantic-scene-understanding | navigation-with-llm | llm-heuristics | goal-directed-exploration | scoring-subgoals | polling-llms | language-frontier-guide

TL;DR: In this paper we study how the “semantic guesswork” produced by language models can be utilized as a guiding heuristic for planning algorithms.

This paper is available on arxiv under a CC 4.0 DEED license.

Authors:

(1) Dhruv Shah, UC Berkeley (equal contribution);

(2) Michael Equi, UC Berkeley (equal contribution);

(3) Blazej Osinski, University of Warsaw;

(4) Fei Xia, Google DeepMind;

(5) Brian Ichter, Google DeepMind;

(6) Sergey Levine, UC Berkeley and Google DeepMind.

Abstract

Navigation in unfamiliar environments presents a major challenge for robots: while mapping and planning techniques can be used to build up a representation of the world, quickly discovering a path to a desired goal in unfamiliar settings with such methods often requires lengthy mapping and exploration. Humans can rapidly navigate new environments, particularly indoor environments that are laid out logically, by leveraging semantics — e.g., a kitchen often adjoins a living room, an exit sign indicates the way out, and so forth. Language models can provide robots with such knowledge, but directly using language models to instruct a robot how to reach some destination can also be impractical: while language models might produce a narrative about how to reach some goal, because they are not grounded in real-world observations, this narrative might be arbitrarily wrong. Therefore, in this paper we study how the “semantic guesswork” produced by language models can be utilized as a guiding heuristic for planning algorithms. Our method, Language Frontier Guide (LFG), uses the language model to bias exploration of novel real world environments by incorporating the semantic knowledge stored in language models as a search heuristic for planning with either topological or metric maps. We evaluate LFG in challenging real-world environments and simulated benchmarks, outperforming uninformed exploration and other ways of using language models.

Keywords: navigation, language models, planning, semantic scene understanding

1 Introduction

Navigation in complex open-world environments is conventionally viewed as the largely geometric problem of determining collision-free paths that traverse the environment from one location to another. However, real-world environments possess semantics. Imagine navigating an airport to get to a terminal: your prior knowledge about the way such buildings are constructed provides extensive guidance, even if this particular airport is unfamiliar to you. Large language models (LLMs) and various language embedding techniques have been studied extensively as ways to interpret the semantics in user-specified instructions (e.g., parsing “go to the television in the living room” and grounding it in a specific spatial location), but such models can provide much more assistance in robotic navigation scenarios by capturing rich semantic knowledge about the world. For instance, when looking for a spoon in an unseen house, the LLM can produce a “narrative” explaining why going towards a dishwasher may eventually lead you to find the spoon, and that the robot should prioritize that direction. This is similar to how a person might imagine different ways that the available subgoals might lie on the path to the goal, and start exploring the one for which this “narrative” seems most realistic. However, since language models are not grounded in the real world, such models do not know the spatial layout of the robot’s surroundings (e.g., there is a couch that the robot needs to circumnavigate). To utilize the semantic knowledge in language models to aid in embodied tasks, we should not just blindly follow the language model suggestions, but instead use them as proposals or navigational heuristics. In this paper, we study how that might be accomplished.

We study this idea in the context of visual navigation, where a robot is tasked with reaching a goal denoted by a natural language query q (see Fig. 1) in a novel environment using visual observations. The robot has no prior experience in the target environment, and must explore the environment to look for the goal. While narratives generated by an LLM may not be sufficient for navigation by themselves, they contain useful cues that can be used to inform or guide the behavior of the underlying navigation stack for the language navigation task (e.g., by choosing between collision-free subgoal proposals that avoid the couch and lead to the ice tray). We show that this idea can be combined with frontier-based exploration, where the robot maintains a set of unvisited locations at its frontier, grounds them in text using a vision-language model (VLM), and scores the unvisited subgoals by using LLM reasoning.
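To make the scoring step concrete, the sketch below shows one possible way to poll an LLM over VLM-generated captions of frontier subgoals and turn its votes into scores. This is an illustrative sketch under our own assumptions; the function and variable names (e.g., score_frontier_subgoals, poll_llm) are hypothetical and not the paper's implementation.

```python
# Hypothetical sketch: poll an LLM repeatedly over captioned frontier subgoals
# and turn the resulting votes into heuristic scores. Names are illustrative.
import collections
from typing import Callable, Dict, List


def score_frontier_subgoals(
    goal_query: str,
    subgoal_captions: List[str],      # VLM-produced text for each frontier point
    poll_llm: Callable[[str], str],   # wraps whatever LLM API is available
    num_samples: int = 10,
) -> Dict[str, float]:
    """Poll the LLM several times and convert its votes into per-subgoal scores."""
    prompt = (
        f"You are helping a robot find: {goal_query}.\n"
        "Candidate directions (described by what is visible there):\n"
        + "\n".join(f"{i}: {c}" for i, c in enumerate(subgoal_captions))
        + "\nAnswer with the single index most likely to lead to the goal."
    )

    votes = collections.Counter()
    for _ in range(num_samples):      # repeated sampling smooths out noisy answers
        answer = poll_llm(prompt).strip()
        if answer.isdigit() and int(answer) < len(subgoal_captions):
            votes[int(answer)] += 1

    # Normalize vote counts into [0, 1] scores keyed by subgoal caption.
    return {
        caption: votes[i] / num_samples
        for i, caption in enumerate(subgoal_captions)
    }
```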

We propose Language Frontier Guide, or LFG, a method for leveraging the reasoning capabilities of LLMs to produce a search heuristic for guiding exploration of previously unseen real-world environments, combining the strengths of search-based planning with LLM reasoning. LFG is agnostic of the memory representation and planning framework, and can be combined with both (i) a geometric navigation pipeline, building a metric map of the environment for planning and using a hand-designed controller, and (ii) a learning-based navigation pipeline, building a topological map for planning and using a learned control policy, yielding a versatile system for navigating to open-vocabulary natural language goals. Our experiments show that LFG can identify and predict simple patterns in previously unseen environments to accelerate goal-directed exploration. We show that LFG outperforms other LLM-based approaches for semantic goal finding in challenging real-world environments and on the Habitat ObjectNav benchmark.
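One simple way such semantic scores could be folded into a frontier-based planner is to treat them as a heuristic bonus on top of an ordinary distance cost. Again, this is a hedged sketch rather than the paper's exact formulation: the weighting scheme and the names select_frontier and weight are assumptions made here for illustration.

```python
# Illustrative only: combine LLM preference scores with path length when
# choosing which frontier to explore next. Not the paper's exact formulation.
from typing import Dict, List, Tuple


def select_frontier(
    frontiers: List[Tuple[str, float]],   # (caption, path length to frontier in meters)
    llm_scores: Dict[str, float],         # e.g., output of score_frontier_subgoals
    weight: float = 5.0,                  # trade-off between distance and semantics
) -> str:
    """Pick the frontier minimizing distance minus a weighted LLM preference."""
    best_caption, best_cost = None, float("inf")
    for caption, distance in frontiers:
        cost = distance - weight * llm_scores.get(caption, 0.0)
        if cost < best_cost:
            best_caption, best_cost = caption, cost
    return best_caption
```

In a sketch like this, raising the weight makes exploration follow the language model's semantic preferences more aggressively, while a weight of zero falls back to plain nearest-frontier exploration.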

