New System Combines SLAM and Language Models for Online 3D Scene Mapping

Written by roomscale | Published 2026/02/25
Tech Story Tags: vision-language-models | legs-framework | 3d-gaussian-splatting | clip-embeddings | autonomous-robot-navigation | rgb-d-mapping | neural-radiance-fields | embodied-ai

TL;DR: This article surveys advances in mobile robot mapping and semantic scene understanding, highlighting the limitations of voxel grids, NeRFs, and closed-vocabulary systems. It introduces LEGS, a framework that combines online 3D Gaussian Splatting with language-aligned feature supervision, enabling robots to build real-time, open-vocabulary semantic maps and localize objects through natural language queries in large-scale environments.

A. Mobile Robot Mapping

Early robotic scene mapping research focused on developing core competencies in the metric [20], [21], [22] and topological [23], [24] knowledge spaces, centered extensively on the question of map and knowledge representation. For successful task execution, data-rich 3D scene representation and self-localization are critical, enabled by Simultaneous Localization and Mapping (SLAM) algorithms [25], [26], [27]. 3D spatial maps have traditionally been represented by voxel grids, points or surfels, and more recently, neural radiance fields [28], [29]. Each of these approaches comes with its own limitations. The accuracy and expressiveness of occupancy and voxel grids are resolution-bounded due to quantization. Points and surfels are discontinuous when rendered, making it challenging to supervise features in a continuous manner. Recent SLAM methods adopting a neural radiance field representation, such as NICE-SLAM [30] and NeRF-SLAM [31], are constrained by their implicit representation, which makes it difficult to update geometry over time.

B. Semantic Scene Mapping for Robotics

Semantic grounding, particularly in 3D representations, is a longstanding problem [32]: integrating semantic knowledge of objects and the surrounding environment into a mapped scene. The first definition of semantic mapping for robotics is provided by Nüchter et al. as a spatial map, 2D or 3D, augmented by information about entities, i.e., objects, functionalities, or events located in space [33]. An early work proposes concurrent object identification and localization using a supervised hierarchical neural network classifier on image color histogram feature vectors [34]. However, because these approaches rely on supervised datasets [35], [36], they work only on a closed vocabulary and do not generalize to open-ended semantic queries. More recent works have focused on using large vision-language models to support open-vocabulary queries. This includes both 2D [37], [38], [39] and 3D approaches, such as VL-Maps [8] and CLIP-Fields [7], which assign a CLIP feature to every point in the 3D scene; these features can be used to set navigation goals from natural language queries. OpenScene [14] ensembles open-vocabulary feature encoders and 3D point networks to form a per-point feature vector, allowing natural language querying on point clouds. ConceptFusion [40] develops 3D open-set multimodal mapping by projecting CLIP [41] pixel-aligned features into 3D points, and additionally fuses other modalities such as audio into the scene representation. ConceptGraphs [42] models spatial relationships as well as the semantic objects in the scene to reason over spatial and semantic concepts.
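To make the per-point querying idea concrete, here is a minimal sketch (hypothetical names; it assumes the per-point features and the query text embedding already live in the same CLIP embedding space) that scores every mapped point against a natural-language query by cosine similarity:

```python
import numpy as np

def relevancy_scores(point_features, text_embedding):
    """Score each 3D point against a natural-language query.

    point_features: (N, D) array of language-aligned per-point features.
    text_embedding: (D,) CLIP embedding of the query text.
    Returns cosine similarity per point, in [-1, 1].
    """
    pf = point_features / np.linalg.norm(point_features, axis=1, keepdims=True)
    te = text_embedding / np.linalg.norm(text_embedding)
    return pf @ te

# Toy example: 3 points with 4-dim features; the query is aligned with point 0.
feats = np.array([[1.0, 0.0, 0.0, 0.0],
                  [0.0, 1.0, 0.0, 0.0],
                  [0.5, 0.5, 0.0, 0.0]])
query = np.array([1.0, 0.0, 0.0, 0.0])
scores = relevancy_scores(feats, query)
best = int(np.argmax(scores))  # point 0 best matches the query
```

A navigation goal can then be set at the highest-scoring point, which is the basic mechanism behind querying systems like VL-Maps and CLIP-Fields, though each uses its own relevancy formulation.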

Semantic fields have been applied not only to scene-level understanding for mobile robots but also to manipulation. In these settings, NeRFs [2] have been a popular 3D representation, following from Distilled Feature Fields [9], Neural Feature Fusion Fields [10], and Language Embedded Radiance Fields (LERF) [1]. These works learn a semantic field in addition to the color field. LERF supports a scale-conditioned feature field, which takes an extra scalar as input to facilitate feature encodings at multiple scene scales. For manipulation, feature fields have been shown to facilitate learning from few-shot demonstrations [4], policy learning [3], zero-shot trajectory generation [43], and task-oriented grasping [5]. LERF-TOGO's [5] zero-shot task-oriented grasping performance is fully based on LERF, as LERF's multi-scale semantics allow for both object- and part-level understanding. This property is also valuable in scene-level settings, where a human may specify a collection of objects, e.g., utensils. LEGS maintains this multi-scale understanding while speeding up training and querying by using Gaussian Splats [16], which have a significantly faster render time.

C. 3D Gaussian Splatting

3D Gaussian Splatting (3DGS), introduced in 2023 [16], models a scene as an explicit collection of 3D Gaussians. Each Gaussian is described by its position vector µ, covariance matrix Σ, and an opacity parameter α, creating a representation that is both succinct and adaptable for static environments. The choice of 3D Gaussians over traditional point clouds is strategic: their inherent differentiability and the ease with which they can be rasterized into 2D splats enable accelerated α-blending during rendering. By avoiding the volumetric ray casting employed by Neural Radiance Fields (NeRFs), Gaussian Splatting has a substantial speed advantage and can support real-time rendering. Since its release, 3DGS has been applied to mapping [44], semantic mapping [45], navigation [46], and semantic fields [17], [18]. 3DGS's fast rendering speeds up optimization, making it suitable for integrating visual SLAM and natural language queries for 3D semantic fields. 3DGS has also been demonstrated on both indoor datasets [47], [48], [49] and outdoor driving scenes with multiple cameras, where all sensor data is collected before 3DGS training.
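The α-blending mentioned above reduces, per pixel, to front-to-back compositing of depth-sorted splats. This toy sketch (a drastic simplification of the actual CUDA rasterizer; per-splat alphas are assumed to have already been computed from each Gaussian's projected footprint and opacity) shows the core loop:

```python
import numpy as np

def composite_front_to_back(colors, alphas):
    """Composite depth-sorted splats into one pixel color.

    colors: (N, 3) RGB per splat, sorted nearest first.
    alphas: (N,) effective alpha of each splat at this pixel.
    Each splat contributes its color weighted by its alpha and by the
    transmittance remaining after all nearer splats.
    """
    out = np.zeros(3)
    transmittance = 1.0
    for c, a in zip(colors, alphas):
        out += transmittance * a * c
        transmittance *= (1.0 - a)
        if transmittance < 1e-4:  # early termination once the pixel saturates
            break
    return out

# A half-transparent red splat in front of an opaque green splat.
pixel = composite_front_to_back(np.array([[1.0, 0.0, 0.0],
                                          [0.0, 1.0, 0.0]]),
                                np.array([0.5, 1.0]))
```

Because this accumulation is a simple sorted sweep rather than a ray march through a neural network, it is cheap and fully differentiable, which is the source of the speed advantage the paragraph describes.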

D. Concurrent Research

Other work on 3D Gaussian Splatting has focused on embedding language features and, separately, on training online. Learning semantic features for Gaussians has taken one of two approaches: computing them on the fly by querying a network, or maintaining multi-dimensional features for each Gaussian. FMGS [17] uses multi-resolution hash encodings [19] optimized with a render-time loss to combine CLIP features with a map of 3D Gaussians. LEGS similarly

Fig. 2: LEGS System Integration. LEGS uses a Fetch robot with a custom multi-camera configuration in which a RealSense D455 faces forward while two ZED cameras face the left and right sides, respectively. The left ZED image stream is fed into DROID-SLAM to compute pose estimates for the left camera, and the corresponding extrinsics are used to compute pose estimates for the other ZED camera and the D455. These image-pose pairs are then used for concurrent online Gaussian splat and CLIP training. From there, the Gaussian splat can be queried for an object (e.g., "First Aid Kit"), and the corresponding relevancy field is computed to localize the desired object.

utilizes a hash encoding for its feature field; however, it includes scale-conditioning rather than averaging CLIP across scales, retaining finer-grained language understanding. LangSplat [18], on the other hand, embeds language in 3DGS by training a scene-specific autoencoder to map between CLIP embeddings and a lower-dimensional latent feature associated with each 3D Gaussian. Like traditional radiance field methods, LangSplat assumes poses are available for all scene images prior to 3DGS training, as it requires training a VAE over all images of a scene before starting its 3D optimization. However, for robotic systems it is often desirable to develop 3D semantic understanding online as the robot explores new and previously unseen large-scale environments. SplaTAM [44] optimizes both camera pose and the 3D Gaussian map simultaneously for single-camera setups. However, multiple cameras and viewpoints can make environment data collection more efficient. Additionally, SplaTAM lacks semantic features, which are important for identifying and interacting with objects in a 3D scene. To our knowledge, LEGS is the first system that integrates the advantages of both online 3DGS training and language-aligned feature supervision into Gaussian splats for large-scale scene understanding.
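The multi-camera pose propagation described in Fig. 2 amounts to composing rigid transforms: given DROID-SLAM's world pose for the left camera and a fixed calibrated extrinsic between cameras, the other cameras' world poses follow by matrix multiplication. A minimal sketch with hypothetical names and 4×4 homogeneous matrices (the actual extrinsic values come from calibration, not from this illustrative offset):

```python
import numpy as np

def propagate_pose(T_world_left, T_left_other):
    """Compose rigid transforms: (world <- left), estimated by SLAM,
    with (left <- other), a fixed calibrated extrinsic, to recover
    (world <- other) for a rigidly mounted second camera."""
    return T_world_left @ T_left_other

# Toy example: SLAM places the left ZED at x = 1.0 m in the world frame;
# the D455 is mounted 0.2 m along the left camera's x-axis (hypothetical).
T_world_left = np.eye(4)
T_world_left[0, 3] = 1.0
T_left_d455 = np.eye(4)
T_left_d455[0, 3] = 0.2
T_world_d455 = propagate_pose(T_world_left, T_left_d455)
# The D455's world pose inherits the SLAM estimate plus the rigid offset.
```

Because the extrinsics are constant, only one camera needs online pose estimation, which is what lets the system feed all three image streams into concurrent Gaussian splat and CLIP training.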

Authors:

  1. Justin Yu
  2. Kush Hari
  3. Kishore Srinivas
  4. Karim El-Refai
  5. Adam Rashid
  6. Chung Min Kim
  7. Justin Kerr
  8. Richard Cheng
  9. Muhammad Zubair Irshad
  10. Ashwin Balakrishna
  11. Thomas Kollar
  12. Ken Goldberg

This paper is available on arxiv under CC by 4.0 Deed (Attribution 4.0 International) license.
