Authors:
(1) Kun Lan, University of Science and Technology of China;
(2) Haoran Li, University of Science and Technology of China;
(3) Haolin Shi, University of Science and Technology of China;
(4) Wenjun Wu, University of Science and Technology of China;
(5) Yong Liao, University of Science and Technology of China;
(6) Lin Wang, AI Thrust, HKUST(GZ);
(7) Pengyuan Zhou, University of Science and Technology of China.
3. Method
3.1. Point-Based Rendering and Semantic Information Learning
3.2. Gaussian Clustering
3.3. Gaussian Filtering
4. Experiment
4.1. Setups
4.2. Result
4.3. Ablations
Recently, 3D Gaussians have emerged as an explicit 3D representation that is strongly competitive with NeRF (Neural Radiance Fields) in both expressing complex scenes and training duration. These advantages suggest a wide range of applications for 3D Gaussians in 3D understanding and editing. Meanwhile, the segmentation of 3D Gaussians is still in its infancy: existing methods are not only cumbersome but also unable to segment multiple objects simultaneously within a short time. In response, this paper introduces a 3D Gaussian segmentation method supervised by 2D segmentation. The approach uses input 2D segmentation maps to guide the learning of added per-Gaussian semantic information, while nearest-neighbor clustering and statistical filtering refine the segmentation results. Experiments show that our concise method achieves multi-object segmentation performance, in terms of mIoU and mAcc, comparable to previous single-object segmentation methods.
Index Terms— 3D Gaussian, 3D Segmentation
The recently emerged 3D Gaussian technique [1] marks a significant advancement over previous 3D representation methods such as point clouds [2], meshes [3], signed distance functions (SDF) [4], and neural radiance fields (NeRF) [5], especially in training time and scene reconstruction quality. The mean of each 3D Gaussian represents the position of its center point, the covariance matrix encodes rotation and size, and spherical harmonics express color. Initialized from point clouds obtained via SfM (Structure from Motion) [6], 3D Gaussians inherently capture the scene's geometry, saving the time otherwise spent locating regions of space where objects are concentrated. Moreover, their explicit representation accelerates the computation of color and density for every 3D Gaussian in space, enabling real-time rendering, and adaptive density control endows them with the capacity to express fine detail. These advantages make 3D Gaussians widely applicable in 3D understanding and editing. Nonetheless, there is little research on 3D Gaussian segmentation, another critical pillar of the field.
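To make the parameterization concrete: in 3D Gaussian Splatting, the covariance described above is factored as Σ = R S Sᵀ Rᵀ, where R is a rotation (stored as a unit quaternion) and S a diagonal matrix of per-axis scales. The sketch below illustrates this factorization in plain NumPy; it is a simplified illustration, not the paper's implementation.

```python
import numpy as np

def quat_to_rotmat(q):
    """Convert a unit quaternion (w, x, y, z) to a 3x3 rotation matrix."""
    w, x, y, z = q / np.linalg.norm(q)
    return np.array([
        [1 - 2*(y*y + z*z), 2*(x*y - w*z),     2*(x*z + w*y)],
        [2*(x*y + w*z),     1 - 2*(x*x + z*z), 2*(y*z - w*x)],
        [2*(x*z - w*y),     2*(y*z + w*x),     1 - 2*(x*x + y*y)],
    ])

def gaussian_covariance(quat, scale):
    """Build the covariance Sigma = R S S^T R^T of one 3D Gaussian."""
    R = quat_to_rotmat(np.asarray(quat, dtype=float))
    S = np.diag(np.asarray(scale, dtype=float))
    return R @ S @ S.T @ R.T

# Example: an axis-aligned Gaussian stretched along x
# (identity rotation, scales 2 x 1 x 0.5).
Sigma = gaussian_covariance([1.0, 0.0, 0.0, 0.0], [2.0, 1.0, 0.5])
```

Storing rotation and scale separately, rather than a raw covariance matrix, guarantees Σ stays positive semi-definite during optimization.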
A few Gaussian segmentation methods have been proposed recently, yet they leave room for improvement. For example, Gaussian Grouping [7] requires an extended training period of about 15 minutes, and SAGA [8] is complex to implement and struggles to segment multiple objects simultaneously. Additionally, the explicit representation of 3D Gaussians incurs storage overhead, preventing the direct transfer of 2D semantic features into 3D as is done in NeRF segmentation [9, 10]. Finally, the scarcity of datasets and annotations impedes the supervised segmentation approaches commonly used in 2D and point cloud segmentation.
In light of these challenges, we propose leveraging a pre-trained 2D segmentation model to guide 3D Gaussian segmentation. Inspired by 2D segmentation methods, which assign each pixel a probability distribution over categories, we first attach an object code to each 3D Gaussian to represent its categorical probability distribution. We then classify each 3D Gaussian by minimizing the error between the input 2D segmentation map and the segmentation map rendered at the same pose. Finally, we apply KNN clustering to resolve semantic ambiguity among 3D Gaussians and statistical filtering to remove erroneously segmented ones. We validated the effectiveness of our approach through experiments on object-centric and 360° scenes. Our contributions can be summarized as follows.
• We propose an efficient 3D Gaussian segmentation method supervised by 2D segmentation, which can learn the semantic information of a 3D scene in less than two minutes and segment multiple objects in 1-2 seconds for a given viewpoint.
• Extensive experiments on LLFF, NeRF-360, and MipNeRF 360 have demonstrated the effectiveness of our method, achieving an mIoU of 86%.
This paper is available on arxiv under CC 4.0 license.