Authors:
(1) Zhaoqing Wang, The University of Sydney and AI2Robotics;
(2) Xiaobo Xia, The University of Sydney;
(3) Ziye Chen, The University of Melbourne;
(4) Xiao He, AI2Robotics;
(5) Yandong Guo, AI2Robotics;
(6) Mingming Gong, The University of Melbourne and Mohamed bin Zayed University of Artificial Intelligence;
(7) Tongliang Liu, The University of Sydney. Table of Links Abstract and 1. Introduction 2. Related works 3. Method and 3.1. Problem definition 3.2. Baseline and 3.3. Uni-OVSeg framework 4. Experiments 4.1. Implementation details 4.2. Main results 4.3. Ablation study 5. Conclusion 6. Broader impacts and References A. Framework details B. Promptable segmentation C. Visualisation 4.1. Implementation details Datasets. During training, we randomly sample the 30% subset from the SA-1B dataset [34], which contains ∼ 3 million images and ∼ 0.3 billion masks. Although this supervision provides diverse binary masks, it lacks the semantic class for each mask. In addition, following Chen et al. [8], we collect about 1.3 million image-text pairs and use a large vision-language model to refine them. Afterward, we use the ChatGPT-based parser to extract entities with descriptive words from these text descriptions. Evaluation & metrics. We evaluate our model mainly on three tasks, including open-vocabulary semantic segmentation, open-vocabulary panoptic segmentation, and promptable segmentation. Following previous work [76], we adopt prompt engineering from [21, 66] and prompt templates from [22, 37]. For open-vocabulary semantic segmentation, we zero-shot evaluate the model on the COCO [40], ADE20K [84], PASCAL [18] datasets. The open-vocabulary semantic segmentation results are evaluated with the mean Intersection-over-Union (mIoU). For open-vocabulary panoptic segmentation, we evaluate the model on the COCO, ADE20K, and Cityscapes [15] datasets. We report the panoptic quality (PQ), semantic quality (SQ), and recognition quality (RQ) for open-vocabulary panoptic segmentation. For promptable segmentation, we report the 1-Point and 1-Box IoU (Oracle) on a wide range of datasets. Oracle denotes that we select the output mask with the max IoU by calculating the IoU between the prediction and target mask. More details can be found in Appendix B. This paper is available on arxiv under CC BY 4.0 DEED license. Authors: (1) Zhaoqing Wang, The University of Sydney and AI2Robotics; (2) Xiaobo Xia, The University of Sydney; (3) Ziye Chen, The University of Melbourne; (4) Xiao He, AI2Robotics; (5) Yandong Guo, AI2Robotics; (6) Mingming Gong, The University of Melbourne and Mohamed bin Zayed University of Artificial Intelligence; (7) Tongliang Liu, The University of Sydney. Authors: Authors: (1) Zhaoqing Wang, The University of Sydney and AI2Robotics; (2) Xiaobo Xia, The University of Sydney; (3) Ziye Chen, The University of Melbourne; (4) Xiao He, AI2Robotics; (5) Yandong Guo, AI2Robotics; (6) Mingming Gong, The University of Melbourne and Mohamed bin Zayed University of Artificial Intelligence; (7) Tongliang Liu, The University of Sydney. Table of Links Abstract and 1. Introduction Abstract and 1. Introduction 2. Related works 2. Related works 3. Method and 3.1. Problem definition 3. Method and 3.1. Problem definition 3.2. Baseline and 3.3. Uni-OVSeg framework 3.2. Baseline and 3.3. Uni-OVSeg framework 4. Experiments 4.1. Implementation details 4.1. Implementation details 4.2. Main results 4.2. Main results 4.3. Ablation study 4.3. Ablation study 5. Conclusion 5. Conclusion 6. Broader impacts and References 6. Broader impacts and References A. Framework details A. Framework details B. Promptable segmentation B. Promptable segmentation C. Visualisation C. Visualisation 4.1. Implementation details Datasets . During training, we randomly sample the 30% subset from the SA-1B dataset [34], which contains ∼ 3 million images and ∼ 0.3 billion masks. Although this supervision provides diverse binary masks, it lacks the semantic class for each mask. In addition, following Chen et al. [8], we collect about 1.3 million image-text pairs and use a large vision-language model to refine them. Afterward, we use the ChatGPT-based parser to extract entities with descriptive words from these text descriptions. Datasets Evaluation & metrics. We evaluate our model mainly on three tasks, including open-vocabulary semantic segmentation, open-vocabulary panoptic segmentation, and promptable segmentation. Following previous work [76], we adopt prompt engineering from [21, 66] and prompt templates from [22, 37]. For open-vocabulary semantic segmentation, we zero-shot evaluate the model on the COCO [40], ADE20K [84], PASCAL [18] datasets. The open-vocabulary semantic segmentation results are evaluated with the mean Intersection-over-Union (mIoU). For open-vocabulary panoptic segmentation, we evaluate the model on the COCO, ADE20K, and Cityscapes [15] datasets. We report the panoptic quality (PQ), semantic quality (SQ), and recognition quality (RQ) for open-vocabulary panoptic segmentation. For promptable segmentation, we report the 1-Point and 1-Box IoU (Oracle) on a wide range of datasets. Oracle denotes that we select the output mask with the max IoU by calculating the IoU between the prediction and target mask. More details can be found in Appendix B. Evaluation & metrics. This paper is available on arxiv under CC BY 4.0 DEED license. This paper is available on arxiv under CC BY 4.0 DEED license. available on arxiv

Part of HackerNoon's growing list of open-source research papers, promoting free access to academic material.

Datasets and Evaluation Methods for Open-Vocabulary Segmentation Tasks

About Author

Comments

TOPICS

THIS ARTICLE WAS FEATURED IN

Related Stories

Advanced Open-Vocabulary Segmentation with Uni-OVSeg

Open-Vocabulary Segmentation with Unpaired Mask-Text Supervision

The Future of Segmentation: Low-Cost Annotation Meets High Performance

Defining Open-Vocabulary Segmentation: Problem Setup, Baseline, and the Uni-OVSeg Framework

he Baseline and Uni-OVSeg Framework for Open-Vocabulary Segmentation

Uni-OVSeg Outperforms Weakly-Supervised and Fully-Supervised Methods in Open-Vocabulary Segmentation

Advanced Open-Vocabulary Segmentation with Uni-OVSeg

Open-Vocabulary Segmentation with Unpaired Mask-Text Supervision

The Future of Segmentation: Low-Cost Annotation Meets High Performance

Defining Open-Vocabulary Segmentation: Problem Setup, Baseline, and the Uni-OVSeg Framework

he Baseline and Uni-OVSeg Framework for Open-Vocabulary Segmentation

Uni-OVSeg Outperforms Weakly-Supervised and Fully-Supervised Methods in Open-Vocabulary Segmentation

Light-Mode

Classic

Newspaper

Minty

Dark-Mode

Neon Noir

Minty

HN StartUps