Authors:
(1) Zhaoqing Wang, The University of Sydney and AI2Robotics;
(2) Xiaobo Xia, The University of Sydney;
(3) Ziye Chen, The University of Melbourne;
(4) Xiao He, AI2Robotics;
(5) Yandong Guo, AI2Robotics;
(6) Mingming Gong, The University of Melbourne and Mohamed bin Zayed University of Artificial Intelligence;
(7) Tongliang Liu, The University of Sydney.
Generic segmentation. Given an image, the segmentation of specific visual concepts has been a long-standing research topic in computer vision, as reflected by the extensive literature on it [23, 33, 44]. Generic segmentation mainly includes semantic segmentation [12, 44, 77, 79], instance segmentation [4, 7, 23], and panoptic segmentation [11, 33, 56], which correspond to different levels of granularity. In more detail, semantic segmentation [12, 20, 28, 38, 78] aims to assign a label to each pixel of the input image according to its semantic class. Instance segmentation [53, 54, 59] further distinguishes different object instances of the same semantic class. Panoptic segmentation [9, 57, 74, 75] combines the characteristics of semantic and instance segmentation. Following a closed-vocabulary assumption, previous works can only predict a predefined set of object categories. In this paper, we aim to build an advanced open-vocabulary segmentation system that can categorise objects and stuff from an open vocabulary in the real world.
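To make the granularity distinction concrete, the sketch below contrasts the typical output formats of the three tasks on a toy image. The array shapes, class names, and instance layout are illustrative assumptions, not data from the paper.

```python
import numpy as np

H, W = 4, 6  # toy image size

# Semantic segmentation: one class label per pixel, no instance identity.
semantic_map = np.zeros((H, W), dtype=np.int64)   # 0 = "sky" (assumed)
semantic_map[2:, :] = 1                           # 1 = "person" (assumed)

# Instance segmentation: one binary mask per object of a "thing" class.
instances = []
for cols, cls in [((0, 3), 1), ((3, 6), 1)]:      # two separate "person" instances
    mask = np.zeros((H, W), dtype=bool)
    mask[2:, cols[0]:cols[1]] = True
    instances.append({"mask": mask, "class_id": cls})

# Panoptic segmentation: every pixel carries a (class, instance) pair,
# covering both "stuff" (sky, instance id 0) and "thing" (person) regions.
panoptic = np.zeros((2, H, W), dtype=np.int64)
panoptic[0] = semantic_map
panoptic[1, 2:, :3] = 1   # instance id 1 for the first person
panoptic[1, 2:, 3:] = 2   # instance id 2 for the second person
```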
Vision foundation models. Recent advances in vision foundation models have diversified optimisation techniques across various learning paradigms. These developments range from vision-only pre-training [2, 24, 25, 61, 62] to joint vision-language pre-training [30, 48, 73], and extend to multi-modal frameworks that integrate visual prompting [1]. A prime example of this evolution is SAM [34], which demonstrates the potential of large-scale training for general segmentation, offering impressive generalisability and scalability. Despite these capabilities, SAM cannot categorise its predicted masks into semantic classes, a limitation imposed by its image-mask pair supervision.
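The class-agnostic nature of SAM's outputs can be seen directly from its public automatic mask generator, as in the minimal usage sketch below; the checkpoint path and input image are placeholders, and this is not part of the proposed method.

```python
import cv2
from segment_anything import sam_model_registry, SamAutomaticMaskGenerator

# Load a released SAM checkpoint (file path is a placeholder).
sam = sam_model_registry["vit_h"](checkpoint="sam_vit_h_4b8939.pth")
mask_generator = SamAutomaticMaskGenerator(sam)

# SAM expects an HxWx3 uint8 RGB array.
image = cv2.cvtColor(cv2.imread("example.jpg"), cv2.COLOR_BGR2RGB)
masks = mask_generator.generate(image)

# Each prediction is a dict with a binary mask and quality scores,
# but no semantic label: the masks are class-agnostic.
for m in masks[:3]:
    print(m["segmentation"].shape, m["predicted_iou"], m["stability_score"])
```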
More recently, Semantic-SAM [36] unifies different sources of human-annotated segmentation datasets and augments SAM with semantic labels and additional levels of granularity. In this work, we aim to develop a more flexible vision foundation model that can be trained with unpaired mask-text supervision (e.g., independent image-mask and image-text pairs) and can be easily adapted to different segmentation tasks.
Open-vocabulary segmentation. Open-vocabulary segmentation addresses the constraints of closed-vocabulary segmentation by allowing the segmentation of a diverse range of classes, including those unseen during training [21, 67, 68, 80]. Existing works [17, 66, 76] leverage pre-trained vision-language models (e.g., CLIP [48] and ALIGN [30]) to perform open-vocabulary segmentation. Most open-vocabulary segmentation methods rely on human-annotated supervision (i.e., image-mask-text triplets) to extend the capability of vision-language models from the image level to the pixel level. To reduce the dependency on this labour-intensive supervision, some weakly-supervised methods use only text supervision [46, 65]. They learn to group image regions into segments, but struggle to distinguish different instances of the same semantic class, and their segmentation performance remains unsatisfactory [64, 85]. This dilemma drives our pursuit of a more advanced open-vocabulary segmentation framework that keeps the annotation cost as low as possible while achieving strong performance.
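A common recipe behind such CLIP-based open-vocabulary segmentation is to score each candidate mask against text embeddings of the vocabulary. The sketch below, built on the public CLIP API, shows a simplified crop-and-classify variant; the image path, placeholder masks, class names, and prompt template are assumptions, and this is not the method proposed in this paper.

```python
import clip
import numpy as np
import torch
from PIL import Image

device = "cuda" if torch.cuda.is_available() else "cpu"
model, preprocess = clip.load("ViT-B/32", device=device)

# Assumed inputs: an RGB image and class-agnostic binary masks (e.g., from SAM).
image = np.array(Image.open("example.jpg").convert("RGB"))
masks = [np.ones(image.shape[:2], dtype=bool)]    # placeholder mask
vocabulary = ["dog", "grass", "frisbee"]          # open-set class names (assumed)

# Encode the vocabulary once with a simple prompt template.
tokens = clip.tokenize([f"a photo of a {c}" for c in vocabulary]).to(device)
with torch.no_grad():
    text_emb = model.encode_text(tokens)
    text_emb /= text_emb.norm(dim=-1, keepdim=True)

# Classify each mask by encoding the crop around its foreground region.
for mask in masks:
    ys, xs = np.where(mask)
    crop = image[ys.min():ys.max() + 1, xs.min():xs.max() + 1]
    crop_in = preprocess(Image.fromarray(crop)).unsqueeze(0).to(device)
    with torch.no_grad():
        img_emb = model.encode_image(crop_in)
        img_emb /= img_emb.norm(dim=-1, keepdim=True)
    scores = (img_emb @ text_emb.T).squeeze(0)    # cosine similarities
    print("predicted class:", vocabulary[int(scores.argmax())])
```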
This paper is available on arxiv under CC BY 4.0 DEED license.