Authors:
(1) Hanoona Rasheed, Mohamed bin Zayed University of AI (equally contributing first author);
(2) Muhammad Maaz, Mohamed bin Zayed University of AI (equally contributing first author);
(3) Sahal Shaji, Mohamed bin Zayed University of AI;
(4) Abdelrahman Shaker, Mohamed bin Zayed University of AI;
(5) Salman Khan, Mohamed bin Zayed University of AI and Australian National University;
(6) Hisham Cholakkal, Mohamed bin Zayed University of AI;
(7) Rao M. Anwer, Mohamed bin Zayed University of AI and Aalto University;
(8) Eric Xing, Mohamed bin Zayed University of AI and Carnegie Mellon University;
(9) Ming-Hsuan Yang, University of California - Merced and Google Research;
(10) Fahad S. Khan, Mohamed bin Zayed University of AI and Linköping University.
Editor's Note: This is Part 1 of 10 of a study detailing the development of an AI model that is designed to describe images to users. Read the rest below.
Large Multimodal Models (LMMs) extend Large Language Models to the vision domain. Initial LMMs used holistic images and text prompts to generate ungrounded textual responses. Recently, region-level LMMs have been used to generate visually grounded responses. However, they are limited to referring to only a single object category at a time, require users to specify the regions, or cannot offer dense pixel-wise object grounding. In this work, we present Grounding LMM (GLaMM), the first model that can generate natural language responses seamlessly intertwined with corresponding object segmentation masks. GLaMM not only grounds objects appearing in the conversations but is flexible enough to accept both textual and optional visual prompts (regions of interest) as input. This empowers users to interact with the model at various levels of granularity, in both the textual and visual domains. Due to the lack of standard benchmarks for the novel setting of visually Grounded Conversation Generation (GCG), we introduce a comprehensive evaluation protocol with our curated grounded conversations. Our proposed GCG task requires densely grounded concepts in natural scenes at a large scale. To this end, we propose the densely annotated Grounding-anything Dataset (GranD), built with an automated annotation pipeline, which encompasses 7.5M unique concepts grounded in a total of 810M regions, each with a segmentation mask. Besides GCG, GLaMM also performs effectively on several downstream tasks, e.g., referring expression segmentation, image and region-level captioning, and vision-language conversations.
Fueled by the generative AI wave, Large Multimodal Models (LMMs) have emerged as a pivotal advancement, bridging the gap between vision and language tasks [2]. Initial efforts like [6, 8, 22, 29, 52, 61] demonstrate effective textual responses based on input images. Although these models are sophisticated, they still cannot ground their responses in the visual context. Such grounding is crucial for advanced applications like detailed visual understanding, interactive embodied agents, and localized content manipulation. Recent efforts have started to address this limitation by enabling models to process user-defined regions specified via bounding boxes [5, 31, 35, 36, 57].
A few recent works have explored grounded text response generation [5, 21, 35, 59] but do not provide detailed pixel-level groundings. In parallel, efforts have been made in the referring segmentation literature to ground textual descriptions in natural images [21]. However, these methods are limited to grounding a single object and cannot engage in natural, coherent conversations, thereby restricting their practical applicability in interactive tasks that demand a deep understanding of both visual and textual content. To address these limitations of existing works, we introduce Grounding LMM (GLaMM), which simultaneously provides in-depth region understanding, pixel-level groundings, and conversational abilities through an end-to-end training approach (see Fig. 1 and Tab. 1).
To address the lack of benchmarks for visually grounded conversations, we introduce the novel task of Grounded Conversation Generation (GCG). The GCG task aims to produce natural language responses interleaved with object segmentation masks. This challenging task unifies several existing tasks in computer vision that are typically treated in isolation, i.e., referring expression segmentation, image and region-level captioning, phrase grounding, and vision-language conversations. As a result, our unified model and proposed pretraining dataset can effectively transfer to several downstream tasks (referring expression segmentation, region-level captioning, image captioning, and conversational-style QA). We present GLaMM as the first model specifically designed for this challenging task. Unlike prior works, GLaMM can work with both textual and visual prompts and can generate visually grounded outputs, thus offering a versatile user experience.
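The introduction does not pin down a concrete data structure for a GCG output, so the following minimal Python sketch is purely illustrative: the class names, fields, and example sentence are assumptions, not GLaMM's actual interface. It only shows the kind of object a GCG model must produce, namely free-form text in which selected phrases are tied to pixel-level segmentation masks.

```python
# Minimal, illustrative sketch (not GLaMM's actual API) of a grounded response:
# free-form text whose phrases are each linked to a binary segmentation mask.
from dataclasses import dataclass, field
from typing import List, Tuple

import numpy as np


@dataclass
class GroundedPhrase:
    span: Tuple[int, int]   # character offsets of the phrase within the text
    mask: np.ndarray        # binary segmentation mask of shape (H, W)


@dataclass
class GroundedResponse:
    text: str                                         # natural-language response
    phrases: List[GroundedPhrase] = field(default_factory=list)


# Toy example with dummy (all-zero) masks for two grounded phrases.
H, W = 480, 640
response = GroundedResponse(
    text="A dog is leaping to catch a frisbee in the park.",
    phrases=[
        GroundedPhrase(span=(0, 5), mask=np.zeros((H, W), dtype=bool)),    # "A dog"
        GroundedPhrase(span=(26, 35), mask=np.zeros((H, W), dtype=bool)),  # "a frisbee"
    ],
)

for p in response.phrases:
    print(response.text[p.span[0]:p.span[1]], p.mask.shape)
```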
Detailed region-level understanding requires the laborious process of collecting large-scale annotations for image regions. To alleviate this manual labeling effort, we propose an automated pipeline to annotate the large-scale Grounding-anything Dataset (GranD). Leveraging the automated pipeline with dedicated verification steps, GranD comprises 7.5M unique concepts anchored in 810M regions, each with a segmentation mask. Using state-of-the-art vision and language models, the pipeline annotates SAM [18] images through a multi-level hierarchical scheme that enhances annotation quality. With 11M images, 84M referring expressions, and 33M grounded captions, GranD sets a new benchmark in comprehensiveness. In addition to the automatically generated dataset for GCG, we provide the first high-quality dataset for grounded conversations, obtained by revamping existing manually annotated datasets [16, 37, 49] for GCG using GPT-4 [34] in-context learning. We refer to this high-quality dataset as GranDf, denoting its suitability for fine-tuning.
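The actual stages and models of the GranD pipeline are detailed later in the paper; purely to make the idea of a multi-level annotation scheme concrete, here is a hypothetical Python skeleton. Every function name (detect_and_segment, describe_region, compose_grounded_caption), the three "levels", and the placeholder outputs are illustrative assumptions standing in for off-the-shelf vision and language models.

```python
# Hypothetical skeleton of a multi-level automated annotation pipeline of the
# kind described above. All helpers are placeholders, not GranD's real stages.
from typing import Dict, List


def detect_and_segment(image) -> List[Dict]:
    """Level 1 (assumed): propose object regions with pixel-level masks."""
    # Placeholder output; in practice an open-vocabulary detector plus a
    # promptable segmenter such as SAM would supply real masks here.
    return [{"label": "dog", "mask": None}, {"label": "frisbee", "mask": None}]


def describe_region(image, region: Dict) -> str:
    """Level 2 (assumed): attribute-rich referring expression for one region."""
    return f"a {region['label']}"  # stand-in for a region-captioning model


def compose_grounded_caption(image, regions: List[Dict]) -> str:
    """Level 3 (assumed): scene-level caption grounded in the listed regions."""
    return "A scene containing " + " and ".join(r["phrase"] for r in regions) + "."


def annotate(image) -> Dict:
    regions = detect_and_segment(image)
    for r in regions:
        r["phrase"] = describe_region(image, r)          # region-level text
    caption = compose_grounded_caption(image, regions)   # image-level grounded text
    # A verification/filtering step (e.g., dropping low-confidence or
    # inconsistent annotations) would sit here before writing to the dataset.
    return {"regions": regions, "grounded_caption": caption}


print(annotate(image=None))
```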
Our work has three main contributions:
• We present GLaMM, the first model capable of generating natural language responses seamlessly integrated with object segmentation masks. Unlike existing models, GLaMM accommodates textual and visual prompts, facilitating enhanced multimodal user interaction.
• Recognizing the lack of standardized benchmarks for visually grounded conversations, we propose the new Grounded Conversation Generation (GCG) task, which unifies multiple otherwise isolated tasks. We also introduce a comprehensive evaluation protocol to measure the efficacy of models for GCG, filling a significant gap in the literature.
• To facilitate model training and evaluation, we create the Grounding-anything Dataset (GranD), a large-scale densely annotated dataset. Developed using an automatic annotation pipeline and verification criteria, it encompasses 7.5M unique concepts grounded in 810M regions. Additionally, we propose GranDf, a high-quality dataset explicitly designed for fine-tuning on the GCG task, created by repurposing existing open-source datasets.
This paper is available on arXiv under the CC BY 4.0 DEED license.