Authors:
(1) Hanoona Rasheed, Mohamed bin Zayed University of AI and equally contributing first authors;
(2) Muhammad Maaz, Mohamed bin Zayed University of AI and equally contributing first authors;
(3) Sahal Shaji, Mohamed bin Zayed University of AI;
(4) Abdelrahman Shaker, Mohamed bin Zayed University of AI;
(5) Salman Khan, Mohamed bin Zayed University of AI and Australian National University;
(6) Hisham Cholakkal, Mohamed bin Zayed University of AI;
(7) Rao M. Anwer, Mohamed bin Zayed University of AI and Aalto University;
(8) Eric Xing, Mohamed bin Zayed University of AI and Carnegie Mellon University;
(9) Ming-Hsuan Yang, University of California - Merced and Google Research;
(10) Fahad S. Khan, Mohamed bin Zayed University of AI and Linköping University.
Editor's Note: This is Part 3 of 10 of a study detailing the development of an AI model that is designed to describe images to users. Read the rest below.
Supplementary Material (Part 1)
Supplementary Material (Part 2)
Existing Large Multimodal Models (LMMs) either generate ungrounded text or are restricted by limitations such as single-object grounding, user-specified region inputs, or the lack of dense pixel-level object grounding (see Tab. 1). Our Grounding LMM (GLaMM) aims to overcome these limitations by generating natural language responses seamlessly integrated with object segmentation masks. This enables a visually grounded human-machine conversation.
GLaMM consists of five core components: i) Global Image Encoder, ii) Region Encoder, iii) LLM, iv) Grounding Image Encoder, and v) Pixel Decoder. These components are cohesively designed to handle both textual and optional visual prompts (image-level and region-level), allowing interaction at multiple levels of granularity and generating grounded text responses (see Fig. 2). Together, these blocks enable scene-level, region-level, and pixel-level grounding, as explained next. Training specifics are detailed in Appendix A.2.
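To make the composition concrete, here is a minimal sketch (our own, not the authors' implementation) of how these five components and the projection layer could be wired together in a forward pass; the module names, call signatures, and output fields are illustrative assumptions.

```python
# Illustrative wiring of GLaMM's five components; all module names,
# signatures, and the `out` fields below are assumptions for exposition.
import torch.nn as nn

class GLaMMSketch(nn.Module):
    def __init__(self, global_enc, region_enc, llm, grounding_enc, pixel_dec, lp_proj):
        super().__init__()
        self.global_enc = global_enc        # i)  global image encoder (CLIP-style)
        self.region_enc = region_enc        # ii) region encoder (RoI features)
        self.llm = llm                      # iii) large language model
        self.grounding_enc = grounding_enc  # iv) grounding image encoder (SAM-style)
        self.pixel_dec = pixel_dec          # v)  pixel decoder (mask head)
        self.lp_proj = lp_proj              # language-to-prompt projection g

    def forward(self, image, text_ids, boxes=None):
        img_feats = self.global_enc(image)                   # scene-level features
        roi_feats = self.region_enc(img_feats, boxes) if boxes is not None else None
        out = self.llm(text_ids, img_feats, roi_feats)       # grounded text + hidden states
        seg_embeds = out.hidden_states[out.seg_token_mask]   # embeddings at <SEG> positions
        masks = self.pixel_dec(self.grounding_enc(image), self.lp_proj(seg_embeds))
        return out.text, masks
```

The paragraphs that follow fill in the pieces this sketch glosses over: the region encoder, the <SEG>-driven pixel decoder, and the training setup.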
Region-Level Understanding: Addressing the shortcomings of existing models that can handle only image-level visual inputs, and in alignment with recent work [57], the region encoder (R) extends the model’s capability to interpret and interact with user-specified regions in an image. This component constructs a hierarchical feature pyramid from four selected CLIP global image encoder layers, followed by RoIAlign [10] to generate a 14×14 feature map. Combining these features yields a unified region-of-interest (RoI) representation. To facilitate region-targeted responses from GLaMM, we augment the existing vocabulary with a specialized token <bbox>. This is integrated into a prompt such as, “The <image> provides an overview of the image. Can you provide a detailed description of the region <bbox>?”. Here, the <bbox> token is replaced with the RoI-extracted features.
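As a rough illustration of this region pathway, the snippet below applies torchvision’s RoIAlign to four feature maps standing in for the selected CLIP layers and fuses them into a 14×14 RoI representation. The tensor shapes, the assumed input resolution, and the simple summation used for fusion are our own choices for the sketch, not the paper’s exact configuration.

```python
# Minimal sketch of region-level feature extraction; shapes, resolution,
# and the sum-fusion below are illustrative assumptions.
import torch
from torchvision.ops import roi_align

B, C, H, W = 1, 1024, 24, 24                       # assumed CLIP patch grid reshaped to 2D
clip_layers = [torch.randn(B, C, H, W) for _ in range(4)]  # four selected encoder layers

# One user-specified box per image in (x1, y1, x2, y2) pixel coordinates.
boxes = [torch.tensor([[32.0, 48.0, 160.0, 200.0]])]
img_size = 336                                     # assumed CLIP input resolution

# RoIAlign each pyramid level to a 14x14 grid, then fuse (here: simple sum).
roi_feats = [
    roi_align(level, boxes, output_size=(14, 14),
              spatial_scale=W / img_size, aligned=True)
    for level in clip_layers
]
roi_repr = torch.stack(roi_feats).sum(dim=0)       # unified RoI representation
print(roi_repr.shape)                              # torch.Size([1, 1024, 14, 14])
```

In use, this RoI representation would be projected into the LLM’s embedding space and substituted for the <bbox> token in the prompt, as described above.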
Pixel-Level Grounding: Utilizing the grounding image encoder, denoted as V, and the pixel decoder, represented as P, GLaMM facilitates fine-grained pixel-level object grounding, allowing it to ground its responses visually. We instantiate V with a pretrained SAM encoder [18] and design P based on a SAM decoder-like architecture. To activate pixel-level grounding, our model’s vocabulary is augmented with a specialized token, <SEG>. Prompts such as “Please segment the ‘man in red’ in the given image” trigger the model to generate responses with corresponding <SEG> tokens. A language-to-prompt (L-P) projection layer (g) transforms the last-layer embeddings corresponding to the <SEG> tokens (lseg) into the decoder’s feature space. Subsequently, P produces the binary segmentation masks M = P(g(lseg), V(x)), where x denotes the input image.
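The sketch below illustrates this <SEG>-token pathway: selecting the last-layer embeddings at <SEG> positions, applying an L-P projection, and decoding one mask per token. The linear layer and the dot-product “decoder” are simplified stand-ins for the SAM decoder-like P, and the dimensions are arbitrary.

```python
# Simplified <SEG>-token grounding path; the linear projection and the
# dot-product decoder are stand-ins for the paper's SAM-style decoder P.
import torch
import torch.nn as nn

llm_dim, prompt_dim, H, W = 4096, 256, 64, 64

g = nn.Linear(llm_dim, prompt_dim)                 # L-P projection layer g

hidden = torch.randn(1, 32, llm_dim)               # LLM last-layer hidden states
seg_positions = torch.zeros(1, 32, dtype=torch.bool)
seg_positions[0, [7, 19]] = True                   # positions of two <SEG> tokens

l_seg = hidden[seg_positions]                      # (num_seg, llm_dim)
prompts = g(l_seg)                                 # (num_seg, prompt_dim)

img_feats = torch.randn(prompt_dim, H, W)          # stand-in for V(x) features
logits = torch.einsum("nc,chw->nhw", prompts, img_feats)
masks = logits.sigmoid() > 0.5                     # one binary mask per <SEG> token
print(masks.shape)                                 # torch.Size([2, 64, 64])
```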
Using an end-to-end training approach, GLaMM excels in region understanding, pixel-level grounding, and conversational capabilities. However, due to the lack of standard benchmarks for the novel setting of generating visually grounded detailed conversations, we introduce a novel task, Grounded Conversation Generation (GCG), and a comprehensive evaluation protocol as explained next.
The objective of the GCG task is to construct image-level captions with specific phrases directly tied to corresponding segmentation masks in the image. For example, “<A man> and <a boy> sit on <a bench> next to <an old white car>.”, shown in Fig. 3 (left), features how each bracketed phrase (highlighted in the image) is anchored to a unique image segmentation mask. This creates a densely annotated caption that aligns textual descriptions with visual regions, enriching the image’s contextual interpretation.
GCG Output Representation: A sample prompt for querying the model in this task is: “Could you please give me a detailed description of the image? Please respond with interleaved segmentation masks for the corresponding parts of the answer.” The model generates a detailed caption along with interleaved segmentation masks, employing the format “<p>A man</p><SEG> and <p>a boy</p><SEG> sit on <p>a bench</p><SEG> next to <p>an old white car</p><SEG>.” We use special tokens, namely <p>, </p>, and <SEG>, to delineate the start and end of each phrase and its corresponding region mask, respectively.
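For clarity, here is a small helper (our own, not from the paper’s code) that pairs each <p>…</p> phrase in such a response with the index of its <SEG> token, which is how one might associate phrases with predicted masks downstream.

```python
# Hypothetical parser for the interleaved GCG output format.
import re

def parse_gcg(response: str):
    """Return (phrase, seg_index) pairs from an interleaved GCG caption."""
    pattern = re.compile(r"<p>\s*(.*?)\s*</p>\s*<SEG>")
    return [(m.group(1), i) for i, m in enumerate(pattern.finditer(response))]

caption = ("<p>A man</p><SEG> and <p>a boy</p><SEG> sit on <p>a bench</p><SEG> "
           "next to <p>an old white car</p><SEG>.")
print(parse_gcg(caption))
# [('A man', 0), ('a boy', 1), ('a bench', 2), ('an old white car', 3)]
```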
Our GranD dataset is meticulously constructed using a stage-wise annotation pipeline, capturing annotations that range from fine-grained specifics to high-level context. This enables the automatic generation of densely annotated captions well-suited for the GCG task, thereby significantly facilitating GLaMM’s training for this task. Some qualitative results of our model on the GCG task are shown in Fig. 3.
Evaluation Criteria: We introduce a benchmarking suite for GCG, with a validation set of 2.5K images and a test set of 5K images. Four key aspects are evaluated: i) generated dense caption quality, ii) mask-to-phrase correspondence accuracy, iii) generated mask quality, and iv) region-specific grounding ability. Metrics include METEOR and CIDEr for captions, class-agnostic mask AP for grounding, mask IoU for segmentation, and mask recall for region-specific grounding (refer to Appendix A.1 for details).
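As an example of one of these metrics, the snippet below computes mask IoU between a predicted and a ground-truth binary mask. It is a simplified illustration only and omits the phrase-to-mask matching and aggregation steps of the full protocol in Appendix A.1.

```python
# Illustrative mask IoU computation; simplified relative to the full
# evaluation protocol described in Appendix A.1.
import numpy as np

def mask_iou(pred: np.ndarray, gt: np.ndarray) -> float:
    """IoU between two binary masks of the same shape."""
    pred, gt = pred.astype(bool), gt.astype(bool)
    inter = np.logical_and(pred, gt).sum()
    union = np.logical_or(pred, gt).sum()
    return float(inter) / float(union) if union > 0 else 0.0

pred = np.zeros((64, 64)); pred[10:40, 10:40] = 1
gt = np.zeros((64, 64)); gt[15:45, 15:45] = 1
print(round(mask_iou(pred, gt), 3))   # 625 / 1175 ≈ 0.532
```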
Having delineated the architecture of GLaMM and the intricacies of the GCG task, it becomes imperative to address the scarcity of large-scale annotated data for region-level understanding. We next focus on devising a new, densely annotated dataset to optimize the model’s performance and overcome this data limitation.
This paper is available on arxiv under CC BY 4.0 DEED license.