Authors:
(1) Hanoona Rasheed, Mohamed bin Zayed University of AI (equally contributing first author);
(2) Muhammad Maaz, Mohamed bin Zayed University of AI (equally contributing first author);
(3) Sahal Shaji, Mohamed bin Zayed University of AI;
(4) Abdelrahman Shaker, Mohamed bin Zayed University of AI;
(5) Salman Khan, Mohamed bin Zayed University of AI and Australian National University;
(6) Hisham Cholakkal, Mohamed bin Zayed University of AI;
(7) Rao M. Anwer, Mohamed bin Zayed University of AI and Aalto University;
(8) Eric Xing, Mohamed bin Zayed University of AI and Carnegie Mellon University;
(9) Ming-Hsuan Yang, University of California - Merced and Google Research;
(10) Fahad S. Khan, Mohamed bin Zayed University of AI and Linköping University.
Editor's Note: This is Part 9 of 10 of a study detailing the development of an AI model that is designed to describe images to users. Read the rest below.
Supplementary Material (Part 1)
Supplementary Material (Part 2)
In this section, we provide more qualitative examples to better understand the capabilities of GLaMM.
Fig. 7 shows qualitative results of GLaMM fine-tuned on the GranDf dataset. The model can produce dense captions and provide dense pixel-level grounding of the captions.
Fig. 8 shows the effectiveness of GLaMM in understanding the natural language query and segmenting the corresponding objects. Note that GLaMM can also segment multiple objects via multi-round conversations.
Fig. 9 shows the qualitative results of GLaMM for region-level understanding. Our model can generate detailed descriptions of user-specified regions in an image.
Fig. 10 shows GLaMM’s qualitative results on captioning tasks. Our model can generate dense captions for images.
Fig. 12 shows GLaMM’s seamless integration for generative tasks. We use the Stable Diffusion inpainting model stable-diffusion-xl-1.0-inpainting [41] for this task. We first generate a segmentation mask using our GLaMM model based on the user query. This segmentation mask, along with the user prompt, is given as input to the Stable Diffusion inpainting model, which generates the final output.
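A minimal sketch of this two-step pipeline is shown below, assuming the Hugging Face diffusers library for the SDXL inpainting model. The `glamm_segment` helper, the model id string, and all prompts are illustrative assumptions, not part of the paper's released code.

```python
# Sketch of the mask-then-inpaint pipeline described above.
# `glamm_segment` is a hypothetical stand-in for GLaMM's mask generation step.
import torch
from PIL import Image
from diffusers import AutoPipelineForInpainting


def glamm_segment(image: Image.Image, query: str) -> Image.Image:
    """Hypothetical helper: run GLaMM on (image, query) and return a
    binary mask (white = region to edit) as a PIL image."""
    raise NotImplementedError


# Load the SDXL inpainting model referenced in the text [41]
# (assumed Hugging Face model id).
pipe = AutoPipelineForInpainting.from_pretrained(
    "diffusers/stable-diffusion-xl-1.0-inpainting-0.1",
    torch_dtype=torch.float16,
).to("cuda")

image = Image.open("input.jpg").convert("RGB").resize((1024, 1024))

# Step 1: GLaMM grounds the user's query into a segmentation mask.
mask = glamm_segment(image, "the vase on the table")

# Step 2: the mask and the user's edit prompt drive the inpainting model,
# which produces the final edited image.
edited = pipe(
    prompt="a bouquet of red roses in a glass vase",
    image=image,
    mask_image=mask,
    num_inference_steps=30,
).images[0]
edited.save("edited.jpg")
```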
Figure 14. Dataset samples from GranDf. The figure shows the GPT-4 [34] prompts used and the resulting dataset samples from the GranDf dataset. This repurposed human-annotated dataset provides rich semantics to GLaMM for the GCG task.
Fig. 13 illustrates the unique ability of GLaMM to engage in multi-purpose task conversations. GLaMM is a generic conversational model that can accept prompts in the form of text and/or regions and can answer in the form of text and/or segmentation masks. Note that our model is not explicitly trained to handle such scenarios; this behavior emerges mainly from our pretraining on the GranD dataset, where an image is presented to the LMM in different contexts.
This paper is available on arxiv under CC BY 4.0 DEED license.