Authors:
(1) Hanoona Rasheed, Mohamed bin Zayed University of AI (equally contributing first author);
(2) Muhammad Maaz, Mohamed bin Zayed University of AI (equally contributing first author);
(3) Sahal Shaji, Mohamed bin Zayed University of AI;
(4) Abdelrahman Shaker, Mohamed bin Zayed University of AI;
(5) Salman Khan, Mohamed bin Zayed University of AI and Australian National University;
(6) Hisham Cholakkal, Mohamed bin Zayed University of AI;
(7) Rao M. Anwer, Mohamed bin Zayed University of AI and Aalto University;
(8) Eric Xing, Mohamed bin Zayed University of AI and Carnegie Mellon University;
(9) Ming-Hsuan Yang, University of California - Merced and Google Research;
(10) Fahad S. Khan, Mohamed bin Zayed University of AI and Linköping University.
Editor's Note: This is Part 5 of 10 of a study detailing the development of an AI model that is designed to describe images to users.
We perform quantitative evaluations of GLaMM on six benchmarks: i) Grounded Conversation Generation (GCG), ii) referring-expression segmentation, iii) region-level captioning, iv) image-level captioning, v) conversational-style question answering, and vi) phrase grounding. We present the first four benchmarks next; the remaining two are discussed in Appendix B.
Referring Expression Segmentation. In this task, the model processes an image and a text-based referring expression to output a segmentation mask. The prompt used is, “Please segment the <referring expression> in the image.” The model responds with “Sure, it is <SEG>.”, where the <SEG> token is decoded to obtain the mask. We achieve better results than recent works like LISA on the refCOCO, refCOCO+, and refCOCOg validation and test sets in Tab. 4. This demonstrates the efficacy of our GranD dataset, which offers the model an extensive concept vocabulary during pre-training (refer to Fig. 5 (middle) and supplementary Fig. 8 for qualitative results).
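The prompt and response format described above can be illustrated with a minimal sketch. It only fills the quoted prompt template and locates `<SEG>` tokens in a response string; the function names are hypothetical, and the mask decoding itself (which depends on the model's internal state at the token) is not shown.

```python
from typing import List

SEG_TOKEN = "<SEG>"
# Prompt wording taken from the text; the placeholder name is an assumption.
PROMPT_TEMPLATE = "Please segment the {expression} in the image."


def build_segmentation_prompt(expression: str) -> str:
    """Insert a referring expression into the segmentation prompt."""
    return PROMPT_TEMPLATE.format(expression=expression.strip())


def find_seg_tokens(response: str) -> List[int]:
    """Return the character offset of every <SEG> token in a response.

    In a real model, each token found here would be handed to a mask
    decoder; this sketch only locates the tokens in the text.
    """
    offsets = []
    start = response.find(SEG_TOKEN)
    while start != -1:
        offsets.append(start)
        start = response.find(SEG_TOKEN, start + len(SEG_TOKEN))
    return offsets


if __name__ == "__main__":
    print(build_segmentation_prompt("dog on the left"))
    print(find_seg_tokens("Sure, it is <SEG>."))
```

A response with no `<SEG>` token yields an empty list, which a caller could treat as "no mask produced".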
Region Level Captioning. In this task, models generate region-specific captions given an image, a user-specified region given as a bounding box, and related text. We use a prompt like, “Can you provide a detailed description of the region <bbox>?”, to instruct the model for this task, where the special token <bbox> is replaced with the actual region representations. We evaluate GLaMM on Visual Genome and refCOCOg using the METEOR and CIDEr metrics, with results presented in Tab. 5. GLaMM shows improved results over GRiT and GPT4RoI after fine-tuning and demonstrates robust zero-shot performance, highlighting the significance of GranD’s region-text pairs (refer to Fig. 5 (left) and supplementary Fig. 9 for qualitative results).
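As a rough illustration of the `<bbox>` substitution, the sketch below fills the placeholder with normalized box coordinates written as text. This textual form is an assumption made purely for illustration: this section does not specify how the region representation is encoded, and the helper names are hypothetical.

```python
from typing import Tuple

# Prompt wording taken from the text.
REGION_PROMPT = "Can you provide a detailed description of the region <bbox>?"

Box = Tuple[float, float, float, float]  # (x1, y1, x2, y2) in pixels


def format_region(box: Box, width: int, height: int) -> str:
    """Hypothetical textual stand-in for a region: normalized corner coordinates."""
    x1, y1, x2, y2 = box
    if not (0 <= x1 < x2 <= width and 0 <= y1 < y2 <= height):
        raise ValueError("box must lie inside the image with x1 < x2 and y1 < y2")
    return "[{:.3f}, {:.3f}, {:.3f}, {:.3f}]".format(
        x1 / width, y1 / height, x2 / width, y2 / height
    )


def build_region_prompt(box: Box, width: int, height: int) -> str:
    """Replace the <bbox> placeholder with the formatted region."""
    return REGION_PROMPT.replace("<bbox>", format_region(box, width, height))


if __name__ == "__main__":
    print(build_region_prompt((64, 48, 320, 240), width=640, height=480))
```

Normalizing by image size keeps the prompt independent of resolution; a model that consumes region features directly would replace `format_region` with its own encoder.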
Image Level Captioning. For this task, GLaMM responds to queries like, “Could you please give me a detailed description of the image?” with a textual description. We evaluate GLaMM’s zero-shot performance on the Flickr30k [37] and NoCaps [1] datasets, with Tab. 6 showing its favorable performance against recent image captioning models and other LMMs (refer to Fig. 5 (right) and supplementary Fig. 10 for qualitative results).
Refer to Appendix C for qualitative results on six downstream tasks, as well as conditional image generation.
This paper is available on arXiv under the CC BY 4.0 DEED license.