Authors:
(1) Hanoona Rasheed, Mohamed bin Zayed University of AI (equally contributing first author);
(2) Muhammad Maaz, Mohamed bin Zayed University of AI (equally contributing first author);
(3) Sahal Shaji, Mohamed bin Zayed University of AI;
(4) Abdelrahman Shaker, Mohamed bin Zayed University of AI;
(5) Salman Khan, Mohamed bin Zayed University of AI and Australian National University;
(6) Hisham Cholakkal, Mohamed bin Zayed University of AI;
(7) Rao M. Anwer, Mohamed bin Zayed University of AI and Aalto University;
(8) Eric Xing, Mohamed bin Zayed University of AI and Carnegie Mellon University;
(9) Ming-Hsuan Yang, University of California - Merced and Google Research;
(10) Fahad S. Khan, Mohamed bin Zayed University of AI and Linköping University.

Editor's Note: This is Part 5 of 10 of a study detailing the development of an AI model that is designed to describe images to users. Read the rest below.

Table of Links

Abstract and 1 Introduction
2. Related Work
3. Method
4. Data Annotation Pipeline
5. Experiments
6. Conclusion and References

Supplementary Material (Part 1)

A. Additional Implementation Details
B. Additional Downstream Tasks
C. Additional Qualitative Results

Supplementary Material (Part 2)

D. Dataset Visualization
E. Limitations and Future Work
F. Ethics and Societal Impact

5. Experiments

We perform quantitative evaluations of GLaMM on six benchmarks: i) Grounded Conversation Generation (GCG), ii) referring-expression segmentation, iii) region-level captioning, iv) image-level captioning, v) conversational-style question answering, and vi) phrase grounding. We present the first four benchmarks next; the remaining two are discussed in Appendix B.

Referring Expression Segmentation. In this task, the model processes an image and a text-based referring expression and outputs a segmentation mask. The prompt used is “Please segment the <referring expression> in the image.” The model responds with “Sure, it is <SEG>.”, where the <SEG> token is decoded to obtain the mask. We achieve better results than recent works such as LISA on the refCOCO, refCOCO+, and refCOCOg validation and test sets (Tab. 4). This demonstrates the efficacy of our GranD dataset, which offers the model an extensive concept vocabulary during pre-training (refer to Fig. 5 (middle) and supplementary Fig. 8 for qualitative results).

Region Level Captioning. In this task, models generate region-specific captions given an image, a user-specified region in the form of a bounding box, and related text. We use a prompt such as “Can you provide a detailed description of the region <bbox>?” to instruct the model for this task, where the special token <bbox> is replaced with the actual region representations. We evaluate GLaMM on Visual Genome and refCOCOg using the METEOR and CIDEr metrics, with results presented in Tab. 5. GLaMM shows improved results over GRiT and GPT4RoI after fine-tuning and demonstrates robust zero-shot performance, highlighting the significance of GranD’s region-text pairs (refer to Fig. 5 (left) and supplementary Fig. 9 for qualitative results).

Image Level Captioning. For this task, GLaMM responds to queries such as “Could you please give me a detailed description of the image?” with a textual description. We evaluate GLaMM’s zero-shot performance on the Flickr30k [37] and NoCap [1] datasets, with Tab. 6 showing its favorable performance against recent image captioning models and other LMMs (refer to Fig. 5 (right) and supplementary Fig. 10 for qualitative results).

Refer to Appendix C for qualitative results on six downstream tasks, as well as conditional image generation.

This paper is available on arXiv under the CC BY 4.0 DEED license.
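The prompt and response formats quoted in Section 5 can be illustrated with a small, hedged sketch. It is not the authors' code: the three prompt strings and the <SEG> and <bbox> tokens come from the text, while the function names (`build_refer_seg_prompt`, `split_at_bbox`, `seg_token_positions`) are hypothetical helpers introduced here only to show where a region representation would be spliced in and where a mask decoder would be triggered.

```python
from typing import List, Tuple

# Prompt strings and special tokens quoted in the text above.
REFER_SEG_TEMPLATE = "Please segment the {expression} in the image."
REGION_CAPTION_PROMPT = "Can you provide a detailed description of the region <bbox>?"
IMAGE_CAPTION_PROMPT = "Could you please give me a detailed description of the image?"
SEG_TOKEN = "<SEG>"
BBOX_TOKEN = "<bbox>"


def build_refer_seg_prompt(expression: str) -> str:
    """Fill the referring-expression slot of the segmentation prompt."""
    return REFER_SEG_TEMPLATE.format(expression=expression.strip())


def split_at_bbox(prompt: str) -> Tuple[str, str]:
    """Split a region prompt around <bbox>.

    Hypothetical helper: the two halves mark where a region representation
    (however the model encodes it) would be inserted.
    """
    before, sep, after = prompt.partition(BBOX_TOKEN)
    if not sep:
        raise ValueError("prompt has no <bbox> placeholder")
    return before, after


def seg_token_positions(response: str) -> List[int]:
    """Return the character offset of every <SEG> token in a response.

    Hypothetical helper: each offset marks a point where a mask decoder
    would be invoked to turn the token into a segmentation mask.
    """
    positions = []
    start = response.find(SEG_TOKEN)
    while start != -1:
        positions.append(start)
        start = response.find(SEG_TOKEN, start + len(SEG_TOKEN))
    return positions


if __name__ == "__main__":
    print(build_refer_seg_prompt("dog on the left"))
    print(split_at_bbox(REGION_CAPTION_PROMPT))
    print(seg_token_positions("Sure, it is <SEG>."))
```

In this sketch the prompts stay as plain strings; how the image, the region features, and the hidden state of the <SEG> token are actually fed to the model and the mask decoder is outside what this section describes.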

