Authors:
(1) Hanoona Rasheed, Mohamed bin Zayed University of AI and equally contributing first authors;
(2) Muhammad Maaz, Mohamed bin Zayed University of AI and equally contributing first authors;
(3) Sahal Shaji, Mohamed bin Zayed University of AI;
(4) Abdelrahman Shaker, Mohamed bin Zayed University of AI;
(5) Salman Khan, Mohamed bin Zayed University of AI and Australian National University;
(6) Hisham Cholakkal, Mohamed bin Zayed University of AI;
(7) Rao M. Anwer, Mohamed bin Zayed University of AI and Aalto University;
(8) Eric Xing, Mohamed bin Zayed University of AI and Carnegie Mellon University;
(9) Ming-Hsuan Yang, University of California - Merced and Google Research;
(10) Fahad S. Khan, Mohamed bin Zayed University of AI and Linköping University.

Editor's Note: This is Part 8 of 10 of a study detailing the development of an AI model that is designed to describe images to users. Read the rest below.

Table of Links

Abstract and 1 Introduction
2. Related Work
3. Method
4. Data Annotation Pipeline
5. Experiments
6. Conclusion and References

Supplementary Material (Part 1)

A. Additional Implementation Details
B. Additional Downstream Tasks
C. Additional Qualitative Results

Supplementary Material (Part 2)

D. Dataset Visualization
E. Limitations and Future Work
F. Ethics and Societal Impact

B. Additional Downstream Tasks

B.1. Phrase Grounding

To adapt GLaMM for phrase grounding, we repurpose the GCG dataset. Specifically, the answers in the GCG dataset are now used as questions, and the parts of the captions containing groundings are regarded as phrases. The model is then trained to locate pixel-level groundings for these phrases, which are enclosed within <p> and </p> tokens. The results of this adaptation are shown in the following figure.

B.2. Conversational Style Question Answering

We evaluate our model on LLaVA-Bench [28, 29], which uses GPT-4 to evaluate models. The benchmark covers three types of tasks: conversational question answering, detailed description, and complex reasoning. The evaluation provides insight into the model's conversational and reasoning capabilities. Tab. 8 compares GLaMM with previous open-source models. GLaMM performs on par with the recently released LLaVA-1.5, which leverages additional data for vision-to-language alignment. Qualitative results are shown in Fig. 11 and Fig. 13.

This paper is available on arxiv under CC BY 4.0 DEED license.
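As a supplementary illustration of the phrase-grounding adaptation described in B.1, the following is a minimal sketch of how a GCG-style caption might be turned into a phrase-grounding query. It is not the authors' code. The `GroundedPhrase` record, the `wrap_phrases` function, and the string `mask_id` placeholders (standing in for pixel-level masks) are all hypothetical names introduced here. The only detail taken from the text is that grounded phrases are enclosed within <p> and </p> tokens.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class GroundedPhrase:
    """Hypothetical record: a caption span plus a handle to its mask."""
    start: int      # inclusive character offset into the caption
    end: int        # exclusive character offset into the caption
    mask_id: str    # placeholder standing in for a pixel-level mask


def wrap_phrases(caption: str,
                 phrases: List[GroundedPhrase]) -> Tuple[str, List[str]]:
    """Enclose each grounded span in <p>...</p> and return the tagged
    caption together with the mask handles in left-to-right order."""
    ordered = sorted(phrases, key=lambda p: p.start)
    pieces: List[str] = []
    cursor = 0
    for p in ordered:
        # Reject empty, out-of-range, or overlapping spans.
        if p.start < cursor or p.start >= p.end or p.end > len(caption):
            raise ValueError(f"invalid or overlapping span: {p}")
        pieces.append(caption[cursor:p.start])             # untagged text
        pieces.append(f"<p>{caption[p.start:p.end]}</p>")  # tagged phrase
        cursor = p.end
    pieces.append(caption[cursor:])                        # trailing text
    return "".join(pieces), [p.mask_id for p in ordered]


if __name__ == "__main__":
    caption = "A man riding a horse on the beach"
    start = caption.index("a horse")
    phrases = [
        GroundedPhrase(start, start + len("a horse"), "mask_horse"),
        GroundedPhrase(0, len("A man"), "mask_man"),
    ]
    query, targets = wrap_phrases(caption, phrases)
    print(query)
    print(targets)
```

In this sketch the tagged caption plays the role of the question, and the ordered mask handles play the role of the segmentation targets the model would be trained to predict for each phrase.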

Part of HackerNoon's growing list of open-source research papers, promoting free access to academic material.
