Authors:
(1) Hanoona Rasheed, Mohamed bin Zayed University of AI and equally contributing first authors;
(2) Muhammad Maaz, Mohamed bin Zayed University of AI and equally contributing first authors;
(3) Sahal Shaji, Mohamed bin Zayed University of AI;
(4) Abdelrahman Shaker, Mohamed bin Zayed University of AI;
(5) Salman Khan, Mohamed bin Zayed University of AI and Australian National University;
(6) Hisham Cholakkal, Mohamed bin Zayed University of AI;
(7) Rao M. Anwer, Mohamed bin Zayed University of AI and Aalto University;
(8) Eric Xing, Mohamed bin Zayed University of AI and Carnegie Mellon University;
(9) Ming-Hsuan Yang, University of California - Merced and Google Research;
(10) Fahad S. Khan, Mohamed bin Zayed University of AI and Linköping University.

Editor's Note: This is Part 7 of 10 of a study detailing the development of an AI model that is designed to describe images to users. Read the rest below.

Table of Links

Abstract and 1 Introduction
2. Related Work
3. Method
4. Data Annotation Pipeline
5. Experiments
6. Conclusion and References

Supplementary Material (Part 1)

A. Additional Implementation Details
B. Additional Downstream Tasks
C. Additional Qualitative Results

Supplementary Material (Part 2)

D. Dataset Visualization
E. Limitations and Future Work
F. Ethics and Societal Impact

We provide supplementary material for a deeper understanding and more analysis related to the main paper, arranged as follows:

Additional implementation details (Appendix A)
Additional downstream tasks (Appendix B)
Additional qualitative results (Appendix C)
Dataset visualizations (Appendix D)
Limitations and future work (Appendix E)
Ethics and societal impact (Appendix F)

A. Additional Implementation Details

A.1. Evaluation Metrics

Mask Recall: To quantify region-specific grounding, we propose a ‘mask recall’ metric that uses a two-tiered validation approach. First, predicted masks are mapped to ground-truth masks via a one-to-one set assignment, and the IoU is computed for each resulting pair. Pairs surpassing a 0.5 IoU threshold proceed to a textual similarity assessment using BERT. A pair is counted as a true positive (TP) only if both its IoU and its BERT similarity exceed their 0.5 thresholds; otherwise, it is classified as a false positive (FP). Mask recall is then calculated using the standard formula, normalizing the number of TPs by the total number of ground-truth masks.

A.2. Model Architecture and Training

In all of our experiments, we use the Vicuna LLM [60] with 7B parameters. The design of the region encoder is motivated by GPT4RoI [57], and the grounding image encoder and pixel decoder are inspired by LISA [21]. The V-L and L-P layers are implemented as 2-layer MLPs with GELU activation, as in LLaVA-v1.5 [28]. We implement GLaMM in PyTorch and use DeepSpeed ZeRO-2 optimization during training.

Specifically, our model is trained using two types of losses: an auto-regressive cross-entropy loss for text generation, and a linear combination of per-pixel binary cross-entropy loss and DICE loss for segmentation. During training, the global image encoder and grounding image encoder are kept frozen; the region encoder, projection layers (V-L and L-P) and the pixel decoder are fully finetuned, while the LLM is LoRA-finetuned with α = 8. Our code and pretrained models will be publicly released.

A.2.1 Pretraining on GranD

During pretraining, GLaMM is trained on the GranD dataset for referring expression segmentation, region-level captioning, image-level captioning and grounded conversation generation (GCG) tasks simultaneously.
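As an aside on the mask recall metric of A.1, the two-tiered check can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the authors' implementation: masks are represented as sets of pixel coordinates, a greedy highest-IoU-first matching stands in for the one-to-one set assignment (the text does not say which assignment algorithm is used), and `text_sim` is a hypothetical callable standing in for the BERT similarity score. The names `greedy_one_to_one`, `mask_recall` and `exact_match_sim` are illustrative only.

```python
from typing import Callable, List, Sequence, Set, Tuple

Pixel = Tuple[int, int]
Mask = Set[Pixel]  # a mask as the set of (row, col) pixels it covers


def iou(a: Mask, b: Mask) -> float:
    """Intersection over union of two pixel sets (0.0 if both are empty)."""
    union = len(a | b)
    return len(a & b) / union if union else 0.0


def greedy_one_to_one(pred: Sequence[Mask], gt: Sequence[Mask]) -> List[Tuple[int, int, float]]:
    """Assumed stand-in for the one-to-one assignment: pair by descending IoU."""
    candidates = sorted(
        ((iou(p, g), i, j) for i, p in enumerate(pred) for j, g in enumerate(gt)),
        key=lambda t: t[0],
        reverse=True,
    )
    used_pred, used_gt, pairs = set(), set(), []
    for score, i, j in candidates:
        if i not in used_pred and j not in used_gt:
            used_pred.add(i)
            used_gt.add(j)
            pairs.append((i, j, score))
    return pairs


def mask_recall(
    pred_masks: Sequence[Mask],
    pred_texts: Sequence[str],
    gt_masks: Sequence[Mask],
    gt_texts: Sequence[str],
    text_sim: Callable[[str, str], float],  # hypothetical placeholder for BERT similarity
    iou_thr: float = 0.5,
    sim_thr: float = 0.5,
) -> float:
    """TPs (pairs passing both thresholds) divided by the number of ground-truth masks."""
    if not gt_masks:
        return 0.0
    tp = 0
    for i, j, score in greedy_one_to_one(pred_masks, gt_masks):
        # Tier 1: mask overlap. Tier 2: text similarity, only checked if tier 1 passes.
        if score > iou_thr and text_sim(pred_texts[i], gt_texts[j]) > sim_thr:
            tp += 1
    return tp / len(gt_masks)


def exact_match_sim(a: str, b: str) -> float:
    """Toy similarity used only for the example; a real setup would use BERT."""
    return 1.0 if a.strip().lower() == b.strip().lower() else 0.0


gt = [{(0, 0), (0, 1), (1, 0), (1, 1)}, {(5, 5), (5, 6)}]
pred = [{(0, 0), (0, 1), (1, 0)}, {(5, 5)}]  # IoUs with their partners: 0.75 and 0.5
recall = mask_recall(pred, ["a dog", "a ball"], gt, ["a dog", "a ball"], exact_match_sim)
```

In this toy case only the first pair exceeds the IoU threshold (0.5 does not surpass 0.5), so one of two ground-truth masks is recovered. Replacing the greedy matching with an optimal assignment would change only `greedy_one_to_one`.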
We use a batch size of 160 and train for a total of 35K iterations during pretraining. We use LoRA-8 to efficiently adapt the LLM and initialize the pretraining from GPT4RoI [57] for faster convergence. In the experiment tables in Section 5, we refer to the model obtained after pretraining on GranD as GLaMM (ZS).

A.3. Finetuning on Downstream Tasks

We finetune GLaMM on multiple downstream tasks including GCG, referring expression segmentation, region-level captioning and image-level captioning. For GCG, we finetune our model on the GranDf dataset, using a batch size of 160 and training for 5K iterations in total. It is worth noting that the GranDf dataset is a combination of multiple open-source datasets that we repurposed for the GCG task using GPT4 [34]. Please refer to Appendix D for the prompts designed to query GPT4 for constructing the GranDf dataset, along with the dataset visualizations.

For referring expression segmentation, we finetune GLaMM on the refCOCO, refCOCO+ and refCOCOg datasets. We denote this model as GLaMM (FT) in Tab. 4. Similarly, for region-level captioning, GLaMM (FT) is finetuned on the refCOCOg and Visual Genome datasets. For image-level captioning, we finetune GLaMM on the LLaVA-Instruct-150K [29] dataset. For LLaVA-bench, the model is finetuned on the LLaVA-Instruct-80K [29] instruction set. We use eight NVIDIA A100-40GB GPUs in all of our pretraining and finetuning experiments.

A.4. Automated Dataset Annotation Pipeline

Our automated annotation pipeline incorporates diverse state-of-the-art models at various levels. For Level-1, we use Tag2Text [14] and RAM [58] for image tagging, CoDETR [62], EVAv02 [7], OWL-ViT [33], and POMP [40] for object localization, GRiT [48] and GPT4RoI [57] for attribute generation, and MiDAS [39] for depth estimation. Level-2 leverages BLIP-2 [24] and LLaVA-v1.5 [28, 29] for scene descriptions and landmark categorization, SpaCy [11] for phrase extraction, and MDETR [15] for phrase grounding.
For both Level-3 and Level-4, we use Vicuna-v1.5 [60] with 13B parameters, supplemented with in-context examples. Please refer to Appendix A.4 for further details on implementation and LLM prompts used across different pipeline levels.

We design a fully automated dataset annotation pipeline using multiple hierarchical levels in the visual domain to construct the GranD dataset. The segmentation masks for most of the regions are obtained from SAM [18] annotations by comparing our detected labeled regions with SAM-provided class-agnostic regions. For the remaining regions that do not match any of the SAM regions, we run the SAM model with a bounding box query to obtain masks.

Our automated annotation pipeline uses only open-source models and incorporates a feedback loop using chain-of-thought prompting via an LLM. As it does not require human-in-the-loop feedback, it can be scaled to generate dense noisy labels for a larger number of images, which can then be used to pretrain a larger LMM. Given enough compute power, this could be a step towards building a larger generic large multi-modal model. We will release our GranD dataset along with the implementation of our automated dataset annotation pipeline for further research. Below we present the LLM prompts we use at different levels of our automated dataset annotation pipeline.

A.4.1 LLM Prompts and In-context Learning

Landmark categorization: We use the LLaVA-v1.5-13B [28] model to assign landmark categories to each image. Please refer to Tab. 7 for the primary and fine categories used.

Dense Captioning: We arrange objects, attributes and relationships hierarchically to construct a visual scene graph, which is used to query the Vicuna-v1.5-13B [60] model along with in-context examples to generate dense captions. The designed prompt is shown in Fig. 6 (a).

Extra Context: We query the Vicuna-v1.5-13B model to generate additional context about the visual scene.
The prompt designed for this purpose is shown in Fig. 6 (b).

This paper is available on arXiv under CC BY 4.0 DEED license.
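The mask assignment step described in A.4 (reuse a SAM-provided region when it matches a detected region, otherwise query SAM with the bounding box) can be illustrated with the sketch below. Several details are assumptions rather than facts from the text: the comparison is done here with bounding-box IoU, the 0.5 matching threshold is arbitrary, and `box_prompt_fn` is a hypothetical callable standing in for running SAM with a box query. The function name `assign_masks` is also illustrative.

```python
from typing import Any, Callable, List, Sequence, Tuple

Box = Tuple[float, float, float, float]  # (x1, y1, x2, y2)


def box_iou(a: Box, b: Box) -> float:
    """Intersection over union of two axis-aligned boxes."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def assign_masks(
    detected_boxes: Sequence[Box],
    sam_boxes: Sequence[Box],
    sam_masks: Sequence[Any],
    box_prompt_fn: Callable[[Box], Any],  # hypothetical: runs SAM with a box query
    match_thr: float = 0.5,               # assumed threshold, not given in the text
) -> List[Any]:
    """Return one mask per detected box, preferring existing SAM regions."""
    masks = []
    for box in detected_boxes:
        best_idx, best_iou = None, 0.0
        for k, sam_box in enumerate(sam_boxes):
            score = box_iou(box, sam_box)
            if score > best_iou:
                best_idx, best_iou = k, score
        if best_idx is not None and best_iou >= match_thr:
            masks.append(sam_masks[best_idx])   # matched a class-agnostic SAM region
        else:
            masks.append(box_prompt_fn(box))    # fallback: prompt SAM with the box
    return masks


detected = [(0.0, 0.0, 10.0, 10.0), (50.0, 50.0, 60.0, 60.0)]
result = assign_masks(
    detected,
    sam_boxes=[(0.0, 0.0, 10.0, 9.0)],
    sam_masks=["sam_region_0"],
    box_prompt_fn=lambda b: ("box_prompted", b),
)
```

Here the first detection overlaps the existing SAM region closely and reuses its mask, while the second has no overlapping region and falls through to the box-prompted path.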

Part of HackerNoon's growing list of open-source research papers, promoting free access to academic material.

UAE Researchers Spill the Beans on How Their AI Comprehends Images in Detail
