Authors:
(1) Hanoona Rasheed, Mohamed bin Zayed University of AI and equally contributing first authors;
(2) Muhammad Maaz, Mohamed bin Zayed University of AI and equally contributing first authors;
(3) Sahal Shaji, Mohamed bin Zayed University of AI;
(4) Abdelrahman Shaker, Mohamed bin Zayed University of AI;
(5) Salman Khan, Mohamed bin Zayed University of AI and Australian National University;
(6) Hisham Cholakkal, Mohamed bin Zayed University of AI;
(7) Rao M. Anwer, Mohamed bin Zayed University of AI and Aalto University;
(8) Eric Xing, Mohamed bin Zayed University of AI and Carnegie Mellon University;
(9) Ming-Hsuan Yang, University of California - Merced and Google Research;
(10) Fahad S. Khan, Mohamed bin Zayed University of AI and Linköping University.
Editor's Note: This is Part 7 of 10 of a study detailing the development of an AI model that is designed to describe images to users. Read the rest below.
Supplementary Material (Part 1)
Supplementary Material (Part 2)
We provide supplementary material for a deeper understanding and more analysis related to the main paper, arranged as follows:
Mask Recall: To quantify region-specific grounding, we propose a ‘mask recall’ metric, utilizing a two-tiered validation approach. Initially, predicted masks are mapped to ground-truth masks via a one-to-one set assignment, followed by IoU computation for these pairs. Pairs surpassing a 0.5 IoU threshold proceed to a textual similarity assessment using BERT. A pair is considered a true positive (TP) only if both IoU and BERT similarity exceed their 0.5 thresholds; otherwise, it is classified as a false positive (FP). The mask recall is subsequently calculated using the standard formula, normalizing the number of TPs by the total ground-truth mask count.
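The two-tiered check described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' evaluation code: masks are represented as sets of pixel coordinates, the one-to-one assignment is found by exhaustive search that maximizes total IoU (the text does not name the assignment algorithm), and `text_sim` is a hypothetical callable standing in for the BERT similarity. The `token_jaccard` function is only a toy placeholder so the example runs without a model.

```python
from itertools import permutations


def mask_iou(a, b):
    """IoU of two masks given as sets of (row, col) pixel coordinates."""
    union = len(a | b)
    return len(a & b) / union if union else 0.0


def one_to_one_assignment(pred_masks, gt_masks):
    """Exhaustively pick the one-to-one pairing with the highest total IoU.

    Assumption: the paper only says 'one-to-one set assignment'; brute force
    is used here for clarity and is only practical for a handful of masks.
    """
    n_pred, n_gt = len(pred_masks), len(gt_masks)
    if n_pred <= n_gt:
        candidates = (list(enumerate(perm))
                      for perm in permutations(range(n_gt), n_pred))
    else:
        candidates = ([(p, g) for g, p in enumerate(perm)]
                      for perm in permutations(range(n_pred), n_gt))
    best_pairs, best_score = [], -1.0
    for pairs in candidates:
        score = sum(mask_iou(pred_masks[p], gt_masks[g]) for p, g in pairs)
        if score > best_score:
            best_pairs, best_score = pairs, score
    return best_pairs


def mask_recall(pred_masks, pred_texts, gt_masks, gt_texts, text_sim,
                iou_thr=0.5, sim_thr=0.5):
    """Count a pair as TP only if both IoU and text similarity exceed 0.5."""
    if not gt_masks:
        return 0.0
    true_positives = 0
    for p, g in one_to_one_assignment(pred_masks, gt_masks):
        if mask_iou(pred_masks[p], gt_masks[g]) <= iou_thr:
            continue  # fails the first tier, never reaches the text check
        if text_sim(pred_texts[p], gt_texts[g]) > sim_thr:
            true_positives += 1
    return true_positives / len(gt_masks)


def token_jaccard(a, b):
    """Toy placeholder for BERT similarity (hypothetical, for illustration)."""
    ta, tb = set(a.lower().split()), set(b.lower().split())
    return len(ta & tb) / len(ta | tb) if ta | tb else 0.0
```

In practice `text_sim` would wrap a sentence-embedding similarity; swapping it in does not change the rest of the computation.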
In all of our experiments, we use the Vicuna LLM [60] with 7B parameters. The design of the region encoder is motivated by GPT4RoI [57], and the grounding image encoder and pixel decoder are inspired by LISA [21]. The V-L and L-P layers are implemented as two-layer MLPs with GELU activation, as in LLaVA-v1.5 [28]. We implement GLaMM in PyTorch and use DeepSpeed ZeRO-2 optimization during training.
Specifically, our model is trained using two types of losses: an auto-regressive cross-entropy loss for text generation, and a linear combination of per-pixel binary cross-entropy loss and DICE loss for segmentation. During training, the global image encoder and grounding image encoder are kept frozen, while the region encoder, projection layers (V-L and L-P), and pixel decoder are fully finetuned; the LLM is finetuned with LoRA using α = 8. Our code and pretrained models will be publicly released.
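As a rough illustration of the segmentation term, the sketch below combines per-pixel binary cross-entropy with a DICE loss over flattened mask probabilities. The weights `w_bce` and `w_dice` are hypothetical placeholders, since the text does not state the coefficients of the linear combination, and the text-generation loss is omitted. Plain Python lists are used instead of PyTorch tensors to keep the example self-contained.

```python
import math


def bce_loss(probs, targets, eps=1e-7):
    """Mean per-pixel binary cross-entropy over flattened mask probabilities."""
    total = 0.0
    for p, t in zip(probs, targets):
        p = min(max(p, eps), 1.0 - eps)  # clamp to avoid log(0)
        total -= t * math.log(p) + (1 - t) * math.log(1.0 - p)
    return total / len(probs)


def dice_loss(probs, targets, eps=1e-6):
    """One minus the soft DICE coefficient between prediction and target."""
    intersection = sum(p * t for p, t in zip(probs, targets))
    denom = sum(probs) + sum(targets)
    return 1.0 - (2.0 * intersection + eps) / (denom + eps)


def segmentation_loss(probs, targets, w_bce=1.0, w_dice=1.0):
    """Linear combination of BCE and DICE; the weights are assumed values."""
    return w_bce * bce_loss(probs, targets) + w_dice * dice_loss(probs, targets)
```

BCE penalizes each pixel independently, while DICE measures overlap at the mask level, which is why the two are commonly paired for segmentation.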
A.2.1 Pretraining on GranD
During pretraining, GLaMM is trained on the GranD dataset for referring expression segmentation, region-level captioning, image-level captioning, and grounded conversation generation (GCG) tasks simultaneously. We use a batch size of 160 and train for a total of 35K iterations. We use LoRA-8 to efficiently adapt the LLM and initialize pretraining from GPT4RoI [57] for faster convergence. In the experiment tables in Section 5, we refer to the model obtained after pretraining on GranD as GLaMM (ZS).
We finetune GLaMM on multiple downstream tasks, including GCG, referring expression segmentation, region-level captioning, and image-level captioning. For GCG, we finetune our model on the GranDf dataset, using a batch size of 160 and training for 5K iterations in total. Note that GranDf is a combination of multiple open-source datasets that we repurposed for the GCG task using GPT4 [34]. Please refer to Appendix D for the prompts designed to query GPT4 when constructing GranDf, along with dataset visualizations.
For referring expression segmentation, we finetune GLaMM on the refCOCO, refCOCO+, and refCOCOg datasets. We denote this model as GLaMM (FT) in Tab. 4. Similarly, for region-level captioning, GLaMM (FT) is finetuned on the refCOCOg and Visual Genome datasets. For image-level captioning, we finetune GLaMM on the LLaVA-Instruct-150K [29] dataset. For LLaVA-bench, the model is finetuned on the LLaVA-Instruct-80K [29] instruction set. We use eight NVIDIA A100-40GB GPUs in all of our pretraining and finetuning experiments.
Our automated annotation pipeline incorporates diverse state-of-the-art models at various levels. For Level-1, we use Tag2Text [14] and RAM [58] for image tagging, CoDETR [62], EVAv02 [7], OWL-ViT [33], and POMP [40] for object localization, GRiT [48] and GPT4RoI [57] for attribute generation, and MiDAS [39] for depth estimation. Level-2 leverages BLIP-2 [24] and LLaVA-v1.5 [28, 29] for scene descriptions and landmark categorization, SpaCy [11] for phrase extraction, and MDETR [15] for phrase grounding. For both Level-3 and Level-4, we use Vicuna-v1.5 [60] with 13B parameters, supplemented with in-context examples. Please refer to Appendix A.4 for further details on implementation and LLM prompts used across different pipeline levels.
We design a fully automated dataset annotation pipeline using multiple hierarchical levels in the visual domain to construct the GranD dataset. The segmentation masks for most of the regions are obtained from SAM [18] annotations by comparing our detected labeled regions with the SAM-provided class-agnostic regions. For the remaining regions that do not match any of the SAM regions, we run the SAM model with a bounding box query to obtain masks.
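The mask assignment described above can be sketched as a match-then-fallback routine. Several details here are assumptions rather than the authors' implementation: regions are compared by bounding-box IoU, the threshold `iou_thr` is a hypothetical value, and `box_to_mask` is a hypothetical callable standing in for a SAM inference call with a box prompt.

```python
def box_iou(a, b):
    """IoU of two boxes given as (x1, y1, x2, y2)."""
    inter_w = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    inter_h = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = inter_w * inter_h
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def assign_masks(detections, sam_regions, box_to_mask, iou_thr=0.5):
    """Attach a mask to each labeled detection.

    detections:  list of {"label": str, "box": (x1, y1, x2, y2)}
    sam_regions: list of {"box": (x1, y1, x2, y2), "mask": any}
    box_to_mask: hypothetical fallback, e.g. SAM queried with the box
    """
    results = []
    for det in detections:
        # Find the class-agnostic SAM region that best overlaps this detection.
        best = max(sam_regions,
                   key=lambda r: box_iou(det["box"], r["box"]),
                   default=None)
        if best is not None and box_iou(det["box"], best["box"]) >= iou_thr:
            mask, source = best["mask"], "sam_annotation"
        else:
            # No existing SAM region matches, so query with the box instead.
            mask, source = box_to_mask(det["box"]), "sam_box_query"
        results.append({"label": det["label"], "mask": mask, "source": source})
    return results
```

Reusing existing SAM annotations where possible keeps the fallback inference limited to regions that have no match.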
Our automated annotation pipeline uses only open-source models and incorporates a feedback loop based on chain-of-thought prompting via an LLM. Because it does not require feedback from a human in the loop, it can be scaled to generate dense, noisy labels for a larger number of images, which can then be used to pretrain a larger LMM. Given sufficient compute, this could be a step towards building a larger, generic large multi-modal model. We will release our GranD dataset along with the implementation of our automated dataset annotation pipeline for further research. Below we present the LLM prompts we use at different levels of our automated dataset annotation pipeline.
A.4.1 LLM Prompts and In-context Learning
Landmark categorization: We use the LLaVA-v1.5-13B [28] model to assign landmark categories to each image. Please refer to Tab. 7 for the primary and fine categories used.
Dense Captioning: We arrange objects, attributes, and relationships hierarchically to construct a visual scene graph, which is used to query the Vicuna-v1.5-13B [60] model along with in-context examples to generate dense captions. The designed prompt is shown in Fig. 6 (a).
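To make the idea of a hierarchical scene graph concrete, the sketch below nests attributes and relationships under each object and serializes the result next to in-context examples. The field names, the JSON layout, and the prompt wording are all hypothetical placeholders for illustration; the prompt actually used is the one in Fig. 6 (a), which is not reproduced here.

```python
import json


def build_scene_graph(objects, relationships):
    """Nest attributes and outgoing relationships under each object.

    objects:       list of {"id": int, "label": str, "attributes": [str, ...]}
    relationships: list of (subject_id, predicate, object_id) triples
    """
    nodes = {
        o["id"]: {
            "label": o["label"],
            "attributes": list(o.get("attributes", [])),
            "relationships": [],
        }
        for o in objects
    }
    for subj, predicate, obj in relationships:
        nodes[subj]["relationships"].append(
            {"predicate": predicate, "object": nodes[obj]["label"]}
        )
    return {"objects": list(nodes.values())}


def build_prompt(scene_graph, in_context_examples):
    """Place (graph, caption) examples before the query graph.

    The template below is an assumed format, not the prompt from the paper.
    """
    parts = [
        "Scene graph: " + json.dumps(g) + "\nCaption: " + c
        for g, c in in_context_examples
    ]
    parts.append("Scene graph: " + json.dumps(scene_graph) + "\nCaption:")
    return "\n\n".join(parts)
```

The resulting string would be passed to the LLM, with the trailing "Caption:" left open for the model to complete.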
Extra Context: We query the Vicuna-v1.5-13B model to generate additional context about the visual scene. The prompt designed for this purpose is shown in Fig. 6 (b).
This paper is available on arXiv under the CC BY 4.0 DEED license.