Authors:
(1) Shehan Munasinghe, Mohamed bin Zayed University of AI and Equal Contribution;
(2) Rusiru Thushara, Mohamed bin Zayed University of AI and Equal Contribution;
(3) Muhammad Maaz, Mohamed bin Zayed University of AI;
(4) Hanoona Abdul Rasheed, Mohamed bin Zayed University of AI;
(5) Salman Khan, Mohamed bin Zayed University of AI and Australian National University;
(6) Mubarak Shah, University of Central Florida;
(7) Fahad Khan, Mohamed bin Zayed University of AI and Linköping University.
Editor's Note: This is Part 2 of 10 of a study detailing the development of a smarter AI model for videos. Read the rest below.
Recent advancements in Large Multimodal Models (LMMs) [8, 18, 50] and Large Language Models (LLMs) [7, 26, 36] have significantly transformed the artificial intelligence landscape, particularly in natural language processing and multimodal tasks. These breakthroughs have enhanced machine learning models’ ability to understand and generate human-like text, while also enabling more effective integration of various data types like images, sounds and videos with textual information. This progress represents a major leap in creating AI systems that can accurately interpret and interact with a diverse range of content.
Large Language Models (LLMs): The natural language processing (NLP) field has undergone a revolution with the advent of LLMs such as GPT [4], LLaMA [36], OPT [46], and MOSS [27], particularly noted for their zero-shot learning abilities and adaptability. The development of models like InstructGPT [28] and ChatGPT [26] has further propelled advancements in conversational AI and complex query handling, chiefly through instruction tuning. Within the LLaMA framework, the emergence of open-source models such as Alpaca [35] and Vicuna [7] exemplifies how instruction tuning can significantly boost model performance. This shift towards open-source initiatives in language modeling indicates a growing trend towards more accessible and collaborative approaches in the field. In this work, we build on the open-source Vicuna LLM and extend it with multimodal capabilities. We also propose an open-source benchmark for video conversation and reasoning tasks that uses the Vicuna LLM, making it reproducible for fair evaluation.
Large Multimodal Models (LMMs): The field of AI has witnessed significant advancements with the development of vision-language models like CLIP [30], renowned for their impressive zero-shot capabilities gained from training on extensive image-text pairs. These models have proven effective in a variety of applications, from image detection and segmentation [3, 17] to more complex tasks such as 3D modeling and video analysis [23, 31, 33, 37]. The introduction of BLIP-2 marked a pivotal transition, pioneering the integration of image features encoded by a visual encoder with text embeddings, setting the stage for the evolution into Large Multimodal Models (LMMs). This advancement influenced subsequent models like LLaVA [19], InstructBLIP [8], and MiniGPT-4 [50], which further refined image-text feature alignment and instruction tuning. VideoChat [15], Video-ChatGPT [22], and VideoLLaMA [45] represent an extension of these LMMs, moving from image-based to video-based applications, while models such as Otter [14], mPLUG-Owl [42], LLaMA-Adapter [9], and InternGPT [21] continue to push the boundaries of multimodal interaction. Despite these significant strides, achieving robust visual grounding in LMMs remains a challenge and a key area for ongoing research. Further, effective integration of audio signals within LMMs for comprehensive video understanding is an open research question that this work aims to address.
Visual-Language Grounding: Grounded Large Language Models (LLMs) have made notable progress in enhancing visual and language comprehension. A diverse array of models including Kosmos-2 [29], Ferret [43], All-Seeing Model [38], LISA [13], BuboGPT [49], Shikra [5], and GLaMM [32] have employed various methodologies to master complex grounding tasks. These models demonstrate proficiency in tasks like referring expression comprehension and image segmentation, showcasing the advanced image understanding capabilities of LLMs. Methodologically, Kosmos-2, Shikra, and All-Seeing focus predominantly on creating language-based context for visual grounding. In contrast, BuboGPT merges visual elements with language, and LISA leverages vision-language embeddings for producing segmentation masks. Furthermore, GLaMM is adept at generating natural language responses linked with object segmentation masks, facilitating detailed visual-textual interactions. However, challenges remain: LISA’s performance is constrained in multi-object scenarios, and BuboGPT and GLaMM are limited to image-based applications and do not extend to video processing. To this end, we introduce PG-Video-LLaVA, a video conversational model with pixel-level grounding capability. Further, PG-Video-LLaVA incorporates audio transcripts alongside visual and textual data, aiming to provide a more detailed understanding of video content.
This paper is available on arXiv under the CC BY 4.0 DEED license.