Authors:
(1) Shehan Munasinghe, Mohamed bin Zayed University of AI (Equal Contribution);
(2) Rusiru Thushara, Mohamed bin Zayed University of AI (Equal Contribution);
(3) Muhammad Maaz, Mohamed bin Zayed University of AI;
(4) Hanoona Abdul Rasheed, Mohamed bin Zayed University of AI;
(5) Salman Khan, Mohamed bin Zayed University of AI and Australian National University;
(6) Mubarak Shah, University of Central Florida;
(7) Fahad Khan, Mohamed bin Zayed University of AI and Linköping University.
Editor's Note: This is Part 1 of 10 of a study detailing the development of a smarter AI model for videos. Read the rest below.
Extending image-based Large Multimodal Models (LMMs) to videos is challenging due to the inherent complexity of video data. Recent approaches that extend image-based LMMs to videos either lack grounding capabilities (e.g., VideoChat, Video-ChatGPT, Video-LLaMA) or do not utilize audio signals for better video understanding (e.g., Video-ChatGPT). Addressing these gaps, we propose PG-Video-LLaVA, the first LMM with pixel-level grounding capability, which integrates audio cues by transcribing them into text to enrich video-context understanding. Our framework uses an off-the-shelf tracker and a novel grounding module, enabling it to spatially localize objects in videos following user instructions. We evaluate PG-Video-LLaVA using video-based generative and question-answering benchmarks and introduce new benchmarks specifically designed to measure prompt-based object grounding performance in videos. Further, we propose the use of Vicuna over GPT-3.5, as utilized in Video-ChatGPT, for video-based conversation benchmarking, ensuring reproducibility of results, which is a concern given the proprietary nature of GPT-3.5. Our framework builds on the SoTA image-based LLaVA model and extends its advantages to the video domain, delivering promising gains on video-based conversation and grounding tasks.
Recent efforts on Large Multimodal Models (LMMs), spearheaded by GPT-4V [25], allow detailed conversations about images but generally do not scale well to videos. Video data far exceeds other modalities in scale due to its massive volume on social and internet media. Furthermore, extending LMMs to videos is challenging because of their complex dynamics and long temporal context, which must be understood accurately. Although recent approaches towards video-LMMs such as VideoChat [15], Video-LLaMA [45], and Video-ChatGPT [22] have demonstrated capabilities in video comprehension and dialogue, they lack the crucial feature of visual grounding. Visual grounding in videos aims to associate the LMM responses with specific objects within the video input. Addressing this gap, we introduce PG-Video-LLaVA, the first video-LMM capable of localizing objects appearing in LMM responses. This capability enhances interactivity and demonstrates a deeper understanding of video content.
In PG-Video-LLaVA, we address the unique challenges posed by video data. The model is designed to track objects within shorter video clips that maintain consistent camera views, enabling accurate visual grounding across scenes and motions. This tracking links spatio-temporal segments directly to conversational elements, enhancing the model's contextual understanding. A salient feature of PG-Video-LLaVA is its modular design, allowing easy integration with existing grounding modules and the flexibility to adapt to future enhancements in visual grounding technology. Moreover, PG-Video-LLaVA enriches its capabilities by incorporating audio context. It achieves this by converting the video's audio into text the LLM can consume, which is particularly useful in situations where auditory information is essential to the conversation. This inclusion broadens the model's understanding, making it more versatile in interpreting video content.
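To make this modular design concrete, the following is a minimal sketch of how such a pipeline could be wired together: audio is transcribed into text, the video-LMM answers the transcript-augmented question, and object phrases from the answer are grounded and tracked. The component interfaces (transcriber, video_lmm, extract_phrases, ground_and_track) are hypothetical stand-ins, not the paper's actual implementation.

```python
# Hedged sketch of a modular grounded video-LMM pipeline.
# All component names and signatures below are illustrative assumptions,
# not PG-Video-LLaVA's real interfaces.
from dataclasses import dataclass
from typing import Any, Callable, Dict, List


@dataclass
class GroundedAnswer:
    text: str                         # LMM response to the user query
    tracks: Dict[str, List[Any]]      # object phrase -> per-frame boxes/masks


def answer_with_grounding(
    video_frames: List[Any],
    audio_path: str,
    question: str,
    transcriber: Callable[[str], str],                        # audio file -> transcript
    video_lmm: Callable[[List[Any], str], str],               # frames + prompt -> answer
    extract_phrases: Callable[[str], List[str]],              # answer -> noun phrases
    ground_and_track: Callable[[List[Any], str], List[Any]],  # frames + phrase -> track
) -> GroundedAnswer:
    """Modular flow: audio -> text context, video-LMM -> answer, grounder -> tracks."""
    transcript = transcriber(audio_path)        # audio cues surfaced as plain text
    prompt = f"Audio transcript: {transcript}\nQuestion: {question}"
    answer = video_lmm(video_frames, prompt)

    tracks = {}
    for phrase in extract_phrases(answer):      # objects mentioned in the answer
        tracks[phrase] = ground_and_track(video_frames, phrase)
    return GroundedAnswer(text=answer, tracks=tracks)
```

Because each stage is a swappable callable, a different grounding module or tracker can be dropped in without touching the rest of the pipeline, mirroring the flexibility described above.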
Furthermore, this work introduces an improved framework for benchmarking video-based conversational models, pivoting from previous approaches [22] that predominantly used the proprietary GPT-3.5-Turbo model for evaluation. Given that GPT-3.5-Turbo can change at any time and lacks transparency due to its closed-source nature, it presents challenges for reliability and reproducibility. To address this, we propose the use of Vicuna, an open-source LLM, for benchmarking. This shift not only enhances reproducibility but also improves transparency in the evaluation process. We evaluate PG-Video-LLaVA using our improved benchmarks and show notable improvements over existing video conversational models such as Video-ChatGPT [22] and Video-LLaMA [45] on ungrounded dialogue, achieving state-of-the-art (SoTA) performance.
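To picture what such LLM-as-judge benchmarking looks like, the sketch below asks a judge model to compare a predicted answer against a reference and return a verdict and score. The judge_llm callable (e.g., a locally hosted Vicuna endpoint), the prompt wording, and the output schema are assumptions for illustration, not the exact evaluation prompts used in this work.

```python
# Illustrative LLM-as-judge scoring with an open-source model.
# The prompt text, score range, and `judge_llm` interface are assumptions.
import json
from typing import Callable


def score_answer(
    question: str,
    reference: str,
    prediction: str,
    judge_llm: Callable[[str], str],   # prompt -> raw text from, e.g., a local Vicuna
) -> dict:
    prompt = (
        "You are evaluating a video question-answering model.\n"
        f"Question: {question}\n"
        f"Correct answer: {reference}\n"
        f"Predicted answer: {prediction}\n"
        'Respond only with JSON: {"correct": "yes" or "no", "score": 0-5}.'
    )
    raw = judge_llm(prompt)
    try:
        return json.loads(raw)
    except json.JSONDecodeError:
        # Open-source judges occasionally emit malformed JSON; fail conservatively.
        return {"correct": "no", "score": 0}
```

Pinning the judge to a fixed, open-weights checkpoint is what makes such scores reproducible across runs and research groups.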
The key contributions of this work are:
• We propose PG-Video-LLaVA, the first video-based LMM with pixel-level grounding capabilities, featuring a modular design for enhanced flexibility.
• By incorporating audio context, PG-Video-LLaVA significantly enhances its understanding of video content, making it more comprehensive and well suited for scenarios where the audio signal is crucial for video understanding (e.g., dialogues and conversations, news videos, etc.).
• We introduce improved quantitative benchmarks for video-based conversational models. Our benchmarks utilize the open-source Vicuna LLM to ensure better reproducibility and transparency. We also propose benchmarks to evaluate the grounding capabilities of video-based conversational models, as sketched below.
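Spatial grounding is commonly scored by comparing predicted object boxes against ground-truth annotations frame by frame. The helper below computes a mean intersection-over-union (IoU) over annotated frames; it is a generic metric sketch to make the idea concrete, not the exact protocol of the proposed grounding benchmark.

```python
# Generic spatial-grounding score: mean IoU between predicted and ground-truth
# boxes over annotated frames. A metric sketch, not the benchmark's exact protocol.
from typing import Dict, Tuple

Box = Tuple[float, float, float, float]   # (x1, y1, x2, y2)


def box_iou(a: Box, b: Box) -> float:
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0


def mean_track_iou(pred: Dict[int, Box], gt: Dict[int, Box]) -> float:
    """Average IoU over ground-truth frames; frames with no prediction count as 0."""
    if not gt:
        return 0.0
    return sum(box_iou(pred[f], gt[f]) if f in pred else 0.0 for f in gt) / len(gt)
```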
This paper is available on arxiv under CC BY 4.0 DEED license.