Authors:
(1) Shehan Munasinghe, Mohamed bin Zayed University of AI (Equal Contribution);
(2) Rusiru Thushara, Mohamed bin Zayed University of AI (Equal Contribution);
(3) Muhammad Maaz, Mohamed bin Zayed University of AI;
(4) Hanoona Abdul Rasheed, Mohamed bin Zayed University of AI;
(5) Salman Khan, Mohamed bin Zayed University of AI and Australian National University;
(6) Mubarak Shah, University of Central Florida;
(7) Fahad Khan, Mohamed bin Zayed University of AI and Linköping University.
Editor's Note: This is Part 3 of 10 of a study detailing the development of a smarter AI model for videos. Read the rest below.
In this paper, we introduce PG-Video-LLaVA, a novel Large Multimodal Model (LMM) designed to align video and audio representations with a Large Language Model (LLM). This integration equips PG-Video-LLaVA with the capability to proficiently manage both video and audio data in conversational contexts. Additionally, our method integrates a specialized plug-and-play module for effective video grounding (see Figure 2).
In constructing PG-Video-LLaVA, our approach integrates sophisticated mechanisms for aligning video and audio signals with language processing capabilities, thereby facilitating a comprehensive multimodal analysis. Central to our model is an advanced CLIP-based video encoder, which has been specifically adapted to process both the spatial and temporal dimensions of video data. This adaptation enables a deeper understanding of video content, setting PG-Video-LLaVA apart from conventional image-centric models.
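To make the spatio-temporal adaptation concrete, the sketch below shows one common way to pool per-frame CLIP patch features into a joint token sequence (averaging over time for spatial tokens and over space for temporal tokens). The function name, tensor sizes, and pooling scheme are illustrative assumptions, not the paper's exact implementation:

```python
import numpy as np

def spatiotemporal_features(frame_feats: np.ndarray) -> np.ndarray:
    """Pool per-frame patch features into spatio-temporal tokens.

    frame_feats: [T, P, D] array of CLIP patch embeddings
    (T frames, P patches per frame, D channels; sizes are hypothetical).
    """
    temporal = frame_feats.mean(axis=1)  # [T, D]: one token per frame
    spatial = frame_feats.mean(axis=0)   # [P, D]: one token per patch
    # Concatenate along the token axis: [T + P, D]
    return np.concatenate([temporal, spatial], axis=0)

# e.g. 8 frames, 256 patches, 1024-dim features -> (264, 1024) tokens
feats = spatiotemporal_features(np.random.rand(8, 256, 1024))
print(feats.shape)  # (264, 1024)
```

The resulting token sequence can then be projected into the LLM's embedding space, giving the language model both frame-level (temporal) and region-level (spatial) context.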
For training, PG-Video-LLaVA utilizes the VideoInstruct100K [22] dataset, comprising 100K video instructions derived from ActivityNet-200 [11]. This diverse dataset ensures that the model is well-equipped to handle a broad spectrum of video contexts with high accuracy. In addition to visual processing, PG-Video-LLaVA incorporates state-of-the-art audio analysis by leveraging advanced audio transcription techniques, similar to those employed in WhisperX [2] and Whisper-AT [10]. This integration allows the model to process and understand audio inputs effectively, enhancing its overall multimodal interpretation capabilities.
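One lightweight way such a transcript can reach the LLM is simply as extra text in the prompt. The helper below is a minimal sketch of that idea, assuming the transcript has already been produced by an external transcription model; the function name and prompt template are hypothetical, not the paper's exact format:

```python
from typing import Optional

def build_multimodal_prompt(user_query: str, transcript: Optional[str]) -> str:
    """Interleave an optional audio transcript with the user's question.

    The template here is illustrative; real systems typically also prepend
    visual tokens and a system instruction.
    """
    parts = []
    if transcript:  # skip the block entirely for silent or music-only clips
        parts.append(f"Audio transcript: {transcript.strip()}")
    parts.append(f"Question: {user_query.strip()}")
    return "\n".join(parts)

print(build_multimodal_prompt("What is being discussed?", "Welcome to the show."))
```

Keeping the transcript as plain prompt text means the audio pathway needs no extra trainable parameters, which matches the plug-and-play spirit of the design.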
While PG-Video-LLaVA builds on the LLaVA-1.5 [18] framework, it extends it to videos by incorporating spatio-temporal representations, audio understanding, and visual grounding capabilities. Its unique combination of enhanced video encoding, an extensive training dataset, integrated audio processing, and grounding capability marks it as a step forward in the field of LMMs.
This paper is available on arxiv under CC BY 4.0 DEED license.