Authors:

(1) Shehan Munasinghe, Mohamed bin Zayed University of AI and Equal Contribution;
(2) Rusiru Thushara, Mohamed bin Zayed University of AI and Equal Contribution;
(3) Muhammad Maaz, Mohamed bin Zayed University of AI;
(4) Hanoona Abdul Rasheed, Mohamed bin Zayed University of AI;
(5) Salman Khan, Mohamed bin Zayed University of AI and Australian National University;
(6) Mubarak Shah, University of Central Florida;
(7) Fahad Khan, Mohamed bin Zayed University of AI and Linköping University.

Editor's Note: This is Part 3 of 10 of a study detailing the development of a smarter AI model for videos. Read the rest below.

Table of Links

Abstract and 1 Introduction
2. Related Works
3. PG-Video-LLaVA
3.1. Overview
3.2. Architecture
4. Experiments
4.1. Implementation Details
4.2. Stronger Baseline
4.3. Spatial Grounding in Videos
4.4. Zero-Shot Visual Question Answering
5. Conclusion and References

Supplementary Material
A. Audio Modality Integration
B. Visual Grounding: Quantitative Evaluation
C. Qualitative Results for Visual Grounding
D. Quantitative Evaluations of Video-based Conversation Performance

3.1. Overview

In this paper, we introduce PG-Video-LLaVA, a novel Large Multimodal Model (LMM) designed to align video and audio representations with a Large Language Model (LLM). This integration equips PG-Video-LLaVA to handle both video and audio data proficiently in conversational contexts. Additionally, our method integrates a specialized plug-and-play module for effective video grounding (see Figure 2).

In constructing PG-Video-LLaVA, our approach integrates mechanisms for aligning video and audio signals with language processing capabilities, thereby facilitating comprehensive multimodal analysis. Central to our model is an advanced CLIP-based video encoder, specifically adapted to process both the spatial and temporal dimensions of video data. This adaptation enables a deeper understanding of video content, setting PG-Video-LLaVA apart from conventional image-centric models. For training, PG-Video-LLaVA uses the VideoInstruct100K [22] dataset, comprising 100K video instructions derived from ActivityNet-200 [11]. This diverse dataset equips the model to handle a broad spectrum of video contexts with high accuracy. In addition to visual processing, PG-Video-LLaVA incorporates state-of-the-art audio analysis by leveraging advanced audio transcription techniques, similar to those employed in WhisperX [2] and Whisper-AT [10]. This integration allows the model to process and understand audio inputs effectively, enhancing its overall multimodal interpretation capabilities. While PG-Video-LLaVA is built on the LLaVA-1.5 [18] framework, it is extended to videos, incorporating spatio-temporal representations, audio understanding, and visual grounding capabilities. Its combination of enhanced video encoding, an extensive training dataset, integrated audio processing, and grounding capability marks a step forward in the field of LMMs.
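The overview above only names the spatio-temporal adaptation of the CLIP encoder; the details follow in Section 3.2. As a rough, hypothetical sketch of how frame-level CLIP features can be condensed into video tokens for an LLM, the snippet below averages patch features along the spatial and temporal axes and projects them with a linear adapter. The tensor shapes, the pooling scheme, and the project_to_llm layer are illustrative assumptions, not the paper's exact design.

```python
import torch
import torch.nn as nn

def spatio_temporal_pool(frame_features: torch.Tensor) -> torch.Tensor:
    """Pool per-frame CLIP patch features into video-level tokens.

    frame_features: (T, N, D) tensor -- T frames, N patch tokens per frame,
    D-dimensional CLIP embeddings (shapes are illustrative assumptions).
    Returns (T + N, D): T temporal tokens (spatial average per frame)
    followed by N spatial tokens (temporal average per patch location).
    """
    temporal_tokens = frame_features.mean(dim=1)   # (T, D): one token per frame
    spatial_tokens = frame_features.mean(dim=0)    # (N, D): one token per patch position
    return torch.cat([temporal_tokens, spatial_tokens], dim=0)

# Hypothetical projection into the LLM embedding space, analogous to the
# linear adapter used in LLaVA-style models (dimensions are placeholders).
project_to_llm = nn.Linear(768, 4096)

frames = torch.randn(8, 256, 768)                  # 8 frames x 256 patches x 768-d CLIP features
video_tokens = project_to_llm(spatio_temporal_pool(frames))
print(video_tokens.shape)                          # torch.Size([264, 4096])
```

Pooling along both axes keeps the token count proportional to T + N rather than T x N, which is one common way to make video inputs fit within an LLM context window.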
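On the audio side, the overview describes transcription-based understanding in the spirit of WhisperX [2] and Whisper-AT [10]. A minimal sketch of that idea, using the open-source openai-whisper package as a stand-in for the paper's pipeline, is shown below; the model size, file path, and prompt wording are assumptions for illustration only.

```python
# Minimal sketch of the audio branch: transcribe the soundtrack and fold the
# transcript into the text prompt given to the LLM. Uses openai-whisper as a
# stand-in for the WhisperX / Whisper-AT pipeline referenced in the paper.
import whisper

def build_multimodal_prompt(video_path: str, user_question: str) -> str:
    model = whisper.load_model("base")        # small model chosen for illustration
    result = model.transcribe(video_path)     # whisper extracts the audio track via ffmpeg
    transcript = result["text"].strip()
    return (
        "The following is the audio transcript of the video:\n"
        f"{transcript}\n\n"
        f"User question: {user_question}"
    )

# Example usage (hypothetical file name). The projected video tokens from the
# visual encoder would be passed to the LLM separately; only the text side is
# sketched here.
prompt = build_multimodal_prompt("sample_video.mp4", "What is the person doing?")
```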
This paper is available on arxiv under CC BY 4.0 DEED license.