Authors:
(1) Shehan Munasinghe, Mohamed bin Zayed University of AI (Equal Contribution);
(2) Rusiru Thushara, Mohamed bin Zayed University of AI (Equal Contribution);
(3) Muhammad Maaz, Mohamed bin Zayed University of AI;
(4) Hanoona Abdul Rasheed, Mohamed bin Zayed University of AI;
(5) Salman Khan, Mohamed bin Zayed University of AI and Australian National University;
(6) Mubarak Shah, University of Central Florida;
(7) Fahad Khan, Mohamed bin Zayed University of AI and Linköping University.
Editor's Note: This is Part 1 of 10 of a study detailing the development of a smarter AI model for videos. Read the rest below.
Extending image-based Large Multimodal Models (LMMs) to videos is challenging due to the inherent complexity of video data. Recent approaches that extend image-based LMMs to videos either lack grounding capabilities (e.g., VideoChat, Video-ChatGPT, Video-LLaMA) or do not utilize audio signals for better video understanding (e.g., Video-ChatGPT). Addressing these gaps, we propose PG-Video-LLaVA, the first LMM with pixel-level grounding capability, which integrates audio cues by transcribing them into text to enrich video-context understanding. Our framework uses an off-the-shelf tracker and a novel grounding module, enabling it to spatially localize objects in videos following user instructions. We evaluate PG-Video-LLaVA using video-based generative and question-answering benchmarks and introduce new benchmarks specifically designed to measure prompt-based object grounding performance in videos. Further, we propose the use of Vicuna over GPT-3.5, as utilized in Video-ChatGPT, for video-based conversation benchmarking, ensuring reproducibility of results, which is a concern given the proprietary nature of GPT-3.5. Our framework builds on the SoTA image-based LLaVA model and extends its advantages to the video domain, delivering promising gains on video-based conversation and grounding tasks.
Recent efforts on Large Multimodal Models (LMMs), spearheaded by GPT-4V [25], allow detailed conversations about images but generally do not scale well to videos. Video data far exceeds other modalities in scale due to its massive volume on social and internet media. Furthermore, extending LMMs to videos is challenging because of their complex dynamics and long temporal context, which must be understood accurately. Although recent approaches towards video-LMMs such as VideoChat [15], Video-LLaMA [45], and Video-ChatGPT [22] have demonstrated capabilities in video comprehension and dialogue, they lack the crucial feature of visual grounding. Visual grounding in videos aims to associate the LMM responses with specific objects within the video input. Addressing this gap, we introduce PG-Video-LLaVA, the first video-LMM capable of localizing objects appearing in LMM responses. This capability enhances interactivity and demonstrates a deeper understanding of video content.
In PG-Video-LLaVA, we address the unique challenges posed by video data. The model is designed to track objects within shorter video clips that maintain consistent camera views, enabling accurate visual grounding across scenes and motions. This tracking links spatio-temporal segments directly to conversational elements, enhancing the model's contextual understanding. A salient feature of PG-Video-LLaVA is its modular design, allowing easy integration with existing grounding modules and the flexibility to adapt to future enhancements in visual grounding technology. Moreover, PG-Video-LLaVA enriches its capabilities by incorporating audio context. It achieves this by converting the video's audio into text the LLM can consume, which is particularly useful in situations where auditory information is essential to the conversation. This inclusion broadens the model's understanding, making it more versatile in interpreting video content.
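To make this modular design concrete, the following is a minimal sketch of how such a pipeline could be wired together: audio is transcribed into text, the video-LMM answers the transcript-augmented question, and object phrases from the answer are grounded and tracked. The component interfaces (transcriber, video_lmm, extract_phrases, ground_and_track) are hypothetical stand-ins, not the paper's actual implementation.

```python
# Hedged sketch of a modular grounded video-LMM pipeline.
# All component names and signatures below are illustrative assumptions,
# not PG-Video-LLaVA's real interfaces.
from dataclasses import dataclass
from typing import Any, Callable, Dict, List


@dataclass
class GroundedAnswer:
    text: str                         # LMM response to the user query
    tracks: Dict[str, List[Any]]      # object phrase -> per-frame boxes/masks


def answer_with_grounding(
    video_frames: List[Any],
    audio_path: str,
    question: str,
    transcriber: Callable[[str], str],                        # audio file -> transcript
    video_lmm: Callable[[List[Any], str], str],               # frames + prompt -> answer
    extract_phrases: Callable[[str], List[str]],              # answer -> noun phrases
    ground_and_track: Callable[[List[Any], str], List[Any]],  # frames + phrase -> track
) -> GroundedAnswer:
    """Modular flow: audio -> text context, video-LMM -> answer, grounder -> tracks."""
    transcript = transcriber(audio_path)        # audio cues surfaced as plain text
    prompt = f"Audio transcript: {transcript}\nQuestion: {question}"
    answer = video_lmm(video_frames, prompt)

    tracks = {}
    for phrase in extract_phrases(answer):      # objects mentioned in the answer
        tracks[phrase] = ground_and_track(video_frames, phrase)
    return GroundedAnswer(text=answer, tracks=tracks)
```

Because each stage is a swappable callable, a different grounding module or tracker can be dropped in without touching the rest of the pipeline, mirroring the flexibility described above.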
Furthermore, this work introduces an improved framework for benchmarking video-based conversational models, pivoting from previous approaches [22] that predominantly used the proprietary GPT-3.5-Turbo model for evaluation. Given that GPT-3.5-Turbo can change at any time and lacks transparency due to its closed-source nature, it presents challenges for reliability and reproducibility. To address this, we propose the use of Vicuna, an open-source LLM, for benchmarking. This shift not only enhances reproducibility but also improves transparency in the evaluation process. We evaluate PG-Video-LLaVA using our improved benchmarks and show notable improvements over existing video conversational models such as Video-ChatGPT [22] and Video-LLaMA [45] on ungrounded dialogue, achieving state-of-the-art (SoTA) performance.
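To picture what such LLM-as-judge benchmarking looks like, the sketch below asks a judge model to compare a predicted answer against a reference and return a verdict and score. The judge_llm callable (e.g., a locally hosted Vicuna endpoint), the prompt wording, and the output schema are assumptions for illustration, not the exact evaluation prompts used in this work.

```python
# Illustrative LLM-as-judge scoring with an open-source model.
# The prompt text, score range, and `judge_llm` interface are assumptions.
import json
from typing import Callable


def score_answer(
    question: str,
    reference: str,
    prediction: str,
    judge_llm: Callable[[str], str],   # prompt -> raw text from, e.g., a local Vicuna
) -> dict:
    prompt = (
        "You are evaluating a video question-answering model.\n"
        f"Question: {question}\n"
        f"Correct answer: {reference}\n"
        f"Predicted answer: {prediction}\n"
        'Respond only with JSON: {"correct": "yes" or "no", "score": 0-5}.'
    )
    raw = judge_llm(prompt)
    try:
        return json.loads(raw)
    except json.JSONDecodeError:
        # Open-source judges occasionally emit malformed JSON; fail conservatively.
        return {"correct": "no", "score": 0}
```

Pinning the judge to a fixed, open-weights checkpoint is what makes such scores reproducible across runs and research groups.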
The key contributions of this work are:
• We propose PG-Video-LLaVA, the first video-based LMM with pixel-level grounding capabilities, featuring a modular design for enhanced flexibility.
• By incorporating audio context, PG-Video-LLaVA significantly enhances its understanding of video content, making it more comprehensive and well suited for scenarios where the audio signal is crucial for video understanding (e.g., dialogues and conversations, news videos, etc.).
• We introduce improved quantitative benchmarks for video-based conversational models. Our benchmarks utilize the open-source Vicuna LLM to ensure better reproducibility and transparency. We also propose benchmarks to evaluate the grounding capabilities of video-based conversational models, as sketched below.
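Spatial grounding is commonly scored by comparing predicted object boxes against ground-truth annotations frame by frame. The helper below computes a mean intersection-over-union (IoU) over annotated frames; it is a generic metric sketch to make the idea concrete, not the exact protocol of the proposed grounding benchmark.

```python
# Generic spatial-grounding score: mean IoU between predicted and ground-truth
# boxes over annotated frames. A metric sketch, not the benchmark's exact protocol.
from typing import Dict, Tuple

Box = Tuple[float, float, float, float]   # (x1, y1, x2, y2)


def box_iou(a: Box, b: Box) -> float:
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0


def mean_track_iou(pred: Dict[int, Box], gt: Dict[int, Box]) -> float:
    """Average IoU over ground-truth frames; frames with no prediction count as 0."""
    if not gt:
        return 0.0
    return sum(box_iou(pred[f], gt[f]) if f in pred else 0.0 for f in gt) / len(gt)
```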
This paper is available on arxiv under CC BY 4.0 DEED license.