
AI Model Developed by UAE Researchers Able to Identify Objects in Videos


Too Long; Didn't Read

Researchers in the UAE have developed an AI model that can find and focus on objects in videos, outperforming other models at this task.

Authors:

(1) Shehan Munasinghe, Mohamed bin Zayed University of AI (Equal Contribution);

(2) Rusiru Thushara, Mohamed bin Zayed University of AI (Equal Contribution);

(3) Muhammad Maaz, Mohamed bin Zayed University of AI;

(4) Hanoona Abdul Rasheed, Mohamed bin Zayed University of AI;

(5) Salman Khan, Mohamed bin Zayed University of AI and Australian National University;

(6) Mubarak Shah, University of Central Florida;

(7) Fahad Khan, Mohamed bin Zayed University of AI and Linköping University.

Editor's Note: This is Part 7 of 10 of a study detailing the development of a smarter AI model for videos. Read the rest below.


Supplementary Material

4.3. Spatial Grounding in Videos

To quantitatively assess PG-Video-LLaVA's spatial grounding capability, we conducted evaluations using two benchmarks derived from the test sets of the VidSTG [48] and HC-STVG [34] datasets.


Table 1. Performance benchmarking of video-based conversational models. Comparative performance evaluation of PG-Video-LLaVA against various models using the benchmarking framework from Video-ChatGPT [22]. The metrics include correctness, detail orientation, contextual understanding, temporal understanding, and consistency. The updated assessment pipeline incorporates Vicuna-13b-v1.5 [7] for enhanced reproducibility, replacing GPT-3.5-Turbo. Results indicate that PG-Video-LLaVA achieves favourable performance across all metrics, particularly in contextual and temporal understanding, as compared to foundational models and recent advancements in the field.


Figure 3. Qualitative comparison of Video-ChatGPT vs. PG-Video-LLaVA (Ours): Qualitative analysis of video descriptions generated by Video-ChatGPT, PG-Video-LLaVA (7B), and PG-Video-LLaVA (13B). The improvement in model performance is evident, with gains in the accuracy of information, richness of descriptive detail, and alignment with the video's context and sequence of events as we move from the baseline Video-ChatGPT to the more advanced PG-Video-LLaVA (13B) model.


Due to the novelty of integrating spatial grounding within video-conversational models, we highlight the modular nature of our grounding pipeline, which can be incorporated with other state-of-the-art video-conversation models.
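To make this modularity concrete, the sketch below shows one way such a grounding pipeline could be wired to an arbitrary video-conversational model. It is a minimal illustration: the interfaces and names (generate_response, ground_expression, vicuna_extract) are assumptions for exposition, not the actual PG-Video-LLaVA API.

```python
from typing import List, Protocol, Tuple

Box = Tuple[float, float, float, float]  # (x1, y1, x2, y2) for one frame


class VideoConversationModel(Protocol):
    """Any video-conversational model that answers free-form prompts."""
    def generate_response(self, video_path: str, prompt: str) -> str: ...


class GroundingPipeline(Protocol):
    """Modular grounder: maps a referring expression to per-frame boxes."""
    def ground_expression(self, video_path: str, expression: str) -> List[Box]: ...


def answer_and_ground(model: VideoConversationModel,
                      grounder: GroundingPipeline,
                      vicuna_extract,  # callable: answer text -> referring expression
                      video_path: str,
                      prompt: str) -> Tuple[str, List[Box]]:
    """Answer the prompt, then spatially ground the key phrase in the video."""
    answer = model.generate_response(video_path, prompt)
    expression = vicuna_extract(answer)
    boxes = grounder.ground_expression(video_path, expression)
    return answer, boxes
```

Because the grounder only consumes text and video, any conversational backbone that produces free-form answers could, in principle, be dropped into the same slot.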


Figure 4. Qualitative Results for Video Grounding: Visual representation of PG-Video-LLaVA's grounding capability. The highlighted regions in each video frame indicate the model's ability to identify and spatially locate key subjects mentioned in the textual description, such as the giraffe, the statue, and the gymnast on a balance beam.


Table 2. Performance of PG-Video-LLaVA and other models on the spatial grounding task: Evaluated using the VidSTG and HC-STVG benchmarks, the results demonstrate PG-Video-LLaVA's favorable spatial grounding capabilities, as evidenced by its ability to generate accurate descriptive responses and effectively locate referring expressions within video frames. The table shows the model's progress, particularly in the 13B version, showcasing its performance among other SoTA video-conversational models.


For the VidSTG dataset, we selectively processed interrogative prompts to assess the grounding accuracy. The model generates descriptive textual responses to these prompts, from which Vicuna-13b-v1.5 extracts relevant referring expressions. These expressions are then spatially grounded in the video frames using our grounding pipeline. For the HC-STVG dataset, interrogative prompts are first mined from the text captions using Vicuna and then used similarly to the VidSTG prompts.
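The prompt-preparation step described above could be sketched as follows. The annotation field names and the vicuna_generate wrapper are hypothetical placeholders, since the exact data format and mining prompt are not given in this section.

```python
def prepare_prompts(dataset_name, annotations, vicuna_generate):
    """Build evaluation prompts, roughly as described above (illustrative only).

    - VidSTG: keep only the interrogative prompts already in the annotations.
    - HC-STVG: mine an interrogative prompt from each caption with Vicuna.
    `vicuna_generate` is an assumed callable wrapping Vicuna-13b-v1.5.
    """
    prompts = []
    for sample in annotations:
        if dataset_name == "vidstg":
            # Assumed field names; VidSTG annotations distinguish prompt types.
            if sample["question_type"] == "interrogative":
                prompts.append((sample["video"], sample["question"], sample["boxes"]))
        elif dataset_name == "hc-stvg":
            # Turn the declarative caption into a question before evaluation.
            question = vicuna_generate(
                "Rewrite this caption as a question about the subject it "
                f"describes: {sample['caption']}"
            )
            prompts.append((sample["video"], question, sample["boxes"]))
    return prompts
```

Each prepared prompt can then be passed through the answer-and-ground flow sketched earlier to obtain per-frame boxes for scoring.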


The results shown in Table 2 compare PG-Video-LLaVA with alternative methods on the same benchmarks and demonstrate the model's improved ability to answer questions accurately, which in turn leads to stronger spatial grounding performance.
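Spatial grounding on benchmarks such as VidSTG and HC-STVG is commonly scored by the overlap between predicted and annotated boxes. The helper below computes a plain per-frame intersection-over-union; it is a generic formulation for illustration, not necessarily the exact metric behind Table 2.

```python
def box_iou(box_a, box_b):
    """Intersection-over-Union of two (x1, y1, x2, y2) boxes."""
    x1 = max(box_a[0], box_b[0])
    y1 = max(box_a[1], box_b[1])
    x2 = min(box_a[2], box_b[2])
    y2 = min(box_a[3], box_b[3])
    inter = max(0.0, x2 - x1) * max(0.0, y2 - y1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


# Example: two partially overlapping boxes on one frame
print(box_iou((10, 10, 60, 60), (30, 30, 80, 80)))  # ~0.22
```

Averaging such per-frame scores over the annotated frames of each clip gives a single number per video, which can then be aggregated across a benchmark.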


The qualitative results shown in Figure 4 emphasize the model’s refined spatial grounding precision. The accurate overlay of masks on the subjects within the videos confirms the model’s adeptness at correlating textual descriptors with visual elements, a critical aspect of contextual comprehension. This refined ability is crucial for applications that integrate visual data with language, improving the model’s utility in environments that demand rich, interactive visual and linguistic processing.


This paper is available on arxiv under CC BY 4.0 DEED license.