Authors:
(1) Shehan Munasinghe, Mohamed bin Zayed University of AI and Equal Contribution;
(2) Rusiru Thushara, Mohamed bin Zayed University of AI and Equal Contribution;
(3) Muhammad Maaz, Mohamed bin Zayed University of AI;
(4) Hanoona Abdul Rasheed, Mohamed bin Zayed University of AI;
(5) Salman Khan, Mohamed bin Zayed University of AI and Australian National University;
(6) Mubarak Shah, University of Central Florida;
(7) Fahad Khan, Mohamed bin Zayed University of AI and Linköping University.
Editor's Note: This is Part 7 of 10 of a study detailing the development of a smarter AI model for videos. Read the rest below.
Supplementary Material
To quantitatively assess PG-Video-LLaVA’s spatial grounding capability, we conducted evaluations using two benchmarks derived from the test sets of the VidSTG [48] and HC-STVG [34] datasets. Because integrating spatial grounding within video-conversational models is novel, we highlight the modular nature of our grounding pipeline, which can be incorporated with other state-of-the-art video conversation models. For the VidSTG dataset, we selectively process the interrogative prompts to assess grounding accuracy. The model generates descriptive textual responses to these prompts, from which Vicuna-13b-v1.5 extracts the relevant referring expressions. These expressions are then spatially grounded in the video frames using our grounding pipeline. For the HC-STVG dataset, interrogative prompts are first mined from the text captions using Vicuna and then used in the same way as the VidSTG prompts.
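To make this evaluation flow concrete, the sketch below traces a single sample through the three stages (answering the interrogative prompt, extracting the referring expression, grounding it across frames). This is only an illustration: the function names and signatures are hypothetical placeholders, not the actual PG-Video-LLaVA, Vicuna, or grounding-module interfaces.

```python
# Hypothetical sketch of the per-sample grounding evaluation loop.
# All three helper functions are placeholders standing in for the model,
# the Vicuna-13b-v1.5 extraction step, and the grounding pipeline.

from typing import Dict, List


def answer_question(video_path: str, question: str) -> str:
    """Placeholder: query the video-conversation model with an interrogative prompt."""
    raise NotImplementedError


def extract_referring_expression(answer: str) -> str:
    """Placeholder: use Vicuna-13b-v1.5 to pull the referred subject out of the answer."""
    raise NotImplementedError


def ground_expression(video_path: str, expression: str) -> List[Dict]:
    """Placeholder: run the grounding pipeline; returns one box/mask per frame."""
    raise NotImplementedError


def evaluate_sample(video_path: str, question: str) -> List[Dict]:
    answer = answer_question(video_path, question)        # descriptive textual response
    expression = extract_referring_expression(answer)     # e.g. "the man in the red shirt"
    return ground_expression(video_path, expression)      # per-frame spatial predictions
```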
The results in Table 2 compare PG-Video-LLaVA with alternative methods on the same benchmarks, demonstrating our model’s improved ability to answer questions accurately, which in turn yields better spatial grounding performance.
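Benchmarks such as VidSTG and HC-STVG are commonly scored by comparing predicted and ground-truth boxes with intersection-over-union. As a minimal sketch, assuming boxes in [x1, y1, x2, y2] format, a per-frame IoU averaged over a clip could be computed as follows; the exact metric reported in Table 2 may differ.

```python
import numpy as np


def box_iou(pred: np.ndarray, gt: np.ndarray) -> float:
    """IoU of two boxes given as [x1, y1, x2, y2]."""
    x1, y1 = np.maximum(pred[:2], gt[:2])
    x2, y2 = np.minimum(pred[2:], gt[2:])
    inter = max(0.0, x2 - x1) * max(0.0, y2 - y1)
    area_p = (pred[2] - pred[0]) * (pred[3] - pred[1])
    area_g = (gt[2] - gt[0]) * (gt[3] - gt[1])
    union = area_p + area_g - inter
    return float(inter / union) if union > 0 else 0.0


def mean_clip_iou(pred_boxes: list, gt_boxes: list) -> float:
    """Average per-frame IoU over the annotated frames of a clip."""
    ious = [box_iou(np.asarray(p, float), np.asarray(g, float))
            for p, g in zip(pred_boxes, gt_boxes)]
    return float(np.mean(ious)) if ious else 0.0
```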
The qualitative results shown in Figure 4 emphasize the model’s refined spatial grounding precision. The accurate overlay of masks on the subjects within the videos confirms the model’s adeptness at correlating textual descriptors with visual elements, a critical aspect of contextual comprehension. This refined ability is crucial for applications that integrate visual data with language, improving the model’s utility in environments that demand rich, interactive visual and linguistic processing.
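For readers who want to reproduce overlays of this kind, a minimal OpenCV sketch that blends a per-frame binary mask onto a BGR frame is shown below; the mask source, color, and blending weight are assumptions for illustration, not the paper's exact visualization code.

```python
import cv2
import numpy as np


def overlay_mask(frame: np.ndarray, mask: np.ndarray,
                 color=(0, 0, 255), alpha: float = 0.5) -> np.ndarray:
    """Blend a binary mask onto a BGR frame for qualitative inspection."""
    overlay = frame.copy()
    overlay[mask.astype(bool)] = color  # paint the masked pixels with the chosen color
    return cv2.addWeighted(overlay, alpha, frame, 1.0 - alpha, 0.0)


# Usage (hypothetical file names):
# frame = cv2.imread("frame_000.jpg")
# blended = overlay_mask(frame, mask)
# cv2.imwrite("frame_000_overlay.jpg", blended)
```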
This paper is available on arxiv under CC BY 4.0 DEED license.