Authors:
(1) Shehan Munasinghe, Mohamed bin Zayed University of AI and Equal Contribution;
(2) Rusiru Thushara, Mohamed bin Zayed University of AI and Equal Contribution;
(3) Muhammad Maaz, Mohamed bin Zayed University of AI;
(4) Hanoona Abdul Rasheed, Mohamed bin Zayed University of AI;
(5) Salman Khan, Mohamed bin Zayed University of AI and Australian National University;
(6) Mubarak Shah, University of Central Florida;
(7) Fahad Khan, Mohamed bin Zayed University of AI and Linköping University.
Editor's Note: This is Part 8 of 10 of a study detailing the development of a smarter AI model for videos. Read the rest below.
Supplementary Material
For PG-Video-LLaVA, zero-shot question-answering (QA) capabilities were evaluated quantitatively on several established open-ended QA benchmarks: MSRVTT-QA [40], MSVD-QA [39], TGIF-QA [16], and ActivityNet-QA [44]. These datasets assess a model's ability to generate accurate answers without any dataset-specific fine-tuning. Following a zero-shot evaluation protocol, we used Vicuna-13b-v1.5 as the judge to rate the model's understanding and predictive accuracy, assigning scores on a scale from 1 to 5. Results are presented in Table 3.
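The judge-based scoring described above can be sketched as a short evaluation loop. The snippet below is a minimal illustration rather than the official evaluation code: `query_vicuna` is a hypothetical wrapper around a Vicuna-13b-v1.5 serving endpoint, and the exact prompt wording, JSON schema, and response parsing are assumptions made for clarity.

```python
# Minimal sketch of LLM-as-judge scoring for zero-shot VideoQA,
# assuming a locally hosted Vicuna-13b-v1.5 model. The prompt format
# and parsing below are illustrative assumptions, not the paper's code.
import json
import re

JUDGE_PROMPT = (
    "You are evaluating a video question-answering model.\n"
    "Question: {question}\n"
    "Correct Answer: {answer}\n"
    "Predicted Answer: {prediction}\n"
    'Reply with a JSON object: {{"pred": "yes" or "no", "score": 1-5}}.'
)


def query_vicuna(prompt: str) -> str:
    """Hypothetical wrapper around a Vicuna-13b-v1.5 inference endpoint."""
    raise NotImplementedError("Connect this to your own serving stack.")


def judge_sample(question: str, answer: str, prediction: str) -> dict:
    """Ask the judge LLM for a correctness verdict and a 1-5 score."""
    reply = query_vicuna(
        JUDGE_PROMPT.format(question=question, answer=answer, prediction=prediction)
    )
    match = re.search(r"\{.*\}", reply, re.DOTALL)  # tolerate extra prose
    return json.loads(match.group(0)) if match else {"pred": "no", "score": 1}


def evaluate(samples: list[dict]) -> tuple[float, float]:
    """Return (accuracy, mean 1-5 score) over a list of QA samples."""
    verdicts = [
        judge_sample(s["question"], s["answer"], s["prediction"]) for s in samples
    ]
    accuracy = sum(v["pred"] == "yes" for v in verdicts) / len(verdicts)
    mean_score = sum(v["score"] for v in verdicts) / len(verdicts)
    return accuracy, mean_score
```

In this setup, accuracy reflects the fraction of predictions the judge marks correct, while the mean score corresponds to the 1-to-5 quality rating reported in Table 3.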
In comparison to Video-ChatGPT, PG-Video-LLaVA demonstrates superior performance, surpassing not only its predecessor but also other notable models in the field, such as FrozenBiLM [41] and Video Chat [15]. These results indicate that PG-Video-LLaVA comprehends video content more accurately and generates more contextually relevant answers, establishing a new state of the art in zero-shot VideoQA.
As shown in Figure 4, our method visually grounds the key objects in a given video. The improvement in the model's ability to describe video content is demonstrated in Figure 3. Furthermore, adding the audio modality helps the model produce correct outputs, whereas without audio it fails to capture these details from the visual content alone (Figure 5).
This paper is available on arxiv under CC BY 4.0 DEED license.