
UAE Researchers Say Audio is the Secret Sauce in Helping AI Understand Videos


Too Long; Didn't Read

Researchers in the UAE have developed an AI model that can find and focus on objects in videos, outperforming other models at the task.

Authors:

(1) Shehan Munasinghe, Mohamed bin Zayed University of AI (Equal Contribution);

(2) Rusiru Thushara, Mohamed bin Zayed University of AI (Equal Contribution);

(3) Muhammad Maaz, Mohamed bin Zayed University of AI;

(4) Hanoona Abdul Rasheed, Mohamed bin Zayed University of AI;

(5) Salman Khan, Mohamed bin Zayed University of AI and Australian National University;

(6) Mubarak Shah, University of Central Florida;

(7) Fahad Khan, Mohamed bin Zayed University of AI and Linköping University.

Editor's Note: This is Part 8 of 10 of a study detailing the development of a smarter AI model for videos. Read the rest below.



4.4. Zero-Shot Visual Question Answering

For PG-Video-LLaVA, zero-shot question-answering (QA) capabilities were evaluated quantitatively on several established open-ended QA benchmarks: MSRVTT-QA [40], MSVD-QA [39], TGIF-QA [16], and ActivityNet-QA [44]. These datasets assess a model's ability to generate accurate answers without any dataset-specific fine-tuning. We adopted a zero-shot evaluation methodology, using Vicuna-13b-v1.5 to judge the model's understanding and predictive accuracy, with scores assigned on a scale from 1 to 5. Results are presented in Table 3.


Table 3. Zero-shot video-based question-answering: comparison of PG-Video-LLaVA with other video generative models. The latest available models are used for all approaches, and the benchmarks are computed using the open-source Vicuna LLM. PG-Video-LLaVA outperforms previously proposed video-based conversational methods.


Figure 5. Qualitative Results for Including Audio Modality: the figure illustrates the integrated audio processing pipeline that augments video question-answering with audio cues. Side-by-side comparisons show how audio cues offer additional context, leading to a more accurate interpretation of the video content.
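To make the evaluation protocol concrete, the following is a minimal sketch of how an LLM-as-judge scoring loop of this kind is typically wired up. The judge prompt wording and the `llm_generate` callable are illustrative assumptions, not the authors' exact setup; they stand in for querying a judge such as Vicuna-13b-v1.5.

```python
# Minimal sketch of an LLM-as-judge loop for zero-shot video QA.
# The prompt wording and the `llm_generate` callable are illustrative
# assumptions; they stand in for a judge such as Vicuna-13b-v1.5.
import json
import re

JUDGE_PROMPT = (
    "You are evaluating a video question-answering model.\n"
    "Question: {question}\n"
    "Ground-truth answer: {answer}\n"
    "Predicted answer: {prediction}\n"
    'Reply with JSON: {{"correct": "yes" or "no", "score": 1-5}}'
)

def judge(llm_generate, question, answer, prediction):
    """Ask the judge LLM for a yes/no verdict and a 1-5 quality score."""
    reply = llm_generate(JUDGE_PROMPT.format(
        question=question, answer=answer, prediction=prediction))
    # Pull the first JSON object out of the judge's reply.
    verdict = json.loads(re.search(r"\{.*\}", reply, re.S).group(0))
    return verdict["correct"].lower() == "yes", int(verdict["score"])

def evaluate(llm_generate, samples):
    """Return (accuracy, mean score) over (question, answer, prediction) triples."""
    n_correct, score_sum = 0, 0
    for question, answer, prediction in samples:
        is_correct, score = judge(llm_generate, question, answer, prediction)
        n_correct += is_correct
        score_sum += score
    return n_correct / len(samples), score_sum / len(samples)
```

Accuracy and the mean 1-5 score produced by a loop of this shape are the two numbers reported per benchmark in evaluations of this style.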


In comparison to Video-ChatGPT, PG-Video-LLaVA demonstrates superior performance, surpassing not only its predecessor but also other notable models in the field, such as FrozenBiLM [41] and Video Chat [15]. Our evaluations indicate that PG-Video-LLaVA has significantly enhanced its ability to comprehend video content and generate contextually relevant answers, thus establishing a new state of the art in zero-shot VideoQA.


As shown in Figure 4, our method is able to visually ground the key objects in the given video. The improvement in the model's capability to describe video content is demonstrated in Figure 3. Further, adding the audio modality helps the model produce correct outputs, whereas without audio it fails to capture those details from the visual content alone (Figure 5).
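As a rough illustration of the audio-augmented pipeline behind Figure 5, the sketch below transcribes a video's audio track and folds the transcript into the question prompt. It assumes the openai-whisper package for speech recognition; the `video_qa_model.answer` interface is a hypothetical placeholder, not PG-Video-LLaVA's actual API.

```python
# Illustrative sketch of enriching video QA with audio cues.
# Assumes the openai-whisper package for transcription; the
# `video_qa_model.answer` interface is a hypothetical placeholder.
import whisper

def answer_with_audio(video_qa_model, video_path, question):
    # Whisper extracts the audio track via ffmpeg and transcribes it.
    asr = whisper.load_model("base")
    transcript = asr.transcribe(video_path)["text"].strip()

    # Fold the transcript into the prompt so the model can draw on
    # spoken content in addition to the visual stream.
    prompt = f"Audio transcript: {transcript}\nQuestion: {question}"
    return video_qa_model.answer(video_path, prompt)
```

In failure cases like those in Figure 5, it is this transcript context that lets the model recover details the visual stream alone misses.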


This paper is available on arXiv under the CC BY 4.0 DEED license.