Authors:
(1) Shehan Munasinghe, Mohamed bin Zayed University of AI and Equal Contribution;
(2) Rusiru Thushara, Mohamed bin Zayed University of AI and Equal Contribution;
(3) Muhammad Maaz, Mohamed bin Zayed University of AI;
(4) Hanoona Abdul Rasheed, Mohamed bin Zayed University of AI;
(5) Salman Khan, Mohamed bin Zayed University of AI and Australian National University;
(6) Mubarak Shah, University of Central Florida;
(7) Fahad Khan, Mohamed bin Zayed University of AI and Linköping University.
Editor's Note: This is Part 10 of 10 of a study detailing the development of a smarter AI model for videos. Read the rest below.
Supplementary Material
Here, we outline the implementation details of audio modality integration in PG-Video-LLaVA.
To generate audio transcripts, we first experimented with using the state-of-the-art Whisper [24] directly. However, the obtained transcripts were too noisy and contained hallucinations and unwanted text, such as song lyrics. Passing these raw transcripts to the LLM without any filtering can negatively affect the overall model’s performance. Therefore, a preprocessing method is required to filter out noisy text and keep only the parts of the audio that carry meaningful information.
The following steps, combining WhisperX [2] and Whisper-AT [10], are used to refine the original Whisper transcripts so that they are usable as inputs to the video LMM.
1. We first apply VAD-based preliminary filtering to the audio, and then use the Whisper model with phoneme-based forced alignment to obtain temporally aligned text transcriptions.
2. Since Whisper identifies the spoken language, all non-English speech can be discarded at this point, as PG-Video-LLaVA generates responses in English.
3. For each sentence segment obtained, the original audio is sliced at the corresponding timestamps and passed to Whisper-AT to produce audio-tagging output.
4. For each sentence segment, the top 3 predicted audio classes are considered (the filtering rule is sketched after this list): (a) if “speech” is not among the top 3 predictions, the segment is ignored; (b) if P[music] > P[speech] and P[music] − P[speech] > threshold, the segment is ignored (the threshold is set empirically to 1.1).
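The filtering rule in step 4 reduces to a small amount of logic. The sketch below illustrates it in Python; the segment layout (a text field plus a class-to-score mapping) and the helper names are assumptions made for illustration, not the exact implementation used in PG-Video-LLaVA.

```python
# Sketch of the audio-tag filtering rule (step 4). The segment layout and
# helper names are illustrative assumptions; only the rule itself
# (keep "speech", drop music-dominated segments) follows the text above.

MUSIC_MARGIN = 1.1  # threshold set empirically, as stated above


def keep_segment(tags: dict) -> bool:
    """tags maps audio-tag class names to Whisper-AT scores for one segment."""
    top3 = sorted(tags, key=tags.get, reverse=True)[:3]
    if "speech" not in top3:
        return False  # rule (a): no speech among the top-3 tags
    music, speech = tags.get("music", float("-inf")), tags["speech"]
    if music > speech and (music - speech) > MUSIC_MARGIN:
        return False  # rule (b): music clearly dominates speech
    return True


def filter_transcript(segments: list) -> str:
    """segments: [{'text': str, 'tags': {class_name: score}}, ...] (assumed layout)."""
    return " ".join(s["text"] for s in segments if keep_segment(s["tags"]))
```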
Figure 7 shows the effectiveness of our audio transcript preprocessing method in filtering out hallucinations, music, and garbage characters from the raw audio transcript.
The following prompt template is used when combining the spatiotemporal video features and audio transcript with the user instruction text.
SYSTEM:
You are PG-Video-LLaVA, a large vision-language assistant. You are able to understand the video content that the user provides, and assist the user with a variety of tasks using natural language.
USER:
<Instruction> <Video-Tokens>
The noisy audio transcript of this video is: <Audio-Transcript>
ASSISTANT:
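For concreteness, the template can be assembled programmatically before it is fed to the LMM. The snippet below is a minimal sketch of that assembly; the function name, the exact whitespace, and the placeholder handling are assumptions rather than the project's actual code.

```python
# Minimal sketch of assembling the prompt template above. The separators and
# placeholder handling are assumptions; only the template text follows the paper.

SYSTEM_PROMPT = (
    "You are PG-Video-LLaVA, a large vision-language assistant. "
    "You are able to understand the video content that the user provides, "
    "and assist the user with a variety of tasks using natural language."
)


def build_prompt(instruction: str, video_tokens: str, audio_transcript: str) -> str:
    user_turn = (
        f"{instruction} {video_tokens}\n"
        f"The noisy audio transcript of this video is: {audio_transcript}"
    )
    return f"SYSTEM: {SYSTEM_PROMPT}\nUSER: {user_turn}\nASSISTANT:"
```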
We introduce novel benchmarks for quantitatively evaluating conversation-based video spatial grounding, based on two existing spatio-temporal video grounding datasets, VidSTG[48] and HC-STVG[34].
In conversation-based spatial grounding, the objective is to localize interrogative sentences with unknown objects in the given video (e.g. “What is caught by the squatting boy on the floor?”). Unlike grounding for declarative sentences, where the explicit characteristics of the object (e.g. the class “toy” and the visual appearance “yellow”) are present within the sentence itself, grounding for interrogative sentences is challenging because it can rely only on the relationships between the unknown object and other objects (e.g. the action relation “caught by the squatting boy” and the spatial relation “on the floor”) (Figure 6). A benchmark based on this task can therefore be regarded as a measure of a video-language model’s ability to construct such relationships and reason across modalities.
To evaluate our model for conversation-based video spatial grounding, we pass interrogative prompts to the model. It then generates descriptive textual responses to these prompts, from which Vicuna-13b-v1.5 extracts relevant referring expressions. These expressions are then passed into the GroundingDINO-based spatial grounding and tracking module. For the obtained object tracks, bounding box IoU is calculated by comparing them with the ground truth annotations.
To form a spatial-only grounding benchmark from the two spatiotemporal grounding datasets, we crop each video along the temporal axis so that it contains only the segment in which the target object is present, and report the mean spatial IoU as the comparison metric.
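As a reference for how this metric can be computed, the sketch below averages the per-frame box IoU between a predicted track and the ground-truth track over the cropped segment; the (x1, y1, x2, y2) box format and the frame-indexed dictionaries are assumptions about the data layout, not the benchmark's exact code.

```python
# Sketch of the mean spatial IoU metric over a temporally cropped segment.
# Boxes are assumed to be (x1, y1, x2, y2) tuples keyed by frame index.

def box_iou(a, b):
    ax1, ay1, ax2, ay2 = a
    bx1, by1, bx2, by2 = b
    ix1, iy1 = max(ax1, bx1), max(ay1, by1)
    ix2, iy2 = min(ax2, bx2), min(ay2, by2)
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    union = (ax2 - ax1) * (ay2 - ay1) + (bx2 - bx1) * (by2 - by1) - inter
    return inter / union if union > 0 else 0.0


def mean_spatial_iou(pred_track: dict, gt_track: dict) -> float:
    """Average IoU over the frames that carry a ground-truth annotation."""
    ious = [box_iou(pred_track.get(f, (0, 0, 0, 0)), box) for f, box in gt_track.items()]
    return sum(ious) / len(ious) if ious else 0.0
```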
It should be noted that we evaluate our model on these benchmarks only in the zero-shot setting, without any training on these datasets.
Benchmark based on the VidSTG Dataset: The VidSTG dataset consists of videos paired with multiform sentences (both interrogative and declarative). To form a benchmark that quantitatively evaluates conversation-based video spatial grounding, we leverage the 5,693 video and interrogative-sentence pairs in its test set.
Benchmark based on the HC-STVG Dataset: Unlike VidSTG, the HC-STVG dataset contains only declarative sentences for all of its videos. Therefore, interrogative sentences are first generated from the declarative captions of 3,025 test-set samples using the Vicuna-13b-v1.5 model, and the evaluation is then performed in the same manner as for VidSTG.
The original text annotations in the HC-STVG dataset are in declarative statement form. To make them compatible with our prompt-based grounding evaluation pipeline, we extract interrogative statements (questions) from these annotations with Vicuna-13b-v1.5, using the following prompt template.
SYSTEM:
You are an intelligent chatbot designed for generating question-answer pairs from sentences.
USER:
Your task is to generate a question and answer from the given sentence. The question should start with ’Who’. The question should refer to the subject of the given sentence. The answer should include the subject of the given sentence. Please generate the response in the form of a Python dictionary string with keys ’Q’ for question and ’A’ for answer. Each corresponding value should be the question and answer text respectively. For example, your response should look like this: {’Q’: ’Your question here...’, ’A’: ’Your answer here...’}. Please note that the generated question and answer should only include information from the given sentence. Please process the following sentence: The man in the suit goes to the man in white and looks at him.
ASSISTANT:
{’Q’: ’Who goes to the man in white?’, ’A’:’The man in the suit’}
USER:
Please process the following sentence: <DECLARATIVE_STATEMENT>
ASSISTANT:
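Because the prompt constrains Vicuna to return a Python dictionary string, the reply can be parsed directly. The snippet below is a minimal sketch that assumes the model's raw reply has already been captured in a string; `ast.literal_eval` is used instead of `eval` for safety.

```python
# Sketch: parsing the question-answer pair that the prompt above asks
# Vicuna-13b-v1.5 to return as a Python dictionary string.
import ast


def parse_qa_reply(reply: str) -> tuple:
    # Keep only the dictionary portion in case the model adds extra text.
    start, end = reply.find("{"), reply.rfind("}") + 1
    qa = ast.literal_eval(reply[start:end])
    return qa["Q"], qa["A"]


question, answer = parse_qa_reply(
    "{'Q': 'Who goes to the man in white?', 'A': 'The man in the suit'}"
)
```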
In the quantitative evaluation, we use the following prompt template with Vicuna-13b-v1.5 to extract the referring expression from the output of the video-based LMM; this expression is then used as the input prompt to the off-the-shelf grounding module.
SYSTEM:
You are an intelligent chatbot designed for identifying the most relevant subject/object phrases in video-based question-sentence pairs.
USER:
Your task is to compare the question with the sentence, and extract the subject or object phrase of the sentence that most accurately answers the given question. The selected phrase should be short and should contain only one noun. The selected phrase can include adjectives that explain the attributes of the subject/object.
The selected phrase should not exceed 4 words. The selected phrase should not include articles (’a’, ’the’, ’and’). Please generate the response in the form of a Python dictionary string with keys ’OBJECT’, where its value is the extracted phrase in Python string format. DO NOT PROVIDE ANY OTHER OUTPUT TEXT OR EXPLANATION. Only provide the Python dictionary. For example, your response should look like this: {’OBJECT’: ’green toy’}. Please process the following video-based question-answer pair:
Question: who is in front of the guitar at the show? Answer: A woman in a black dress is in front of the guitar on stage.
ASSISTANT:
{’OBJECT’: ’woman in black dress’}
USER:
Question: who points to the window?
Answer: The old man is pointing to the window.
ASSISTANT:
{’OBJECT’: ’old man’}
USER:
Question: who is inside the blue car?
Answer: The driver of the blue car.
ASSISTANT:
{’OBJECT’: ’driver’}
USER:
Please process the following video-based question-answer pair:
Question: <INPUT_TO_VIDEO_LMM>
Answer: <OUTPUT_OF_VIDEO_LMM>
ASSISTANT:
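Wiring this step together looks roughly as follows; `run_vicuna` is a hypothetical stand-in for whatever inference call queries Vicuna-13b-v1.5 with the template above, and the parsing mirrors the sketch shown earlier.

```python
# Sketch of the referring-expression extraction step. `run_vicuna` is a
# hypothetical wrapper around the Vicuna-13b-v1.5 inference call.
import ast


def extract_referring_expression(question: str, lmm_answer: str, run_vicuna) -> str:
    user_turn = (
        "Please process the following video-based question-answer pair:\n"
        f"Question: {question}\nAnswer: {lmm_answer}"
    )
    reply = run_vicuna(user_turn)  # e.g. "{'OBJECT': 'old man'}"
    start, end = reply.find("{"), reply.rfind("}") + 1
    return ast.literal_eval(reply[start:end])["OBJECT"]
```

The returned phrase then serves as the text prompt for the GroundingDINO-based spatial grounding and tracking module.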
As shown in Figure 2, our method employs an LLM-powered entity matching module similar to [49] to match the key phrases in the video-LMM’s output with the object tracks obtained from the grounding and tracking module. We use the same prompt template as [49].
We leverage the video-based conversation performance benchmarks introduced in Video-ChatGPT [22], while changing the evaluation LLM from GPT-3.5-Turbo to Vicuna-13b-v1.5. The prompt templates used with Vicuna are the same as in [22].
Video-based Generative Performance Benchmarking: In this benchmark, we use the same test set of 500 samples curated from ActivityNet-200 [11] videos as in [22].
Zero-Shot Question-Answer Evaluation: Following Video-ChatGPT, we perform zero-shot evaluation on four standard open-ended question-answer datasets: MSRVTT [40], MSVD [39], TGIF [16], and ActivityNet-QA [44]. No training is performed on these datasets.
This paper is available on arXiv under a CC BY 4.0 DEED license.