Authors:
(1) Shehan Munasinghe, Mohamed bin Zayed University of AI and Equal Contribution;
(2) Rusiru Thushara, Mohamed bin Zayed University of AI and Equal Contribution;
(3) Muhammad Maaz, Mohamed bin Zayed University of AI;
(4) Hanoona Abdul Rasheed, Mohamed bin Zayed University of AI;
(5) Salman Khan, Mohamed bin Zayed University of AI and Australian National University;
(6) Mubarak Shah, University of Central Florida;
(7) Fahad Khan, Mohamed bin Zayed University of AI and Linköping University.
Editor's Note: This is Part 4 of 10 of a study detailing the development of a smarter AI model for videos. Read the rest below.
Supplementary Material
In our architectural design, the spatio-temporal feature extraction is inspired by Video-ChatGPT [22], with an additional enhancement of employing a higher resolution of 336×336 pixels to encode frame-level features.
Within the architecture of PG-Video-LLaVA, we have implemented a learnable Multi-Layer Perceptron (MLP), designated as g, to serve as our cross-modal connector. This MLP projects video-level features into the embedding space of the language decoder. The design is inspired by LLaVA-1.5 [18] and aims to improve the model's multi-modal capabilities beyond what a simple linear projection could achieve. Applying g to the video-level features yields the language embedding tokens Qv.
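A connector of this kind can be illustrated with a small, dependency-free sketch. It assumes a two-layer MLP with a GELU activation, in the style of LLaVA-1.5; the depth, the toy dimensions, the initialization, and the helper names (`make_connector`, `linear`, `init_linear`) are illustrative assumptions rather than details given in the text.

```python
import math
import random

def gelu(x):
    # Exact GELU, written with the Gaussian error function.
    return 0.5 * x * (1.0 + math.erf(x / math.sqrt(2.0)))

def linear(x, weight, bias):
    # x: d_in floats; weight: d_out rows of d_in floats; bias: d_out floats.
    return [sum(w * xi for w, xi in zip(row, x)) + b
            for row, b in zip(weight, bias)]

def init_linear(d_in, d_out, rng):
    # Small uniform initialization (an assumption, chosen only for the demo).
    scale = 1.0 / math.sqrt(d_in)
    weight = [[rng.uniform(-scale, scale) for _ in range(d_in)]
              for _ in range(d_out)]
    return weight, [0.0] * d_out

def make_connector(d_video, d_hidden, d_lang, seed=0):
    # Hypothetical helper: builds the parameters of a two-layer MLP.
    rng = random.Random(seed)
    return init_linear(d_video, d_hidden, rng), init_linear(d_hidden, d_lang, rng)

def g(video_tokens, params):
    # Projects each video-level feature vector into the language embedding space.
    (w1, b1), (w2, b2) = params
    projected = []
    for token in video_tokens:
        hidden = [gelu(h) for h in linear(token, w1, b1)]
        projected.append(linear(hidden, w2, b2))
    return projected

# Toy usage: 3 video tokens of width 8 become 3 language tokens of width 4.
params = make_connector(d_video=8, d_hidden=16, d_lang=4)
video_tokens = [[0.1 * i] * 8 for i in range(3)]
q_v = g(video_tokens, params)
print(len(q_v), len(q_v[0]))  # prints "3 4"
```

The number of tokens is preserved and only the feature width changes, which is what allows the projected tokens to be placed alongside text embeddings in the decoder input.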
In PG-Video-LLaVA, we have integrated an audio processing pipeline, inspired by the architecture of WhisperX [2], that enhances video question answering by incorporating audio cues from the input. The process begins with a Voice Activity Detection (VAD) model, which pinpoints the temporal segments of the audio track that contain speech. These segments are then cut, merged, and padded to match the input specifications of the Whisper model [24]. In parallel, a phoneme segmentation model produces phone-level segmentations, which are needed for the subsequent alignment of raw transcriptions with the audio.
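The cut, merge, and pad step can be sketched roughly as follows. The sketch assumes that speech segments arrive as sorted `(start, end)` pairs in seconds and that the target is Whisper's 30-second input window; the greedy merging rule and the function names are illustrative assumptions, not the exact procedure used in the pipeline.

```python
MAX_CHUNK_S = 30.0  # Whisper consumes 30-second audio windows

def cut_and_merge(segments, max_len=MAX_CHUNK_S):
    """segments: sorted (start, end) speech intervals in seconds.
    Returns chunks that are each at most max_len seconds long."""
    # Cut: split any segment that is longer than the window.
    pieces = []
    for start, end in segments:
        while end - start > max_len:
            pieces.append((start, start + max_len))
            start += max_len
        pieces.append((start, end))
    # Merge: greedily extend the current chunk while the total span still fits.
    chunks = []
    for start, end in pieces:
        if chunks and end - chunks[-1][0] <= max_len:
            chunks[-1] = (chunks[-1][0], end)
        else:
            chunks.append((start, end))
    return chunks

def padding_needed(chunk, max_len=MAX_CHUNK_S):
    # Seconds of padding required to fill the window for this chunk.
    start, end = chunk
    return max_len - (end - start)

chunks = cut_and_merge([(0.0, 4.0), (5.0, 12.0), (40.0, 75.0)])
print(chunks)                     # [(0.0, 12.0), (40.0, 70.0), (70.0, 75.0)]
print(padding_needed(chunks[0]))  # 18.0
```

Merging short neighbouring segments gives the recognizer more context per window, while cutting keeps every chunk within the model's fixed input length.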
The VAD model serves a dual purpose: it identifies speech segments and also helps filter out non-speech audio components. To enhance the compatibility of transcriptions generated by Whisper with our model, we integrate Whisper-AT [10], a version of the Whisper model that specializes in audio tagging. It annotates the audio stream with labels from a set of 527 audio event classes, allowing for precise temporal resolution.
The audio transcripts are then subjected to a multistage filtering process. First, a VAD-based filter is applied, followed by a phoneme-based forced alignment using the Whisper model, which ensures temporally accurate text transcriptions. Using Whisper’s language identification feature, we eliminate non-English speech segments at this stage. For each identified sentence segment, we apply Whisper-AT [10] for audio tagging and consider the top three predicted audio classes. Segments that do not predominantly feature ‘speech’, or where the ‘music’ probability significantly exceeds the ‘speech’ probability, are excluded from further processing.
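A minimal sketch of the last two filtering stages is given below. The text does not specify how "predominantly" or "significantly" are quantified, so the rule used here (speech must appear among the top three tags, and music may not exceed speech by more than a margin) and the `MUSIC_MARGIN` value are assumptions made for illustration, as are the label strings and the function name.

```python
MUSIC_MARGIN = 0.2  # hypothetical margin; the text gives no numeric threshold

def keep_segment(tag_probs, language, top_k=3, music_margin=MUSIC_MARGIN):
    """tag_probs: dict mapping an audio-event label to its predicted probability.
    language: language code detected for the segment (for example "en")."""
    # Stage 1: drop non-English segments.
    if language != "en":
        return False
    # Stage 2: speech must be among the top-k predicted audio classes.
    top = sorted(tag_probs, key=tag_probs.get, reverse=True)[:top_k]
    if "speech" not in top:
        return False
    # Stage 3: reject segments where music clearly dominates speech.
    music = tag_probs.get("music", 0.0)
    return music - tag_probs["speech"] <= music_margin

print(keep_segment({"speech": 0.8, "music": 0.1, "dog": 0.05}, "en"))       # True
print(keep_segment({"speech": 0.3, "music": 0.9, "applause": 0.1}, "en"))   # False
print(keep_segment({"speech": 0.8, "music": 0.1, "dog": 0.05}, "fr"))       # False
```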
Finally, the audio transcript is integrated with the video component through a carefully designed prompt template. This template guides the system to understand user instructions, assimilate the video frames, and incorporate the transcriptions generated by the automatic speech recognition model. This structured approach lets PG-Video-LLaVA leverage both available modalities, visual and auditory, so that users can complete tasks and resolve queries based on an analysis of both kinds of content (refer to Figure 2 for details).
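To make the idea of such a template concrete, the following sketch assembles a prompt from a video placeholder, an optional transcript, and the user instruction. The actual template is the one shown in Figure 2; the placeholder token, the field labels, and the function name below are hypothetical and serve only to show how the three inputs might be combined.

```python
VIDEO_PLACEHOLDER = "<video>"  # hypothetical token standing in for the video embeddings

def build_prompt(user_instruction, transcript=None):
    # Order assumed for illustration: video tokens, then transcript, then instruction.
    parts = [VIDEO_PLACEHOLDER]
    if transcript:
        parts.append(f"Audio transcript: {transcript.strip()}")
    parts.append(f"User: {user_instruction.strip()}")
    parts.append("Assistant:")
    return "\n".join(parts)

print(build_prompt("What is the speaker explaining?", "Today we will bake bread."))
```

When no usable transcript survives filtering, the transcript line is simply omitted, so the same template covers videos with and without speech.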
In PG-Video-LLaVA, our spatial grounding approach starts with processing video-question pairs to generate textual descriptions. These descriptions are then used for grounding within the video frames. Key noun phrases are extracted from the generated text using Vicuna, targeting the most critical content aspects. Simultaneously, an image tagging model, RAM [47], tags visual elements in each frame, creating a detailed map of the video content.
The video is segmented into smaller parts using PySceneDetect [1], based on changes in scene composition. This segmentation facilitates a more focused grounding process. In each segment, our grounding ensemble, composed of GroundingDINO [20], DEVA [6], and SAM [12], utilizes the image tags to create segmentation masks and tracking IDs for the identified visual elements.
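The idea behind content-based scene segmentation can be illustrated with a simplified stand-in. This is not PySceneDetect's implementation: the sketch assumes frames are given as flat lists of pixel intensities and starts a new segment whenever the mean absolute difference between consecutive frames exceeds an arbitrary threshold.

```python
def mean_abs_diff(frame_a, frame_b):
    # Average per-pixel intensity change between two equally sized frames.
    return sum(abs(a - b) for a, b in zip(frame_a, frame_b)) / len(frame_a)

def split_scenes(frames, threshold=27.0):
    """Returns (start, end) frame-index pairs, with end exclusive.
    The threshold value is an arbitrary choice for this toy example."""
    if not frames:
        return []
    scenes, start = [], 0
    for i in range(1, len(frames)):
        if mean_abs_diff(frames[i - 1], frames[i]) > threshold:
            scenes.append((start, i))
            start = i
    scenes.append((start, len(frames)))
    return scenes

frames = [[10, 10], [12, 11], [200, 210], [198, 205]]
print(split_scenes(frames))  # [(0, 2), (2, 4)]
```

Each resulting segment can then be handed to the grounding ensemble on its own, so that tracking IDs only need to stay consistent within a single scene.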
The visual cues from these segmentation masks are then matched with the textual noun phrases using CLIP [30]. This matching process links each noun phrase to the corresponding visual elements in the video.
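The matching step can be sketched as a nearest-neighbour search by cosine similarity. The sketch assumes that each noun phrase and each tracked region has already been embedded into a shared space (for example by CLIP's text and image encoders); the toy vectors, the `min_sim` cutoff, and the function names are illustrative assumptions.

```python
import math

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

def match_phrases(phrase_embs, track_embs, min_sim=0.0):
    """phrase_embs: {phrase: vector}; track_embs: {tracking_id: vector}.
    Returns {phrase: best tracking_id, or None if nothing exceeds min_sim}."""
    matches = {}
    for phrase, p_vec in phrase_embs.items():
        best_id, best_sim = None, min_sim
        for track_id, t_vec in track_embs.items():
            sim = cosine(p_vec, t_vec)
            if sim > best_sim:
                best_id, best_sim = track_id, sim
        matches[phrase] = best_id
    return matches

phrase_embs = {"a dog": [1.0, 0.0], "a ball": [0.0, 1.0]}
track_embs = {1: [0.9, 0.1], 2: [0.2, 0.8]}
print(match_phrases(phrase_embs, track_embs))  # {'a dog': 1, 'a ball': 2}
```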
For quantitative analysis, a referring expression or phrase is extracted using Vicuna from the model's descriptive textual response to a question. This phrase is passed to our grounding module, which generates segmentation masks and tracking IDs. We measure the spatial grounding accuracy of our model by calculating the Intersection over Union (IoU) between these segmentation masks and the ground-truth bounding boxes.
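One plausible way to compute this metric is sketched below. The text does not state how a mask is compared with a box, so the sketch assumes that each predicted mask is first reduced to its tight bounding box and that boxes are given as `(x_min, y_min, x_max, y_max)` with exclusive maxima; both are assumptions made for illustration.

```python
def mask_to_box(mask):
    """mask: 2D list of 0/1 rows. Returns the tight box (x_min, y_min, x_max, y_max)
    with exclusive maxima, or None if the mask is empty."""
    ys = [y for y, row in enumerate(mask) if any(row)]
    if not ys:
        return None
    xs = [x for row in mask for x, v in enumerate(row) if v]
    return (min(xs), min(ys), max(xs) + 1, max(ys) + 1)

def box_iou(a, b):
    # Intersection width and height, clamped at zero when the boxes do not overlap.
    iw = max(0, min(a[2], b[2]) - max(a[0], b[0]))
    ih = max(0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = iw * ih
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union else 0.0

mask = [
    [0, 0, 0, 0],
    [0, 1, 1, 0],
    [0, 1, 1, 0],
    [0, 0, 0, 0],
]
pred_box = mask_to_box(mask)   # (1, 1, 3, 3)
gt_box = (1, 1, 4, 3)          # toy ground-truth box
print(box_iou(pred_box, gt_box))  # 4 / 6, roughly 0.667
```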
This systematic approach enables PG-Video-LLaVA to effectively ground textual descriptions within video content, thereby improving the performance and interpretability of video-question answering systems.
This paper is available on arXiv under the CC BY 4.0 DEED license.