
AI Model Developed by UAE Researchers Able to Identify Objects in Videos


Too Long; Didn't Read

Researchers in the UAE have developed an AI model that can find and focus on objects in videos, outperforming other models at this task.

Authors:

(1) Shehan Munasinghe, Mohamed bin Zayed University of AI (Equal Contribution);

(2) Rusiru Thushara, Mohamed bin Zayed University of AI (Equal Contribution);

(3) Muhammad Maaz, Mohamed bin Zayed University of AI;

(4) Hanoona Abdul Rasheed, Mohamed bin Zayed University of AI;

(5) Salman Khan, Mohamed bin Zayed University of AI and Australian National University;

(6) Mubarak Shah, University of Central Florida;

(7) Fahad Khan, Mohamed bin Zayed University of AI and Linköping University.

Editor's Note: This is Part 7 of 10 of a study detailing the development of a smarter AI model for videos. Read the rest below.


Supplementary Material

4.3. Spatial Grounding in Videos

To quantitatively assess PG-Video-LLaVA's spatial grounding capability, we conducted evaluations using two benchmarks derived from the test sets of the VidSTG [48] and HC-STVG [34] datasets.


Table 1. Performance benchmarking of video-based conversational models. Comparative performance evaluation of PG-Video-LLaVA against various models using the benchmarking framework from Video-ChatGPT [22]. The metrics include correctness, detail orientation, contextual understanding, temporal understanding, and consistency. The updated assessment pipeline incorporates Vicuna-13b-v1.5 [7] for enhanced reproducibility, replacing GPT-3.5-Turbo. Results indicate that PG-Video-LLaVA achieves favourable performance across all metrics, particularly in contextual and temporal understanding, as compared to foundational models and recent advancements in the field.


Figure 3. Qualitative comparison of Video-ChatGPT vs. PG-Video-LLaVA (Ours): Qualitative analysis of video descriptions generated by Video-ChatGPT, PG-Video-LLaVA (7B), and PG-Video-LLaVA (13B). The improvement in model performance is evident, with gains in the accuracy of information, richness of descriptive detail, and alignment with the video's context and sequence of events as we move from the baseline Video-ChatGPT to the more advanced PG-Video-LLaVA (13B) model.


Due to the novelty of integrating spatial grounding within video-conversational models, we highlight the modular nature of our grounding pipeline, which can be incorporated with other state-of-the-art video-conversation models.
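To make this modularity concrete, the sketch below shows one way such a grounding pipeline could be wired to an arbitrary video-conversational model. It is a minimal illustration: the interfaces and names (generate_response, ground_expression, vicuna_extract) are assumptions for exposition, not the actual PG-Video-LLaVA API.

```python
from typing import List, Protocol, Tuple

Box = Tuple[float, float, float, float]  # (x1, y1, x2, y2) for one frame


class VideoConversationModel(Protocol):
    """Any video-conversational model that answers free-form prompts."""
    def generate_response(self, video_path: str, prompt: str) -> str: ...


class GroundingPipeline(Protocol):
    """Modular grounder: maps a referring expression to per-frame boxes."""
    def ground_expression(self, video_path: str, expression: str) -> List[Box]: ...


def answer_and_ground(model: VideoConversationModel,
                      grounder: GroundingPipeline,
                      vicuna_extract,  # callable: answer text -> referring expression
                      video_path: str,
                      prompt: str) -> Tuple[str, List[Box]]:
    """Answer the prompt, then spatially ground the key phrase in the video."""
    answer = model.generate_response(video_path, prompt)
    expression = vicuna_extract(answer)
    boxes = grounder.ground_expression(video_path, expression)
    return answer, boxes
```

Because the grounder only consumes text and video, any conversational backbone that produces free-form answers could, in principle, be dropped into the same slot.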


Figure 4. Qualitative Results for Video Grounding: Visual representation of PG-Video-LLaVA's grounding capability. The highlighted regions in each video frame indicate the model's ability to identify and spatially locate key subjects mentioned in the textual description, such as the giraffe, the statue, and the gymnast on a balance beam.


Table 2. Performance of PG-Video-LLaVA and other models on the spatial grounding task: Evaluated using the VidSTG and HC-STVG benchmarks, the results demonstrate PG-Video-LLaVA's favorable spatial grounding capabilities, as evidenced by its ability to generate accurate descriptive responses and effectively locate referring expressions within video frames. The table shows the model's progress, particularly in the 13B version, showcasing its performance among other SoTA video-conversational models.


For the VidSTG dataset, we selectively processed interrogative prompts to assess the grounding accuracy. The model generates descriptive textual responses to these prompts, from which Vicuna-13b-v1.5 extracts relevant referring expressions. These expressions are then spatially grounded in the video frames using our grounding pipeline. For the HC-STVG dataset, interrogative prompts are first mined from the text captions using Vicuna and then used similarly to the VidSTG prompts.
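The prompt-preparation step described above could be sketched as follows. The annotation field names and the vicuna_generate wrapper are hypothetical placeholders, since the exact data format and mining prompt are not given in this section.

```python
def prepare_prompts(dataset_name, annotations, vicuna_generate):
    """Build evaluation prompts, roughly as described above (illustrative only).

    - VidSTG: keep only the interrogative prompts already in the annotations.
    - HC-STVG: mine an interrogative prompt from each caption with Vicuna.
    `vicuna_generate` is an assumed callable wrapping Vicuna-13b-v1.5.
    """
    prompts = []
    for sample in annotations:
        if dataset_name == "vidstg":
            # Assumed field names; VidSTG annotations distinguish prompt types.
            if sample["question_type"] == "interrogative":
                prompts.append((sample["video"], sample["question"], sample["boxes"]))
        elif dataset_name == "hc-stvg":
            # Turn the declarative caption into a question before evaluation.
            question = vicuna_generate(
                "Rewrite this caption as a question about the subject it "
                f"describes: {sample['caption']}"
            )
            prompts.append((sample["video"], question, sample["boxes"]))
    return prompts
```

Each prepared prompt can then be passed through the answer-and-ground flow sketched earlier to obtain per-frame boxes for scoring.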


The results shown in Table 2 compare PG-Video-LLaVA with alternative methods on the same benchmarks and demonstrate the model's improved ability to answer questions accurately, which in turn leads to stronger spatial grounding performance.
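Spatial grounding on benchmarks such as VidSTG and HC-STVG is commonly scored by the overlap between predicted and annotated boxes. The helper below computes a plain per-frame intersection-over-union; it is a generic formulation for illustration, not necessarily the exact metric behind Table 2.

```python
def box_iou(box_a, box_b):
    """Intersection-over-Union of two (x1, y1, x2, y2) boxes."""
    x1 = max(box_a[0], box_b[0])
    y1 = max(box_a[1], box_b[1])
    x2 = min(box_a[2], box_b[2])
    y2 = min(box_a[3], box_b[3])
    inter = max(0.0, x2 - x1) * max(0.0, y2 - y1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


# Example: two partially overlapping boxes on one frame
print(box_iou((10, 10, 60, 60), (30, 30, 80, 80)))  # ~0.22
```

Averaging such per-frame scores over the annotated frames of each clip gives a single number per video, which can then be aggregated across a benchmark.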


The qualitative results shown in Figure 4 emphasize the model’s refined spatial grounding precision. The accurate overlay of masks on the subjects within the videos confirms the model’s adeptness at correlating textual descriptors with visual elements, a critical aspect of contextual comprehension. This refined ability is crucial for applications that integrate visual data with language, improving the model’s utility in environments that demand rich, interactive visual and linguistic processing.


This paper is available on arxiv under CC BY 4.0 DEED license.