
UAE Researchers Teach AI to Watch, Listen, and Understand Videos Like Humans


Too Long; Didn't Read

Researchers in the UAE have developed an AI model that can find and focus on objects in videos, outperforming other models at this task.

Authors:

(1) Shehan Munasinghe, Mohamed bin Zayed University of AI (Equal Contribution);

(2) Rusiru Thushara, Mohamed bin Zayed University of AI (Equal Contribution);

(3) Muhammad Maaz, Mohamed bin Zayed University of AI;

(4) Hanoona Abdul Rasheed, Mohamed bin Zayed University of AI;

(5) Salman Khan, Mohamed bin Zayed University of AI and Australian National University;

(6) Mubarak Shah, University of Central Florida;

(7) Fahad Khan, Mohamed bin Zayed University of AI and Linköping University.

Editor's Note: This is Part 5 of 10 of a study detailing the development of a smarter AI model for videos. Read the rest below.


Supplementary Material

4.1. Implementation Details


For audio transcript extraction, the base Whisper model is used. Our grounding module builds on the GroundingDINO-T variant and CLIP ViT-B/32. For the image-tagging model, we use the RAM Swin-Large variant (with input size 384). The DEVA tracker is applied in the online setting in our experiments.
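As a rough illustration, the sketch below shows how the two off-the-shelf components with stable Python APIs (the base Whisper model and CLIP ViT-B/32) are typically loaded; this is not the authors' released code, the file name is a placeholder, and GroundingDINO-T, RAM, and DEVA are loaded from their own repositories and omitted here.

```python
import torch
import whisper   # openai-whisper package
import clip      # openai/CLIP package

device = "cuda" if torch.cuda.is_available() else "cpu"

# Audio transcript extraction with the base Whisper model
asr = whisper.load_model("base", device=device)
transcript = asr.transcribe("video_audio.wav")["text"]  # placeholder file

# CLIP ViT-B/32, used alongside GroundingDINO-T in the grounding module
clip_model, clip_preprocess = clip.load("ViT-B/32", device=device)
```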


The Vicuna-13b-v1.5 model is used for video-based conversational benchmarking, for zero-shot question-answering evaluation, and for extracting the key noun or referring expression from the model output in the quantitative evaluation of the spatial grounding task. Vicuna-13b-v1.5 is also used to implement the entity matching as in [49].
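To illustrate the extraction step, the hedged sketch below loads the public lmsys/vicuna-13b-v1.5 checkpoint through Hugging Face transformers and prompts it to pull the key noun phrase out of a model answer. The prompt wording and the example answer are assumptions for illustration only, not the prompt used in the paper.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "lmsys/vicuna-13b-v1.5"  # public checkpoint on the Hugging Face Hub
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id, torch_dtype=torch.float16, device_map="auto"
)

answer = "The man in the red jacket is skiing down the slope."  # illustrative
# Illustrative prompt in Vicuna's USER/ASSISTANT chat format; not the paper's prompt.
prompt = (
    "USER: Extract the single key noun phrase (referring expression) that the "
    f"following sentence is about:\n\"{answer}\"\nASSISTANT:"
)

inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
output_ids = model.generate(**inputs, max_new_tokens=20, do_sample=False)
# Decode only the newly generated tokens after the prompt
extracted = tokenizer.decode(
    output_ids[0][inputs["input_ids"].shape[1]:], skip_special_tokens=True
)
print(extracted)  # e.g. "the man in the red jacket"
```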


This paper is available on arXiv under the CC BY 4.0 DEED license.