Authors:
(1) Shehan Munasinghe, Mohamed bin Zayed University of AI and Equal Contribution;
(2) Rusiru Thushara, Mohamed bin Zayed University of AI and Equal Contribution;
(3) Muhammad Maaz, Mohamed bin Zayed University of AI;
(4) Hanoona Abdul Rasheed, Mohamed bin Zayed University of AI;
(5) Salman Khan, Mohamed bin Zayed University of AI and Australian National University;
(6) Mubarak Shah, University of Central Florida;
(7) Fahad Khan, Mohamed bin Zayed University of AI and Linköping University.
Editor's Note: This is Part 5 of 10 of a study detailing the development of a smarter AI model for videos. Read the rest below.
Supplementary Material
For audio transcript extraction, the base Whisper model is used. Our grounding module is built on the GroundingDINO-T variant and CLIP ViT-B/32. For the image-tagging model, we use the RAM Swin-Large variant (with an input size of 384). The DEVA Tracker is applied in the online setting in our experiments.
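As a minimal illustrative sketch (not the authors' released code), the off-the-shelf components named above could be loaded roughly as follows. The Whisper and CLIP calls follow the public openai-whisper and CLIP packages; the GroundingDINO-T, RAM, and DEVA pieces are only indicated with comments, since their loader APIs depend on the respective repositories.

```python
# Illustrative sketch only: loading the off-the-shelf components named above.
# The surrounding pipeline wiring is hypothetical, not the authors' implementation.
import torch
import whisper  # openai-whisper package
import clip     # OpenAI CLIP package

device = "cuda" if torch.cuda.is_available() else "cpu"

# Audio transcript extraction with the base Whisper model
asr_model = whisper.load_model("base", device=device)
transcript = asr_model.transcribe("example_video_audio.wav")["text"]

# CLIP ViT-B/32 image/text encoder used inside the grounding module
clip_model, clip_preprocess = clip.load("ViT-B/32", device=device)

# GroundingDINO-T, RAM (Swin-Large, 384 input) and the DEVA tracker are loaded
# from their respective codebases; the exact loader calls are omitted here.
```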
The Vicuna-13b-v1.5 model is used for video-based conversational benchmarking, zero-shot question-answering evaluation, and extracting the key noun or referring expression from the model output in the quantitative evaluation of the spatial grounding task. Further, Vicuna-13b-v1.5 was used to implement entity matching as in [49].
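A minimal sketch of how Vicuna-13b-v1.5 could be prompted to extract the key noun or referring expression from a model answer is shown below; the Hugging Face checkpoint ID is public, but the prompt wording and the `extract_key_noun` helper are assumptions for illustration, not the exact prompt used in the paper.

```python
# Hypothetical sketch: prompting Vicuna-13b-v1.5 (via Hugging Face Transformers)
# to extract the key noun / referring expression from a model answer.
# The prompt text below is an assumption, not the paper's exact prompt.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL_ID = "lmsys/vicuna-13b-v1.5"  # public checkpoint on the Hugging Face Hub

tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)
model = AutoModelForCausalLM.from_pretrained(
    MODEL_ID, torch_dtype=torch.float16, device_map="auto"
)

def extract_key_noun(answer: str) -> str:
    """Ask Vicuna for the single noun phrase the answer refers to."""
    prompt = (
        "USER: From the following answer, return only the key noun or "
        f"referring expression, nothing else.\nAnswer: {answer}\nASSISTANT:"
    )
    inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
    output_ids = model.generate(**inputs, max_new_tokens=16, do_sample=False)
    completion = output_ids[0][inputs["input_ids"].shape[1]:]
    return tokenizer.decode(completion, skip_special_tokens=True).strip()

print(extract_key_noun("The person in the red jacket is skiing down the slope."))
```

The same prompting pattern would apply to the entity-matching step, with the instruction asking whether two noun phrases refer to the same entity.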
This paper is available on arxiv under CC BY 4.0 DEED license.