
UAE Researchers Teach AI to Watch, Listen, and Understand Videos Like Humans


Too Long; Didn't Read

Researchers in the UAE have developed an AI model that can find and focus on objects in videos, outperforming other models at this task.

Authors:

(1) Shehan Munasinghe, Mohamed bin Zayed University of AI (Equal Contribution);

(2) Rusiru Thushara, Mohamed bin Zayed University of AI (Equal Contribution);

(3) Muhammad Maaz, Mohamed bin Zayed University of AI;

(4) Hanoona Abdul Rasheed, Mohamed bin Zayed University of AI;

(5) Salman Khan, Mohamed bin Zayed University of AI and Australian National University;

(6) Mubarak Shah, University of Central Florida;

(7) Fahad Khan, Mohamed bin Zayed University of AI and Linköping University.

Editor's Note: This is Part 5 of 10 of a study detailing the development of a smarter AI model for videos. Read the rest below.


Supplementary Material

4.1. Implementation Details


For audio transcript extraction, the base Whisper model is used. Our grounding module builds on the GroundingDINO-T variant and CLIP ViT-B/32. For the image-tagging model, we use the RAM Swin-Large variant (with input size 384). The DEVA tracker is applied in the online setting in our experiments.
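As a rough illustration, the sketch below shows how the two off-the-shelf components with stable Python APIs (the base Whisper model and CLIP ViT-B/32) are typically loaded; this is not the authors' released code, the file name is a placeholder, and GroundingDINO-T, RAM, and DEVA are loaded from their own repositories and omitted here.

```python
import torch
import whisper   # openai-whisper package
import clip      # openai/CLIP package

device = "cuda" if torch.cuda.is_available() else "cpu"

# Audio transcript extraction with the base Whisper model
asr = whisper.load_model("base", device=device)
transcript = asr.transcribe("video_audio.wav")["text"]  # placeholder file

# CLIP ViT-B/32, used alongside GroundingDINO-T in the grounding module
clip_model, clip_preprocess = clip.load("ViT-B/32", device=device)
```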


The Vicuna-13b-v1.5 model is used for video-based conversational benchmarking, for zero-shot question-answering evaluation, and for extracting the key noun or referring expression from the model output in the quantitative evaluation of the spatial grounding task. Vicuna-13b-v1.5 is also used to implement the entity matching as in [49].
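To illustrate the extraction step, the hedged sketch below loads the public lmsys/vicuna-13b-v1.5 checkpoint through Hugging Face transformers and prompts it to pull the key noun phrase out of a model answer. The prompt wording and the example answer are assumptions for illustration only, not the prompt used in the paper.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "lmsys/vicuna-13b-v1.5"  # public checkpoint on the Hugging Face Hub
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id, torch_dtype=torch.float16, device_map="auto"
)

answer = "The man in the red jacket is skiing down the slope."  # illustrative
# Illustrative prompt in Vicuna's USER/ASSISTANT chat format; not the paper's prompt.
prompt = (
    "USER: Extract the single key noun phrase (referring expression) that the "
    f"following sentence is about:\n\"{answer}\"\nASSISTANT:"
)

inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
output_ids = model.generate(**inputs, max_new_tokens=20, do_sample=False)
# Decode only the newly generated tokens after the prompt
extracted = tokenizer.decode(
    output_ids[0][inputs["input_ids"].shape[1]:], skip_special_tokens=True
)
print(extracted)  # e.g. "the man in the red jacket"
```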


This paper is available on arXiv under the CC BY 4.0 DEED license.