Authors:
(1) Shehan Munasinghe, Mohamed bin Zayed University of AI and Equal Contribution;
(2) Rusiru Thushara, Mohamed bin Zayed University of AI and Equal Contribution;
(3) Muhammad Maaz, Mohamed bin Zayed University of AI;
(4) Hanoona Abdul Rasheed, Mohamed bin Zayed University of AI;
(5) Salman Khan, Mohamed bin Zayed University of AI and Australian National University;
(6) Mubarak Shah, University of Central Florida;
(7) Fahad Khan, Mohamed bin Zayed University of AI and Linköping University.

Editor's Note: This is Part 2 of 10 of a study detailing the development of a smarter AI model for videos. Read the rest below.

Table of Links

Abstract and 1 Introduction
2. Related Works
3. PG-Video-LLaVA
3.1. Overview
3.2. Architecture
4. Experiments
4.1. Implementation Details
4.2. Stronger Baseline
4.3. Spatial Grounding in Videos
4.4. Zero-Shot Visual Question Answering
5. Conclusion and References

Supplementary Material

A. Audio Modality Integration
B. Visual Grounding: Quantitative Evaluation
C. Qualitative Results for Visual Grounding
D. Quantitative Evaluations of Video-based Conversation Performance

2. Related Works

Recent advancements in Large Multimodal Models (LMMs) [8, 18, 50] and Large Language Models (LLMs) [7, 26, 36] have significantly transformed the artificial intelligence landscape, particularly in natural language processing and multimodal tasks. These breakthroughs have enhanced machine learning models’ ability to understand and generate human-like text, while also enabling more effective integration of data types such as images, sounds, and videos with textual information. This progress represents a major leap in creating AI systems that can accurately interpret and interact with a diverse range of content.

Large Language Models (LLMs): The natural language processing (NLP) field has undergone a revolution with the advent of LLMs such as GPT [4], LLaMA [36], OPT [46], and MOSS [27], which are particularly noted for their zero-shot learning abilities and adaptability. The development of models like InstructGPT [28] and ChatGPT [26] has further propelled advancements in conversational AI and complex query handling, chiefly through instruction tuning. Within the LLaMA framework, the emergence of open-source models such as Alpaca [35] and Vicuna [7] exemplifies how instruction tuning can significantly boost model performance. This shift towards open-source initiatives in language modeling indicates a growing trend towards more accessible and collaborative approaches in the field. In this work, we build on the open-source Vicuna LLM and extend it with multimodal capabilities. We also propose an open-source, reproducible benchmark for video conversation and reasoning tasks that uses the Vicuna LLM to enable fair evaluation.
Large Multimodal Models (LMMs): The field of AI has witnessed significant advancements with the development of vision-language models like CLIP [30], renowned for their impressive zero-shot capabilities obtained by training on extensive image-text pairs. These models have proven effective in a variety of applications, from image detection and segmentation [3, 17] to more complex tasks such as 3D modeling and video analysis [23, 31, 33, 37]. The introduction of BLIP-2 marked a pivotal transition, pioneering the integration of image features encoded by a visual encoder with text embeddings and setting the stage for the evolution into Large Multimodal Models (LMMs). This advancement influenced subsequent models like LLaVA [19], InstructBLIP [8], and MiniGPT-4 [50], which further refined image-text feature alignment and instruction tuning. VideoChat [15], Video-ChatGPT [22], and VideoLLaMA [45] represent an extension of these LMMs, moving from image-based to video-based applications, while models such as Otter [14], mPLUG-Owl [42], LLaMA-Adapter [9], and InternGPT [21] continue to push the boundaries of multimodal interaction. Despite these significant strides, achieving robust visual grounding in LMMs remains a challenge and a key area for ongoing research. Further, effective integration of audio signals within LMMs for comprehensive video understanding is an open research question that this work aims to address.

Visual-Language Grounding: Grounded Large Language Models (LLMs) have made notable progress in enhancing visual and language comprehension. A diverse array of models, including Kosmos-2 [29], Ferret [43], All-Seeing Model [38], LISA [13], BuboGPT [49], Shikra [5], and GLaMM [32], have employed various methodologies to master complex grounding tasks. These models demonstrate proficiency in tasks like referring expression comprehension and image segmentation, showcasing the advanced image understanding capabilities of LLMs. Methodologically, Kosmos-2, Shikra, and All-Seeing focus predominantly on creating language-based context for visual grounding. In contrast, BuboGPT merges visual elements with language, and LISA leverages vision-language embeddings for producing segmentation masks. Furthermore, GLaMM is adept at generating natural language responses linked with object segmentation masks, facilitating detailed visual-textual interactions. However, challenges remain, such as LISA’s constrained performance in multi-object scenarios and the restriction of BuboGPT and GLaMM to image-based applications, with no extension to video processing. To this end, we introduce PG-Video-LLaVA, a video conversational model with pixel-level grounding capability. Further, PG-Video-LLaVA incorporates audio transcripts alongside visual and textual data, aiming to provide a more detailed understanding of video content.

This paper is available on arXiv under the CC BY 4.0 DEED license.

