This paper is available on arXiv under a CC 4.0 license.
Authors:
(1) Jiwan Chung, MIR Lab Yonsei University (https://jiwanchung.github.io/);
(2) Youngjae Yu, MIR Lab Yonsei University (https://jiwanchung.github.io/).
Movie Summarization
Movies are typical examples of long videos with clear narrative structures. Gorinski et al. [7] generate a shorter version of a screenplay by formulating summarization as finding an optimal chain of scenes in a graph. TRIPOD [23] is a screenplay dataset containing turning point annotations; the same work proposes an automatic model to identify turning points from movie narratives. Papalampidi et al. [24] later use the TV series CSI to demonstrate the usefulness of turning points in automatic movie summarization. Lee et al. [15] further improve turning point identification with dialogue features and a transformer architecture.
Long Video QA
The task of video question answering has been studied extensively in the literature, in the form of both open-ended QA [9] and multiple-choice problems [28, 29]. Several approaches have been proposed to address this task, ranging from RNN-based attention networks [9, 30, 36, 38] to memory networks [12, 22, 27] and transformers [4, 6]. Recently, multimodal models pre-trained on large-scale video datasets (VideoQA [31], VIOLET [5], MERLOT [33], and MERLOT-Reserve [34]) have also shown promising performance in video question answering.
However, long video QA has received relatively little attention despite its importance. MovieQA [27] formulates questions over entire movies, which typically span around two hours. DramaQA [3] uses a single TV series as the visual context and tasks a solver with understanding video clips ranging from one to twenty minutes in length.