
A Summarize-then-Search Method for Long Video Question Answering: Experiment Details

Too Long; Didn't Read

In this paper, researchers explore zero-shot long video question answering with GPT-3, leveraging narrative summaries and visual matching to outperform supervised models.

This paper is available on arXiv under a CC 4.0 license.

Authors:

(1) Jiwan Chung, MIR Lab, Yonsei University (https://jiwanchung.github.io/);

(2) Youngjae Yu, MIR Lab, Yonsei University (https://jiwanchung.github.io/).

A. Experiment Details

Computational Budget. Long Story Short uses GPT-3 (175B parameters), accessed via the OpenAI API, as its backbone. An average prompt for summarizing a video segment processes ∼3,000 tokens, while a QA prompt typically takes ∼4,000 tokens. For CLIPCheck, we extract CLIP features and compute cosine similarities on a single NVIDIA A6000 GPU; processing the video frames of the MovieQA validation split takes 0.5 hours.
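
As a rough illustration of the API usage described above, the sketch below issues a single summarization request through the legacy OpenAI Completions endpoint. This is a minimal sketch, assuming a hypothetical prompt template; the model name, decoding settings, and token limits are illustrative assumptions, not the authors' exact configuration.

```python
# A minimal sketch, assuming the legacy `openai` Python package (<1.0) and a
# hypothetical prompt template; model name and max_tokens are illustrative only.
import os
import openai

openai.api_key = os.environ["OPENAI_API_KEY"]

def summarize_segment(subtitles: str) -> str:
    """Ask GPT-3 to condense one video segment into a plot piece."""
    prompt = (
        "Summarize the following video segment into one plot sentence.\n\n"
        f"Subtitles:\n{subtitles}\n\nSummary:"
    )
    response = openai.Completion.create(
        model="text-davinci-002",  # a 175B GPT-3 variant; the exact engine is an assumption
        prompt=prompt,             # an average summarization prompt spans ~3,000 tokens
        max_tokens=128,
        temperature=0.0,           # deterministic decoding, matching the paper's setup
    )
    return response["choices"][0]["text"].strip()
```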


Hyperparameters. All hyperparameters are pre-defined by analyzing a single training sample. For narrative search, we use a sentence-similarity threshold of α ≥ 0.5 to find plot pieces when GPT-3 does not output a single index. We use a binary entropy threshold of E′ ≥ 0.4 in CLIPCheck. We run each experiment only once, as our method is deterministic and not susceptible to randomness in initialization.
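
For concreteness, the sketch below shows one way the two thresholds could be applied. The function names and the direction of the entropy test are assumptions, since the paper only states the threshold values.

```python
import math

ALPHA = 0.5        # sentence-similarity threshold for fallback plot-piece retrieval
ENTROPY_MIN = 0.4  # binary entropy threshold E' used by CLIPCheck

def select_plot_pieces(similarities: list[float]) -> list[int]:
    """Fallback narrative search: keep plot pieces whose sentence similarity
    to the GPT-3 output is at least ALPHA (used when GPT-3 does not return
    a single index)."""
    return [i for i, s in enumerate(similarities) if s >= ALPHA]

def binary_entropy(p: float) -> float:
    """Entropy (in bits) of a Bernoulli distribution with success probability p."""
    if p <= 0.0 or p >= 1.0:
        return 0.0
    return -p * math.log2(p) - (1.0 - p) * math.log2(1.0 - p)

def clipcheck_applies(p_top_answer: float) -> bool:
    """Hypothetical gate: apply CLIPCheck only when the answer distribution is
    uncertain enough, i.e. its binary entropy reaches E' = 0.4. Whether the
    paper gates on high or low entropy is an assumption here."""
    return binary_entropy(p_top_answer) >= ENTROPY_MIN
```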


Video Segmentation Scheme. All datasets used in this paper provide predefined segment boundary annotations. Because we summarize each clip segmented along these predefined boundaries, every plot piece is aligned with a clip segment. Before applying LSS, we additionally filter out clip segments that (1) are too short, (2) have no aligned image frame, or (3) have no text context, so that every remaining clip segment can be retrieved via its plot summary.
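
A minimal sketch of this filtering step is shown below. The segment fields and the minimum-duration cutoff are hypothetical, since the paper does not specify the data layout or the exact "too short" criterion.

```python
from dataclasses import dataclass, field

@dataclass
class ClipSegment:
    """Hypothetical container for one predefined clip segment."""
    subtitles: str                                    # text context aligned with the segment
    frame_paths: list = field(default_factory=list)   # aligned image frames
    duration: float = 0.0                             # segment length in seconds

MIN_DURATION = 3.0  # "too short" cutoff; the actual value is not reported

def keep_segment(seg: ClipSegment) -> bool:
    """Drop segments that are too short, have no aligned frame, or have no text."""
    return (
        seg.duration >= MIN_DURATION
        and len(seg.frame_paths) > 0
        and len(seg.subtitles.strip()) > 0
    )

segments: list[ClipSegment] = []  # populated from the dataset boundary annotations (not shown)
filtered = [seg for seg in segments if keep_segment(seg)]
```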


External Libraries. We use the OpenAI API to access the GPT-3 language model. The CLIP features are computed with the Huggingface implementation (https://huggingface.co/docs/transformers/main/en/model_doc/clip).
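
The snippet below sketches how CLIP features and frame–answer cosine similarities might be computed with that Huggingface implementation. The checkpoint name and the frame/answer inputs are placeholders, not necessarily the authors' exact choices.

```python
# A minimal sketch using the Huggingface transformers CLIP implementation.
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")       # checkpoint is an assumption
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

@torch.no_grad()
def clip_similarity(frame: Image.Image, answer_candidates: list[str]) -> torch.Tensor:
    """Cosine similarity between one video frame and each answer candidate."""
    image_inputs = processor(images=frame, return_tensors="pt")
    text_inputs = processor(text=answer_candidates, return_tensors="pt", padding=True)
    image_emb = model.get_image_features(**image_inputs)
    text_emb = model.get_text_features(**text_inputs)
    # Normalize so that the dot product equals cosine similarity.
    image_emb = image_emb / image_emb.norm(dim=-1, keepdim=True)
    text_emb = text_emb / text_emb.norm(dim=-1, keepdim=True)
    return image_emb @ text_emb.T  # shape: (1, num_candidates)
```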