Zero-Shot Visual Question Answering with PVLMs

Written by memeology | Published 2024/04/26
Tech Story Tags: frozen-vision-language-models | zero-shot-learning | multimodal-analysis | hateful-meme-detection | probing-based-captioning | fine-tuning-models | blip-2-model | vqa-techniques

TLDR: This section defines the task of zero-shot visual question answering (VQA) and reviews the use of pre-trained vision-language models (PVLMs) such as BLIP-2, highlighting the Querying Transformer component that bridges the modality gap in cross-modal understanding.

Authors:

(1) Rui Cao, Singapore Management University;

(2) Ming Shan Hee, Singapore University of Design and Technology;

(3) Adriel Kuek, DSO National Laboratories;

(4) Wen-Haw Chong, Singapore Management University;

(5) Roy Ka-Wei Lee, Singapore University of Design and Technology;

(6) Jing Jiang, Singapore Management University.

Table of Links

Abstract and Introduction

Related Work

Preliminary

Proposed Method

Experiment

Conclusion and References

Appendix

3 PRELIMINARY

We formally define our task and briefly review the use of pre-trained vision-language models (PVLMs) for zero-shot visual question answering (VQA). We then introduce the specific PVLM used in our work.

In this work, we use the recently released BLIP-2 model [15] as the PVLM, as it has demonstrated strong performance in zero-shot VQA. The BLIP-2 model is composed of a frozen pre-trained image encoder, a frozen pre-trained language model, and a lightweight Querying Transformer (Q-Former) that bridges the modality gap between the two frozen components. It is worth noting that BLIP-2 can be replaced with any other PVLM capable of zero-shot VQA.
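To make the setup concrete, the sketch below shows how a frozen BLIP-2 model can be queried zero-shot with a natural-language question about an image. It assumes the Hugging Face Transformers implementation of BLIP-2; the checkpoint name, the image path, and the question prompt are illustrative choices, not taken from the paper.

```python
import torch
from PIL import Image
from transformers import Blip2Processor, Blip2ForConditionalGeneration

device = "cuda" if torch.cuda.is_available() else "cpu"

# Load a BLIP-2 checkpoint (frozen image encoder + Q-Former + frozen LM).
# "Salesforce/blip2-flan-t5-xl" is one publicly available checkpoint; any
# BLIP-2 variant capable of zero-shot VQA could be substituted.
processor = Blip2Processor.from_pretrained("Salesforce/blip2-flan-t5-xl")
model = Blip2ForConditionalGeneration.from_pretrained(
    "Salesforce/blip2-flan-t5-xl",
    torch_dtype=torch.float16 if device == "cuda" else torch.float32,
).to(device)

# Illustrative inputs: a meme image and a probing question in the
# "Question: ... Answer:" prompt format used for zero-shot VQA.
image = Image.open("meme.jpg").convert("RGB")
prompt = "Question: who is shown in the image? Answer:"

inputs = processor(images=image, text=prompt, return_tensors="pt").to(device, model.dtype)
generated_ids = model.generate(**inputs, max_new_tokens=20)
answer = processor.decode(generated_ids[0], skip_special_tokens=True).strip()
print(answer)
```

No parameters are updated in this loop: the image encoder and language model stay frozen, and the answer is produced purely by generation, which is what makes the querying zero-shot.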

This paper is available on arXiv under a CC 4.0 license.
