Zero-Shot Visual Question Answering with PVLMs
by @memeology

Too Long; Didn't Read

This section defines the task of zero-shot visual question answering (VQA) and reviews the use of pre-trained vision-language models (PVLMs) for it. It then introduces BLIP-2, the PVLM used in this work, whose lightweight Querying Transformer bridges the modality gap between a frozen image encoder and a frozen language model.


Authors:

(1) Rui Cao, Singapore Management University;

(2) Ming Shan Hee, Singapore University of Technology and Design;

(3) Adriel Kuek, DSO National Laboratories;

(4) Wen-Haw Chong, Singapore Management University;

(5) Roy Ka-Wei Lee, Singapore University of Technology and Design;

(6) Jing Jiang, Singapore Management University.

Abstract and Introduction

Related Work

Preliminary

Proposed Method

Experiment

Conclusion and References

Appendix

3 PRELIMINARY

We formally define our task and briefly review the use of pre-trained vision-language models (PVLMs) for zero-shot visual question answering (VQA). At the end of the section, we provide a brief introduction to the specific PVLM utilized in our work.



In this work, we use the recently released BLIP-2 model [15] as the PVLM, as it has demonstrated good performance in zero-shot VQA. BLIP-2 is composed of a frozen pre-trained image encoder, a frozen pre-trained language model, and a lightweight Querying Transformer that bridges the modality gap between the two. Note that BLIP-2 can be replaced with any other PVLM capable of zero-shot VQA.
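To make the zero-shot VQA setting concrete, here is a minimal sketch of querying a frozen BLIP-2 model through the Hugging Face transformers library. It is an illustration, not the authors' code. The checkpoint name, the "Question: ... Answer:" prompt template, the generation length, and the helper names build_vqa_prompt and answer_question are all assumptions; the paper section above does not specify them.

```python
from typing import Any

# Assumed checkpoint on the Hugging Face Hub; the text above does not say
# which BLIP-2 variant was used.
DEFAULT_CHECKPOINT = "Salesforce/blip2-flan-t5-xl"


def build_vqa_prompt(question: str) -> str:
    """Wrap a question in a "Question: ... Answer:" template (an assumption)."""
    return f"Question: {question.strip()} Answer:"


def answer_question(image: Any, question: str,
                    checkpoint: str = DEFAULT_CHECKPOINT,
                    max_new_tokens: int = 20) -> str:
    """Ask a frozen BLIP-2 model one question about one PIL image."""
    # Imported lazily so the prompt helper works without these packages.
    import torch
    from transformers import Blip2ForConditionalGeneration, Blip2Processor

    processor = Blip2Processor.from_pretrained(checkpoint)
    model = Blip2ForConditionalGeneration.from_pretrained(checkpoint)
    model.eval()  # zero-shot: inference only, no parameter is updated

    inputs = processor(images=image, text=build_vqa_prompt(question),
                       return_tensors="pt")
    with torch.no_grad():
        output_ids = model.generate(**inputs, max_new_tokens=max_new_tokens)
    # Decode the generated token ids into the free-text answer.
    return processor.batch_decode(output_ids, skip_special_tokens=True)[0].strip()
```

With Pillow installed, a call such as answer_question(Image.open("meme.png").convert("RGB"), "what is shown in the image?") would return a short free-text answer; the file name is hypothetical. Because the model is reached only through answer_question, swapping in another PVLM capable of zero-shot VQA would mean changing that one function.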


This paper is available on arXiv under a CC 4.0 license.