Authors:
(1) Rui Cao, Singapore Management University;
(2) Ming Shan Hee, Singapore University of Technology and Design;
(3) Adriel Kuek, DSO National Laboratories;
(4) Wen-Haw Chong, Singapore Management University;
(5) Roy Ka-Wei Lee, Singapore University of Technology and Design;
(6) Jing Jiang, Singapore Management University.
We formally define our task and briefly review how pre-trained vision-language models (PVLMs) are used for zero-shot visual question answering (VQA). At the end of the section, we introduce the specific PVLM adopted in our work.
In this work, we use the recently released BLIP-2 model [15] as the PVLM, as it has demonstrated strong zero-shot VQA performance. BLIP-2 consists of a frozen pre-trained image encoder, a frozen pre-trained language model, and a lightweight Querying Transformer (Q-Former) that bridges the modality gap between the two frozen components. Notably, BLIP-2 can be replaced with any other PVLM capable of zero-shot VQA.
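To make the setup concrete, the snippet below is a minimal sketch of zero-shot VQA with BLIP-2 through the Hugging Face transformers implementation, which is one possible interface to the model rather than the paper's exact pipeline; the checkpoint name, image path, and question prompt are illustrative placeholders.

```python
# Minimal sketch: zero-shot VQA with BLIP-2 via Hugging Face transformers.
# The checkpoint, image path, and prompt are placeholders for illustration.
import torch
from PIL import Image
from transformers import Blip2Processor, Blip2ForConditionalGeneration

device = "cuda" if torch.cuda.is_available() else "cpu"
dtype = torch.float16 if device == "cuda" else torch.float32

# Both the image encoder and the language model remain frozen; only the
# lightweight Q-Former was trained to bridge the two modalities.
processor = Blip2Processor.from_pretrained("Salesforce/blip2-flan-t5-xl")
model = Blip2ForConditionalGeneration.from_pretrained(
    "Salesforce/blip2-flan-t5-xl", torch_dtype=dtype
).to(device)

image = Image.open("example_image.png").convert("RGB")  # placeholder path
prompt = "Question: What is shown in the image? Answer:"  # illustrative prompt

inputs = processor(images=image, text=prompt, return_tensors="pt").to(device, dtype)
generated_ids = model.generate(**inputs, max_new_tokens=20)
answer = processor.batch_decode(generated_ids, skip_special_tokens=True)[0].strip()
print(answer)
```

Because the question is supplied purely as a text prompt at inference time, no task-specific fine-tuning is involved, which is what makes this usage zero-shot.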
This paper is available on arxiv under CC 4.0 license.