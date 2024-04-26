



Authors: (1) Rui Cao, Singapore Management University; (2) Ming Shan Hee, Singapore University of Technology and Design; (3) Adriel Kuek, DSO National Laboratories; (4) Wen-Haw Chong, Singapore Management University; (5) Roy Ka-Wei Lee, Singapore University of Technology and Design; (6) Jing Jiang, Singapore Management University.

Abstract and Introduction

Related Work

Preliminary

Proposed Method

Experiment

Conclusion and References

Appendix

3 PRELIMINARY

We formally define our task and briefly review how pre-trained vision-language models (PVLMs) are used for zero-shot visual question answering (VQA). We close the section with a brief introduction to the specific PVLM used in our work.









In this work, we use the recently released BLIP-2 model [15] as the PVLM because it has demonstrated good performance in zero-shot VQA. BLIP-2 consists of a frozen pre-trained image encoder, a frozen pre-trained language model, and a lightweight Querying Transformer that bridges the modality gap between the two. BLIP-2 can be replaced with any other PVLM that is capable of zero-shot VQA.
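To make the data flow and the "replaceable PVLM" point concrete, the following is a minimal sketch in Python, not the BLIP-2 implementation. Every name in it (ZeroShotVQAModel, ToyBlip2Like, the three toy component functions) is hypothetical, the components are trivial stand-ins rather than neural networks, and the "Question: ... Answer:" prompt template is an assumption made for illustration.

```python
from dataclasses import dataclass
from typing import Callable, List, Protocol, Sequence


class ZeroShotVQAModel(Protocol):
    """Hypothetical interface: any PVLM that can answer a question about an image."""

    def answer(self, image: Sequence[float], question: str) -> str: ...


@dataclass
class ToyBlip2Like:
    """Toy stand-in mirroring the three components described in the text.

    The callables are NOT the real BLIP-2 modules; they only illustrate
    the flow image -> visual features -> query outputs -> generated text.
    """

    image_encoder: Callable[[Sequence[float]], List[float]]  # frozen in BLIP-2
    q_former: Callable[[List[float]], List[float]]           # lightweight bridge
    language_model: Callable[[List[float], str], str]        # frozen in BLIP-2

    def answer(self, image: Sequence[float], question: str) -> str:
        features = self.image_encoder(image)
        visual_prompt = self.q_former(features)
        # Assumed prompt template, used here only for illustration.
        text_prompt = f"Question: {question} Answer:"
        return self.language_model(visual_prompt, text_prompt)


def toy_image_encoder(image: Sequence[float]) -> List[float]:
    # Stand-in "feature": mean pixel intensity.
    return [sum(image) / len(image)]


def toy_q_former(features: List[float]) -> List[float]:
    # Stand-in bridge: passes the features through unchanged.
    return list(features)


def toy_language_model(visual_prompt: List[float], text_prompt: str) -> str:
    # Stand-in decoder: answers "yes" for bright images, "no" otherwise.
    return "yes" if visual_prompt[0] > 0.5 else "no"


def ask(model: ZeroShotVQAModel, image: Sequence[float], question: str) -> str:
    # Depends only on the interface, so any compatible PVLM can be passed in.
    return model.answer(image, question)


toy_model = ToyBlip2Like(toy_image_encoder, toy_q_former, toy_language_model)
bright_answer = ask(toy_model, [0.9, 0.8], "Is the image bright?")
dark_answer = ask(toy_model, [0.1, 0.2], "Is the image bright?")
```

Because ask relies only on the answer method, swapping in a different PVLM amounts to passing another object that satisfies the same interface; nothing else in the calling code changes.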





