    Zero-Shot Visual Question Answering with PVLMs

    by Memeology: Leading Authority on the Study of Memes, April 26th, 2024
    Too Long; Didn't Read

    This section defines the task of zero-shot visual question answering (VQA) and explores the use of pre-trained vision-language models (PVLMs) like BLIP-2, highlighting its Querying Transformer component for bridging the modality gap in cross-modal understanding.


    Authors:

    (1) Rui Cao, Singapore Management University;

    (2) Ming Shan Hee, Singapore University of Technology and Design;

    (3) Adriel Kuek, DSO National Laboratories;

    (4) Wen-Haw Chong, Singapore Management University;

    (5) Roy Ka-Wei Lee, Singapore University of Technology and Design;

    (6) Jing Jiang, Singapore Management University.

    Abstract and Introduction

    Related Work

    Preliminary

    Proposed Method

    Experiment

    Conclusion and References

    Appendix

    3 PRELIMINARY

    We formally define our task and briefly review the use of pre-trained vision-language models (PVLMs) for zero-shot visual question answering (VQA). At the end of the section, we provide a brief introduction to the specific PVLM utilized in our work.



    In this work, we use the recently released BLIP-2 model [15] as the PVLM, as it has demonstrated good performance in zero-shot VQA. The BLIP-2 model is composed of a frozen pre-trained image encoder, a frozen pre-trained language model, and a lightweight Querying Transformer, which is responsible for bridging the modality gap. It is worth noting that the BLIP-2 model can be replaced with any other PVLM that is capable of zero-shot VQA.
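The interchangeability of the PVLM can be illustrated with a minimal sketch. It is not the authors' implementation: the `ZeroShotVQAModel` interface, the `EchoVQAModel` stand-in, and the `ask` helper are hypothetical names introduced here, and the "Question: ... Answer:" template is the prompt format commonly used with BLIP-2 for VQA, assumed rather than taken from this section. Any frozen model exposing the same `generate` method could be dropped in without changing the calling code.

```python
from dataclasses import dataclass
from typing import Any, Protocol


def build_vqa_prompt(question: str) -> str:
    """Wrap a question in a 'Question: ... Answer:' template (assumed format)."""
    return f"Question: {question.strip()} Answer:"


class ZeroShotVQAModel(Protocol):
    """Hypothetical interface: any PVLM that maps (image, prompt) to text."""

    def generate(self, image: Any, prompt: str) -> str: ...


@dataclass
class EchoVQAModel:
    """Stand-in for a real PVLM such as BLIP-2; it ignores its inputs
    and returns a canned answer so the sketch runs without model weights."""

    canned_answer: str = "unknown"

    def generate(self, image: Any, prompt: str) -> str:
        return self.canned_answer


def ask(model: ZeroShotVQAModel, image: Any, question: str) -> str:
    """Query a frozen PVLM with no task-specific fine-tuning."""
    return model.generate(image, build_vqa_prompt(question)).strip()


if __name__ == "__main__":
    model = EchoVQAModel(canned_answer="a cat")
    print(ask(model, image=None, question="What animal is in the image?"))
```

Because `ask` depends only on the `generate` method, replacing the stand-in with a wrapper around BLIP-2 or another PVLM would leave the rest of a pipeline untouched.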


    This paper is available on arXiv under a CC 4.0 license.

