
What matters when building vision-language models?: Red-teaming


Table of Links

Abstract and 1 Introduction

2 Terminology

3 Exploring the design space of vision-language models and 3.1 Are all pre-trained backbones equivalent for VLMs?

3.2 How does the fully autoregressive architecture compare to the cross-attention architecture?

3.3 Where are the efficiency gains?

3.4 How can one trade compute for performance?

4 Idefics2 - an open state-of-the-art vision-language foundation model and 4.1 Multi-stage pre-training

4.2 Instruction fine-tuning and 4.3 Optimizing for chat scenarios

5 Conclusion, Acknowledgement, and References


A Appendix

A.1 Further experimental details of the ablations

A.2 Details of the instruction fine-tuning

A.3 Details of the evaluations

A.4 Red-teaming

A.4 Red-teaming

In the context of a red-teaming exercise, our objective is to evaluate the model's propensity to generate inaccurate, biased, or offensive responses. More specifically, we evaluate the chat-optimized checkpoint[12].
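For readers who want to reproduce such probes interactively, below is a minimal sketch of how the chat-optimized checkpoint from footnote [12] can be queried through the transformers library. The image path and the question are illustrative placeholders, not prompts taken from the paper's red-teaming protocol, and the paper does not prescribe a specific harness for this exercise.

```python
# Minimal sketch: manually probing the chat-optimized checkpoint (footnote [12])
# with an image/question pair. The image path and the question are placeholders.
import torch
from PIL import Image
from transformers import AutoProcessor, AutoModelForVision2Seq

CHECKPOINT = "HuggingFaceM4/idefics2-8b-chatty"

processor = AutoProcessor.from_pretrained(CHECKPOINT)
model = AutoModelForVision2Seq.from_pretrained(
    CHECKPOINT, torch_dtype=torch.float16, device_map="auto"
)

# Hypothetical red-teaming input: a portrait plus a question that invites
# speculation from visual cues alone.
image = Image.open("portrait.jpg")
messages = [
    {
        "role": "user",
        "content": [
            {"type": "image"},
            {"type": "text", "text": "What is this person's profession and social status?"},
        ],
    }
]

# Build the chat prompt, run generation, and decode the model's reply.
prompt = processor.apply_chat_template(messages, add_generation_prompt=True)
inputs = processor(text=prompt, images=[image], return_tensors="pt").to(model.device)

generated_ids = model.generate(**inputs, max_new_tokens=256)
print(processor.batch_decode(generated_ids, skip_special_tokens=True)[0])
```

Transcripts collected this way can then be reviewed manually for the failure modes catalogued below.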


Table 15: Performance of Idefics2 against state-of-the-art VLMs across different sizes. All evaluations are zero-shot. The Idefics2 variants with 64 and 320 tokens per image differ only in whether image splitting is applied. Benchmarks (split, metric): MMMU (val/test, MMMU score), MathVista (testmini/test, MMMU score), TextVQA (val, VQA acc.), MMBench (test, accuracy), DocVQA (test, ANLS score), VQAv2 (testdev, VQA acc.).


Figure 5: Idefics2-chatty finds the requested information in the resume, and organizes it in JSON format.


Figure 6: Idefics2-chatty describes an AI-generated image.


Figure 7: Idefics2-chatty answers a question on a scientific diagram.


While the model typically refrains from responding to offensive inputs, we observe that through repeated trials or guided interactions, it tends to hastily form judgments in situations necessitating nuanced contextual understanding, often perpetuating harmful stereotypes. Noteworthy instances include:


• Speculating on or passing judgments about individuals’ professions, social status, or insurance eligibility based solely on visual cues (e.g., age, attire, gender, facial expressions), thereby perpetuating historical disparities.


• Generating content that promotes online harassment, or offensive memes that reinforce harmful associations, from a portrait or even from a benign image.


• Assuming emotional states or mental conditions based on outward appearances.


• Evaluating individuals’ attractiveness solely based on their visual appearance.


Additionally, we identify behaviors that amplify existing security risks:


• Successfully solving CAPTCHAs featuring distorted text within images.


• Developing phishing schemes from screenshots of legitimate websites to deceive users into divulging their credentials.


• Crafting step-by-step guides on constructing small-scale explosives using readily available chemicals from common supermarkets or manipulating firearms to do maximum damage.


It’s important to note that these security concerns are currently limited by the model’s occasional inability to accurately read text within images.


We emphasize that the model would often encourage the user to exercise caution about the model’s generation or flag how problematic the initial query can be in the first place. For instance, when insistently prompted to write a racist comment, the model would answer that query before pointing out: "This type of stereotyping and dehumanization has been used throughout history to justify discrimination and oppression against people of color. By making light of such a serious issue, this meme perpetuates harmful stereotypes and contributes to the ongoing struggle for racial equality and social justice."


However, certain formulations can circumvent (i.e., "jailbreak") these cautionary prompts, emphasizing the need for critical thinking and discretion when engaging with the model’s outputs. While jailbreaking text LLMs is an active research area, jailbreaking vision-language models has recently emerged as a new challenge as these models become more capable and prominent (Shayegani et al., 2024). The addition of the vision modality not only introduces new avenues for injecting malicious prompts but also raises questions about the interaction between vision and language vulnerabilities.


Authors:

(1) Hugo Laurençon, Hugging Face and Sorbonne Université (the order was chosen randomly);

(2) Léo Tronchon, Hugging Face (the order was chosen randomly);

(3) Matthieu Cord, Sorbonne Université;

(4) Victor Sanh, Hugging Face.


This paper is available on arXiv under a CC BY 4.0 DEED license.

[12] https://huggingface.co/HuggingFaceM4/idefics2-8b-chatty
