Researchers Combine GPT-4 and Human Experts to Train AI on Visual Figurative Reasoning

Written by textmodels | Published 2025/06/18
Tech Story Tags: vision-language-models | figurative-comprehension | multimodal-entailment | visual-metaphors | explainable-ai | textual-explanations | human-ai-collaboration | figurative-language-dataset

TL;DR: A new paper looks at how well large AI models handle figurative language.

Authors:

(1) Arkadiy Saakyan, Columbia University;

(2) Shreyas Kulkarni, Columbia University;

(3) Tuhin Chakrabarty, Columbia University;

(4) Smaranda Muresan, Columbia University.

Editor's note: this is part 3 of 6 of a study looking at how well large AI models handle figurative language. Read the rest below.

3 V-FLUTE Task and Dataset

To build V-FLUTE, we start with existing multimodal figurative datasets and use human-AI collaboration frameworks with expert annotators (Chakrabarty et al., 2022; Wiegreffe et al., 2022; Liu et al., 2022) to transform them into a high-quality, explainable visual entailment benchmark. These datasets cover particular phenomena such as metaphors, similes, idioms, sarcasm, or humor. Each instance includes an image and a caption, and the figurative phenomenon can be in the image, the caption, or both. We transform each dataset into a unified format for explainable visual entailment. An overview of the dataset and our contributions can be found in Table 1, and examples from each dataset are shown in Table 2. Below, we describe the construction of V-FLUTE for each figurative language type (metaphors and similes, idioms, sarcasm, and humor).
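As a rough illustration of what a unified format for explainable visual entailment could look like, the sketch below models one instance as a small record. The class name, field names, file path, and example text are all hypothetical and chosen for illustration; the section does not specify the actual schema.

```python
from dataclasses import dataclass
from typing import Literal

# The two relationships described in this section.
Label = Literal["entailment", "contradiction"]


@dataclass(frozen=True)
class VFluteInstance:
    """One explainable visual entailment example (illustrative field names)."""

    image_path: str   # premise: the image
    claim: str        # hypothesis: the text paired with the image
    label: Label      # whether the image entails or contradicts the claim
    explanation: str  # textual justification of the label
    source: str       # originating dataset, e.g. "HAIVMet" or "IRFL"
    phenomenon: str   # e.g. "metaphor", "idiom", "sarcasm", "humor"


# A made-up example; the path and the text are not taken from the dataset.
example = VFluteInstance(
    image_path="images/example.png",
    claim="The office was a circus that day.",
    label="entailment",
    explanation="The image shows chaotic activity, matching the circus metaphor.",
    source="HAIVMet",
    phenomenon="metaphor",
)
print(example.label, example.source)
```

Keeping the label and the explanation in the same record reflects the idea that every prediction in the benchmark is paired with a justification.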

3.1 Metaphors and Similes

Metaphors and similes are powerful rhetorical devices that can be expressed either in text or visually in an image. Visual metaphors are used as persuasive devices in various fields such as advertising (Forceville, 2002; Scott, 1994). To create visual entailment instances containing metaphors and similes in V-FLUTE, we rely on two existing resources: HAIVMet (Chakrabarty et al., 2023) and IRFL (Yosef et al., 2023). Instances taken from HAIVMet contain the metaphor/simile as a part of the premise (image), while those taken from IRFL have the metaphor/simile as a part of the hypothesis (text).

3.1.1 HAIVMet as Data Source

The HAIVMet (Chakrabarty et al., 2023) data consists of 1,193 images of visual metaphors spanning 958 distinct linguistic metaphors. Each image is associated with a claim that the image either entails or contradicts. In addition, each image is associated with a visual elaboration that presents a textual description of the image (see Figure 2). This visual elaboration was used in the original paper to generate the visual metaphors (images).

Generating Textual Explanations. We augment the dataset with candidate textual explanations. We prompt ChatGPT (gpt-3.5-0914) to generate an explanation for every tuple (see Figure 2 and the prompt in Appendix D.1.1).

Expert Verification. Each claim is paired with up to 5 images. However, since these images were automatically generated with DALLE-2 using the visual elaborations, not all are completely faithful. Moreover, some claims and labels were inconsistent. Finally, LLM-generated candidate explanations are not always correct and require refining. To tackle these issues, we employ an expert verification process involving three expert annotators with significant experience in figurative language and visual metaphor understanding. Since each claim can be paired with more than one visual metaphor, we ask annotators to select the visual metaphor most faithful to the linguistic metaphor and visual elaboration (see Image Selection in Figure 2), or to select none in the rare case when none of the visual metaphors are of good quality. As a part of the same annotation round, we also ask them to verify the explanation and edit it if necessary to ensure correctness and high quality. After strict quality control, we have 857 instances.

3.1.2 IRFL as Data Source

The IRFL dataset (Yosef et al., 2023) contains 1,440 figurative expressions, each associated with 4 distinct images. One of those images represents the figurative expression (see Figure 3), and the other 3 act as distractors.

Image Selection. We automatically select images using CLIP (Radford et al., 2021). To create a challenging contradiction instance, we select the distractor image that has the highest CLIPScore (clip-vit-base-patch16) with the corresponding entailing image (in Figure 3, for example, an unrelated image of a house is discarded when selecting the contradiction instance).
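The selection rule can be sketched as picking the distractor whose embedding is closest to that of the entailing image. The sketch below assumes the image embeddings have already been computed by some CLIP model and uses plain cosine similarity as a stand-in for the score; the function names and the toy vectors are hypothetical, and the exact scoring used by the authors may differ.

```python
import math
from typing import Sequence


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity between two equal-length, non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        raise ValueError("embeddings must be non-zero")
    return dot / (norm_a * norm_b)


def pick_hardest_distractor(
    entailing_embedding: Sequence[float],
    distractor_embeddings: Sequence[Sequence[float]],
) -> int:
    """Return the index of the distractor most similar to the entailing image."""
    scores = [
        cosine_similarity(entailing_embedding, d) for d in distractor_embeddings
    ]
    return max(range(len(scores)), key=scores.__getitem__)


# Toy 3-dimensional "embeddings" (real CLIP embeddings are much larger).
entailing = [1.0, 0.0, 0.0]
distractors = [
    [0.0, 1.0, 0.0],  # unrelated image
    [0.9, 0.1, 0.0],  # visually close to the entailing image
    [0.5, 0.5, 0.5],  # somewhat related
]
hardest = pick_hardest_distractor(entailing, distractors)
print(hardest)
```

Choosing the most similar distractor makes the contradiction instance harder, because the model cannot rely on coarse visual differences alone.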

Generating Textual Explanations. We prompt GPT-4 (gpt-4-vision-preview) with the ground truth label, claim, and the image to explain the relationship between the image and the claim.

Expert Verification. We recruit the same three expert annotators from the HAIVMet annotations and ask them to verify that the explanation is adequate and to edit it when necessary. We also ask the annotators to discard rare noisy instances where the claim, image, and label do not fit. After strict quality control, we are left with 1,149 instances.

3.2 Idioms

The IRFL dataset contains idioms in addition to metaphors and similes. An identical procedure to the one described in Section 3.1.2 was used for generating V-FLUTE instances for idioms (370 examples).

3.3 Sarcasm

To create visual entailment instances containing sarcasm, we rely on the MuSE data (Desai et al., 2022). Similarly to IRFL, instances from MuSE data contain sarcasm in the hypothesis (text).

3.3.1 MuSE as Data Source

The MuSE dataset (Desai et al., 2022) consists of 3,510 distinct images, the respective sarcastic claims that act as contradiction instances (see example in Figure 4), and crowdworker-written explanations justifying the contradiction.

Generating Entailment Claims. Since the dataset only contains sarcastic instances, there are no claims with an entailment relationship. We generate the entailing claims by prompting GPT-4 to generate a non-sarcastic version of the claim while maintaining the user-generated informal style of the text (see the generated entailment claim in Figure 4).

Generating Textual Explanations. While the dataset already contains crowdworker-written explanations, upon inspection they were often deemed to be of poor quality: formulaic and lacking detail (e.g., see the crowdworker explanation in Figure 4). To improve their quality, we give GPT-4 the dataset’s existing crowdworker explanations together with the claim and the label, and prompt it to rewrite them into high-quality candidate textual explanations (see the rewritten explanation in Figure 4). See the prompt in Appendix D.3.

Expert Verification. Each image is now paired with a GPT-4-generated entailing claim, an original contradicting claim, and their respective labels and explanations. The same three expert annotators check whether the generated explanations are adequate (i.e., complete, correct, and concise) and, if not, edit them. Experts are also instructed to discard noisy examples, e.g., when the image does not contradict the sarcastic claim. Through strict quality control, we obtain 1,042 instances.
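The pairing described above, where one image yields a contradiction instance from the original sarcastic claim and an entailment instance from the rewritten claim, might be represented as in the sketch below. The class, the function, and the example strings are hypothetical; in the actual pipeline the non-sarcastic claim and the explanations come from GPT-4 and expert editing, which are not modeled here.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Instance:
    image_id: str
    claim: str
    label: str  # "entailment" or "contradiction"
    explanation: str


def build_muse_pair(
    image_id: str,
    sarcastic_claim: str,
    literal_claim: str,
    contradiction_explanation: str,
    entailment_explanation: str,
) -> List[Instance]:
    """Turn one image into a contradiction instance and an entailment instance."""
    return [
        # The original sarcastic claim says the opposite of what the image shows.
        Instance(image_id, sarcastic_claim, "contradiction", contradiction_explanation),
        # The non-sarcastic rewrite states what the image actually shows.
        Instance(image_id, literal_claim, "entailment", entailment_explanation),
    ]


# Made-up text for illustration only.
pair = build_muse_pair(
    image_id="img_0001",
    sarcastic_claim="love waiting two hours for my train",
    literal_claim="hate waiting two hours for my train",
    contradiction_explanation="The image shows a long delay, which nobody enjoys.",
    entailment_explanation="The image shows a long delay, which is frustrating.",
)
print([instance.label for instance in pair])
```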

3.4 Humor

For multimodal humor, we rely on two datasets: MemeCap (Hwang and Shwartz, 2023) and New Yorker cartoons (Hessel et al., 2023).

3.4.1 MemeCap as Data Source

This dataset consists of memes along with their captions that describe the meme poster’s intent (see example in Figure 5). Memes frequently contain implicit, non-literal meaning (Lestari, 2019) and rely on visual metaphors (Piata, 2016), posing a challenge to VLMs.

Claim Generation. Since meme captions are not suited for an entailment task, we prompt GPT-4 with the caption to generate a claim from it (see example in Figure 5). We filter this set of samples further with GPT-4 by asking whether the image entails the claim and selecting only positive instances. In addition to generating claims that entail the meme, we generate counterclaims using GPT-4.
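The generate, filter, and counterclaim steps can be sketched as a small pipeline with pluggable functions. Everything named below is hypothetical: the three callables stand in for GPT-4 calls whose prompts are not reproduced here, and deriving the counterclaim from the generated claim and labeling it as a contradiction are assumptions rather than details stated in this section.

```python
from typing import Callable, List, Tuple

Instance = Tuple[str, str, str]  # (image_id, claim, label)


def build_meme_instances(
    image_id: str,
    caption: str,
    generate_claim: Callable[[str], str],
    image_entails: Callable[[str, str], bool],
    generate_counterclaim: Callable[[str], str],
) -> List[Instance]:
    """Generate a claim from the caption, keep it only if the image entails it,
    then add a counterclaim."""
    claim = generate_claim(caption)
    if not image_entails(image_id, claim):
        return []  # filtered out: the model judged the claim as not entailed
    counterclaim = generate_counterclaim(claim)
    return [
        (image_id, claim, "entailment"),
        (image_id, counterclaim, "contradiction"),  # assumed label
    ]


# Toy stand-ins so the sketch runs without any model access.
kept = build_meme_instances(
    "meme_001",
    "toy caption",
    generate_claim=lambda caption: f"claim from: {caption}",
    image_entails=lambda image_id, claim: True,
    generate_counterclaim=lambda claim: f"not ({claim})",
)
dropped = build_meme_instances(
    "meme_002",
    "toy caption",
    generate_claim=lambda caption: f"claim from: {caption}",
    image_entails=lambda image_id, claim: False,
    generate_counterclaim=lambda claim: f"not ({claim})",
)
print(len(kept), len(dropped))
```

Passing the model calls in as functions keeps the control flow testable on its own, independent of any particular model or API.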

Generating Textual Explanations. We prompt GPT-4 with the ground truth label to explain the relationship between the image and the claim. See prompts in Appendix D.4.

Expert Verification. We hire the same three expert annotators to ensure the correctness of the data. Each annotator is tasked with verifying that 1) the generated claim fits the image and 2) the explanation is correct and complete, and, if not, with making the necessary changes. We also ask them to discard samples with inappropriate content. After careful quality control, we have 1,958 instances.

3.4.2 NYCartoons as Data Source

The NYCartoons dataset (Hessel et al., 2023) contains 651 high-quality instances from the New Yorker Cartoon Caption Contest. Each instance consists of a humorous image paired with a caption and a natural language explanation justifying the implicit humor between the caption and the image. We use the existing data as is: the caption is treated as a claim that the humorous image entails, and it is paired with its explanation.

3.5 Dataset Statistics

We split our data into 4,578 training, 726 validation, and 723 testing instances. Detailed counts per phenomenon and dataset, as well as other statistics, are in Appendix A.
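As a quick consistency check, the per-source instance counts reported in Sections 3.1 through 3.4 can be summed and compared with the split sizes; both come to 6,027. The dictionary keys below are just labels for this sketch.

```python
# Instance counts reported after quality control in Sections 3.1 to 3.4.
per_source = {
    "HAIVMet": 857,
    "IRFL (metaphors and similes)": 1149,
    "IRFL (idioms)": 370,
    "MuSE": 1042,
    "MemeCap": 1958,
    "NYCartoons": 651,
}

# Split sizes reported in Section 3.5.
splits = {"train": 4578, "validation": 726, "test": 723}

total_by_source = sum(per_source.values())
total_by_split = sum(splits.values())
print(total_by_source, total_by_split)

# Share of each split, as a fraction of all instances.
for name, count in splits.items():
    print(f"{name}: {count / total_by_split:.1%}")
```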

This paper is available on arXiv under the CC BY 4.0 DEED license.


Published by HackerNoon on 2025/06/18