Authors:
(1) Vasiliki Kougia, University of Vienna, Faculty of Computer Science, Vienna, Austria & UniVie Doctoral School Computer Science, Vienna, Austria;
(2) Simon Fetze, University of Vienna, Faculty of Computer Science, Vienna, Austria;
(3) Thomas Kirchmair, University of Vienna, Faculty of Computer Science, Vienna, Austria;
(4) Erion Çano, University of Vienna, Faculty of Computer Science, Vienna, Austria;
(5) Sina Moayed Baharlou, Boston University, Department of Electrical and Computer Engineering, Boston, MA, USA;
(6) Sahand Sharifzadeh, Ludwig Maximilians University of Munich, Faculty of Computer Science, Munich, Germany;
(7) Benjamin Roth, University of Vienna, Faculty of Computer Science, Vienna, Austria.
Related Work
Combining text and image inputs is crucial for many tasks that rely on an image and its caption, e.g., image search or (visual) question answering. Research has shown that adding images to text-based tasks (e.g., machine translation) improves model performance [31]. However, the meaningful interpretation of text and image, and in particular of the relations between them, remains challenging [17, 5]. Commonly used approaches rely on Transformer models that are pre-trained on image+text pairs [18, 20, 6, 12, 27, 19].

A step towards better scene understanding is to generate scene graphs [15]. Scene graphs provide structured knowledge about an image, e.g., objects, relations, and attributes. Recent works have shown that scene graph generation can be improved by using message propagation between entities [32, 29], by employing background knowledge in the form of knowledge graphs [26] or texts [25], or by using feedback connections. In [26], the authors proposed Schemata, a scene graph generation model consisting of two parts: a backbone module and a relational reasoning component. Additionally, Schemata uses feedback connections to further encourage the propagation of higher-level, class-based knowledge to each neighbor. The backbone is pre-trained on ImageNet [8] and the whole network is fine-tuned on Visual Genome [15] for scene graph classification. Scene graphs can help achieve state-of-the-art results in several visual tasks [30, 22]. Inspired by these approaches, we use the Schemata model to generate scene graphs that represent the visual information contained in memes (see Section 3).
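To make this two-part structure concrete, the following is a minimal PyTorch sketch of a backbone plus relational reasoning scene graph classifier in the spirit of such models: an ImageNet pre-trained backbone extracts per-object features, a simplified message-passing step propagates information between objects, and two heads predict object and relation classes. All module choices, sizes, and the propagation scheme are illustrative assumptions, not the actual Schemata implementation.

```python
# Minimal sketch of a backbone + relational reasoning scene graph classifier.
# Illustrative only; not the Schemata architecture of [26].
import torch
import torch.nn as nn
import torchvision

class SceneGraphClassifier(nn.Module):
    def __init__(self, num_obj_classes, num_rel_classes, dim=512, steps=2):
        super().__init__()
        backbone = torchvision.models.resnet50(weights="IMAGENET1K_V2")  # ImageNet pre-trained backbone
        self.backbone = nn.Sequential(*list(backbone.children())[:-1])   # drop the classification head
        self.proj = nn.Linear(2048, dim)
        self.message = nn.Linear(dim, dim)        # message function for propagation between objects
        self.obj_head = nn.Linear(dim, num_obj_classes)
        self.rel_head = nn.Linear(2 * dim, num_rel_classes)
        self.steps = steps

    def forward(self, object_crops):
        # object_crops: (num_objects, 3, H, W) image regions, one per detected object
        x = self.proj(self.backbone(object_crops).flatten(1))
        for _ in range(self.steps):
            # every node receives the mean message of all nodes (simplified propagation)
            x = x + self.message(x).mean(dim=0, keepdim=True)
        obj_logits = self.obj_head(x)
        n = x.size(0)
        # relation logits for every ordered (subject, object) pair
        pairs = torch.cat([x.repeat_interleave(n, 0), x.repeat(n, 1)], dim=1)
        rel_logits = self.rel_head(pairs).view(n, n, -1)
        return obj_logits, rel_logits
```

In such a setup, the backbone starts from its pre-trained weights while the reasoning and prediction heads are trained on scene graph annotations such as those in Visual Genome.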
A specific instance of vision-and-language tasks is meme classification. To address the need for automatic means of detecting hateful content in memes, datasets and models [28, 3, 23] were published in the last couple of years, and shared tasks [13, 21, 11] were organized to attract interest in this task. Methods implemented for hateful meme detection can be grouped into three categories: 1. unimodal methods that use either only the text or only the image as input; 2. multimodal approaches, where image embeddings from an image encoder are fed to a text model and the two models are trained separately; and 3. multimodal methods consisting of vision+language Transformers that are pre-trained in a multimodal fashion. Current methods experiment with models from all three categories and focus on improving models from the third category by adding extra features [3, 23, 4, 16]. These features can be visual attributes extracted from CNNs [13, 16, 1] (e.g., objects, entities, or demographics), representations from CLIP [24, 23], automatically generated captions [7, 4], etc.
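As an illustration of one such feature type, the snippet below sketches how CLIP image and text embeddings for a meme could be extracted with the HuggingFace transformers API; how these features are then combined with a classifier differs per system and is not shown here.

```python
# Sketch of extracting CLIP features for a meme (image + overlaid text).
# The model name and preprocessing follow the HuggingFace transformers API;
# the downstream fusion/classification step is system-specific and omitted.
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

image = Image.new("RGB", (224, 224))  # placeholder for a meme image loaded from disk
inputs = processor(text=["example meme caption"], images=image,
                   return_tensors="pt", padding=True)
with torch.no_grad():
    image_features = model.get_image_features(pixel_values=inputs["pixel_values"])
    text_features = model.get_text_features(input_ids=inputs["input_ids"],
                                            attention_mask=inputs["attention_mask"])
# image_features / text_features can then be concatenated with other features
# and passed to a task-specific classifier.
```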
The Hateful Memes Challenge, hosted in 2020 by Facebook, was a binary classification task for hate detection [13].[3] Kiela et al. [13] created a dataset with 10,000 memes, to which they added counterfactual examples in order to make the task more challenging for unimodal approaches. They experimented with several different settings and found that multimodal methods worked best. An extended version of the Hateful Memes Challenge was included as a shared task in the Workshop on Online Abuse and Harms (WOAH) [21]. The same dataset was used, but it now included new fine-grained labels for two categories: protected category and attack type. In this shared task, multimodal approaches were dominant as well. A multimodal method introduced by [13] and subsequently also used for the shared task at WOAH [14] incorporated image embeddings as inputs to a text classifier.[4] This method belongs to the second category and is an early-fusion approach, meaning that the image embedding and the text embedding are concatenated before being fed to the classifier. Different types of image and text components are employed in different works. Specifically, ImgBERT [14] first feeds the meme images to a convolutional neural network (CNN) and extracts their embeddings. Then, the text of the meme is given as input to BERT [9] and the [CLS] token representation is extracted. The [CLS] representation is concatenated with the embedding of the meme's image and the result is used as input to the classifier. During training, only the text-based BERT part of ImgBERT is trained, while the image embeddings remain frozen.
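The following PyTorch/HuggingFace sketch illustrates this early-fusion setup: a frozen CNN embeds the image, BERT encodes the text, the [CLS] representation and the image embedding are concatenated, and a linear layer classifies the meme. The specific CNN, sizes, and training details are illustrative assumptions rather than the exact ImgBERT configuration.

```python
# Sketch of an ImgBERT-style early-fusion classifier (illustrative, not the
# exact implementation of [13, 14]): frozen image embeddings + trainable BERT.
import torch
import torch.nn as nn
import torchvision
from transformers import BertModel, BertTokenizer

class EarlyFusionMemeClassifier(nn.Module):
    def __init__(self, num_labels=2):
        super().__init__()
        cnn = torchvision.models.resnet18(weights="IMAGENET1K_V1")
        self.image_encoder = nn.Sequential(*list(cnn.children())[:-1])  # outputs (B, 512, 1, 1)
        for p in self.image_encoder.parameters():
            p.requires_grad = False                                     # image side stays frozen
        self.text_encoder = BertModel.from_pretrained("bert-base-uncased")
        self.classifier = nn.Linear(self.text_encoder.config.hidden_size + 512, num_labels)

    def forward(self, input_ids, attention_mask, images):
        cls = self.text_encoder(input_ids=input_ids,
                                attention_mask=attention_mask).last_hidden_state[:, 0]  # [CLS]
        with torch.no_grad():
            img = self.image_encoder(images).flatten(1)
        return self.classifier(torch.cat([cls, img], dim=1))  # early fusion: concatenate, classify

# toy usage
tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
model = EarlyFusionMemeClassifier()
batch = tokenizer(["example meme text"], return_tensors="pt", padding=True)
logits = model(batch["input_ids"], batch["attention_mask"], torch.randn(1, 3, 224, 224))
```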
Another dataset for detecting hateful content in memes is the MultiOFF dataset [28]. It contains memes that were extracted from social media during the 2016 U.S. presidential elections. The dataset was first shared on Kaggle and consisted of the image URL for each meme, its text, and metadata, e.g., timestamp, author, likes, etc.[5] The authors obtained the images from the URLs and discarded the metadata. In total, this dataset contains 743 memes, which were annotated as hateful or non-hateful. In [28], the authors experimented with unimodal (text only) and multimodal (text and image) approaches, and the model with the highest F1 score was a CNN operating only on the text of the memes.
The above-mentioned datasets focus on hate and offensive speech, but there are also datasets that cover other aspects of harmful content in memes. In [10], the authors focused on detecting propaganda in memes. They created and released a dataset with 950 memes extracted from Facebook groups, annotated for 22 different propaganda techniques. In their experiments, they used existing unimodal and multimodal models and found that the latter, especially multimodally pre-trained Transformers, perform best in their setting. Recently, a challenge called Multimedia Automatic Misogyny Identification (MAMI) focused on detecting misogyny and its exact form (i.e., stereotype, shaming, objectification, or violence) in memes [11]. In [23], the authors studied harm in memes and proposed a framework to detect harmful memes and the entities they target. They also released their dataset, which contains 7,096 memes in total, covering politics and COVID-19.
The existing challenge sets, resources, and models show the importance of analyzing internet memes and the difficulty of combining all the modalities that form a meme. However, current works focus on incorporating visual information in the form of individual features like the ones described above, or automatically generated captions. We propose a novel approach that represents the visual content of memes using scene graphs, hence "translating" the images into textual form. Furthermore, current methods only extract entities from the images, but not from the captions or texts. We argue that this is often not sufficient (or feasible), since memes can also incorporate screenshots of text, as is the case in the MultiOFF dataset. Hence, we approach this problem by extracting the entities from the text in order to obtain more information. We further retrieve background knowledge for each extracted entity, and show that this approach is worth exploring, since it allows for a more grounded and comprehensive automatic interpretation of memes.
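As a rough illustration of this idea, the sketch below serializes a scene graph into text, extracts named entities from the meme text, and appends background descriptions for those entities. The function names, the spaCy NER model, and the knowledge lookup are hypothetical placeholders rather than the exact pipeline described in Section 3.

```python
# Sketch of turning a meme into a single textual input: serialized scene graph
# triples + meme text + background knowledge for extracted entities.
# All names and the knowledge source are illustrative placeholders.
import spacy

nlp = spacy.load("en_core_web_sm")  # assumes the small English spaCy model is installed

def serialize_scene_graph(triples):
    # triples: list of (subject, relation, object) strings from a scene graph generator
    return ". ".join(f"{s} {r} {o}" for s, r, o in triples)

def extract_entities(text):
    return [(ent.text, ent.label_) for ent in nlp(text).ents]

def build_model_input(meme_text, scene_graph_triples, knowledge_base):
    graph_text = serialize_scene_graph(scene_graph_triples)
    # knowledge_base: hypothetical mapping from entity surface forms to short descriptions
    background = ". ".join(knowledge_base.get(name, "") for name, _ in extract_entities(meme_text))
    return " [SEP] ".join([meme_text, graph_text, background])

# toy usage
triples = [("man", "holding", "sign"), ("sign", "has", "text")]
kb = {"Washington": "capital of the United States"}
print(build_model_input("a meme about Washington", triples, kb))
```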
[3] https://www.drivendata.org/competitions/64/hateful-memes/
[4] The method was called Concat BERT in [13] and ImgBERT in [14]. Here we call it ImgBERT because we use the implementation of [14].
[5] https://www.kaggle.com/datasets/SIZZLE/2016electionmemes
This paper is available on arXiv under a CC 4.0 license.