Authors: (1) Vasiliki Kougia, University of Vienna, Faculty of computer science, Vienna, Austria & UniVie Doctoral School Computer Science, Vienna, Austria; (2) Simon Fetze, University of Vienna, Faculty of computer science, Vienna, Austria; (3) Thomas Kirchmair, University of Vienna, Faculty of computer science, Vienna, Austria; (4) Erion Çano, University of Vienna, Faculty of computer science, Vienna, Austria; (5) Sina Moayed Baharlou, Boston University, Department of Electrical and Computer Engineering, Boston, MA, USA; (6) Sahand Sharifzadeh, Ludwig Maximilians University of Munich, Faculty of Computer Science, Munich, Germany; (7) Benjamin Roth, University of Vienna, Faculty of computer science, Vienna, Austria.

Understanding a meme requires correctly interpreting both the image and the text, and connecting them with appropriate general background knowledge from outside the meme. In this work, we introduced models infused with scene graphs and world knowledge retrieved from Wikidata. As a foundational representation, we automatically generated scene graphs, which relate the most important objects in the meme image to each other. Typed objects from the scene graph and named entities from the text were extracted automatically and linked to Wikidata. This structured information (the scene graph and the information from Wikidata) was then serialized as a sequence of tokens and concatenated with the original text of the meme for classification with a Transformer language model. We found that adding the graph representation and knowledge from Wikidata improved performance on hateful meme detection, both compared to classification on text alone and compared to a multimodal model that uses pre-trained image embeddings in addition to text. We also provide a dataset with human corrections of the automatically generated graphs, and an analysis showing that the uncorrected automatic graphs and the corrected ones perform similarly well for hatefulness detection with our approach.
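The serialization step summarized above can be illustrated with a minimal sketch. It assumes the scene graph is available as (subject, predicate, object) triples and the linked Wikidata knowledge as short entity descriptions. The function names, the `;` delimiter, and the `[SEP]` separator are hypothetical choices made for illustration, not the exact format used in our experiments.

```python
from typing import Dict, List, Tuple

# A scene-graph edge as (subject, predicate, object); an assumed representation.
Triple = Tuple[str, str, str]


def serialize_graph(triples: List[Triple]) -> str:
    """Flatten scene-graph triples into one string (hypothetical format)."""
    return " ; ".join(f"{s} {p} {o}" for s, p, o in triples)


def serialize_knowledge(descriptions: Dict[str, str]) -> str:
    """Flatten entity descriptions into one string (hypothetical format)."""
    return " ; ".join(f"{entity} : {desc}" for entity, desc in descriptions.items())


def build_input(
    meme_text: str,
    triples: List[Triple],
    descriptions: Dict[str, str],
    sep: str = " [SEP] ",  # assumed separator; depends on the tokenizer in use
) -> str:
    """Concatenate meme text, serialized graph, and serialized knowledge."""
    parts = [meme_text, serialize_graph(triples), serialize_knowledge(descriptions)]
    # Skip empty parts so that missing graph or knowledge leaves no stray separator.
    return sep.join(part for part in parts if part)


example = build_input(
    "look at this",
    [("man", "wearing", "hat")],
    {"hat": "shaped head covering"},
)
print(example)
```

The resulting string would then be tokenized and passed to a text classifier. Keeping all three sources in one sequence lets a text-only Transformer attend across the meme text, the visual relations, and the background knowledge without any architectural change.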

Acknowledgments

This research was funded by the Deutsche Forschungsgemeinschaft (DFG, German Research Foundation) - RO 5127/2-1 and the Vienna Science and Technology Fund (WWTF) [10.47379/VRG19008]. We thank Christos Bintsis for participating in the manual augmentation. We also thank Matthias Aßenmacher and the anonymous reviewers for their valuable feedback.

This paper is available on arXiv under a CC 4.0 license.



