Authors:
(1) Vasiliki Kougia, University of Vienna, Faculty of Computer Science, Vienna, Austria & UniVie Doctoral School Computer Science, Vienna, Austria;
(2) Simon Fetze, University of Vienna, Faculty of Computer Science, Vienna, Austria;
(3) Thomas Kirchmair, University of Vienna, Faculty of Computer Science, Vienna, Austria;
(4) Erion Çano, University of Vienna, Faculty of Computer Science, Vienna, Austria;
(5) Sina Moayed Baharlou, Boston University, Department of Electrical and Computer Engineering, Boston, MA, USA;
(6) Sahand Sharifzadeh, Ludwig Maximilians University of Munich, Faculty of Computer Science, Munich, Germany;
(7) Benjamin Roth, University of Vienna, Faculty of Computer Science, Vienna, Austria.
Conclusion, Acknowledgments, and References
In order to understand memes, it is necessary to correctly interpret both the image and the text, and to connect them with appropriate general background knowledge (outside of the meme). In this work, we introduced models infused with scene graphs and world knowledge retrieved from Wikidata. As a foundational representation, scene graphs were generated automatically, relating the most important objects in the meme image to each other. Typed objects from the scene graph and named entities from the text were extracted automatically and linked to Wikidata. This structured information (the scene graph and the information from Wikidata) was then serialized as a sequence of tokens and concatenated with the original text from the meme for classification with a Transformer language model. We found that adding the graph representation and the knowledge from Wikidata improved performance on hateful meme detection compared to classification on the text alone, and compared to a multimodal model based on pre-trained image embeddings in addition to text. We also provide a dataset with human corrections of the automatically generated graphs, and an analysis showing that, with our approach, the uncorrected automatic graphs and the corrected ones perform similarly well for hatefulness detection.
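To make the serialization-and-classification step concrete, the following Python sketch illustrates the general idea under stated assumptions: it is not the authors' implementation, and the example triples, the Wikidata description, the [SEP]-style concatenation format, and the bert-base-uncased checkpoint are all placeholders chosen for illustration.

```python
# Minimal sketch (not the authors' code): serialize a scene graph and Wikidata
# snippets as text, concatenate them with the meme text, and classify the
# resulting sequence with a Transformer encoder.
import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification


def serialize_graph(triples):
    """Turn (subject, relation, object) triples into a flat token sequence."""
    return " ; ".join(f"{s} {r} {o}" for s, r, o in triples)


# Invented example inputs for illustration only.
meme_text = "look how happy he is"
scene_graph = [("man", "holding", "sign"), ("man", "wearing", "hat")]
wikidata_info = "sign: object conveying a message through words or symbols"

# Concatenate text, serialized graph, and retrieved knowledge into one input;
# the separator convention here is an assumption, not the paper's exact format.
sequence = f"{meme_text} [SEP] {serialize_graph(scene_graph)} [SEP] {wikidata_info}"

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModelForSequenceClassification.from_pretrained(
    "bert-base-uncased", num_labels=2
)

inputs = tokenizer(sequence, truncation=True, return_tensors="pt")
with torch.no_grad():
    logits = model(**inputs).logits
# Meaningful labels (e.g., 0 = not hateful, 1 = hateful) require fine-tuning first.
prediction = logits.argmax(dim=-1).item()
print(prediction)
```

In such a setup, the classifier sees the structured information purely as additional tokens, so any Transformer text classifier can consume it without architectural changes.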
This research was funded by the Deutsche Forschungsgemeinschaft (DFG, German Research Foundation) - RO 5127/2-1 and the Vienna Science and Technology Fund (WWTF) [10.47379/VRG19008]. We thank Christos Bintsis for participating in the manual augmentation. We also thank Matthias Aßenmacher and the anonymous reviewers for their valuable feedback.
[1] Aggarwal, P., Liman, M.E., Gold, D., Zesch, T.: VL-BERT+: Detecting protected groups in hateful multimodal memes. In: Proceedings of the 5th Workshop on Online Abuse and Harms (WOAH 2021). pp. 207–214. Online (Aug 2021). https://doi.org/10.18653/v1/2021.woah-1.22
[2] Luz de Araujo, P.H., Roth, B.: Checking HateCheck: a cross-functional analysis of behaviour-aware learning for hate speech detection. In: Proceedings of NLP Power! The First Workshop on Efficient Benchmarking in NLP. pp. 75–83. Dublin, Ireland (2022)
[3] Behera, P., Mamta, Ekbal, A.: Only text? only image? or both? predicting sentiment of internet memes. In: Proceedings of the 17th International Conference on Natural Language Processing (ICON). pp. 444–452. Indian Institute of Technology Patna, Patna, India (2020)
[4] Blaier, E., Malkiel, I., Wolf, L.: Caption enriched samples for improving hateful memes detection. In: Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing. pp. 9350–9358. Online and Punta Cana, Dominican Republic (2021). https://doi.org/10.18653/v1/2021.emnlp-main.738
[5] Chen, S., Aguilar, G., Neves, L., Solorio, T.: Can images help recognize entities? A study of the role of images for multimodal NER. In: Proceedings of the 2021 EMNLP Workshop W-NUT: The Seventh Workshop on Noisy User-generated Text. pp. 87–96. Online and Punta Cana, Dominican Republic (2021)
[6] Chen, Y.C., Li, L., Yu, L., El Kholy, A., Ahmed, F., Gan, Z., Cheng, Y., Liu, J.: UNITER: Universal image-text representation learning. In: European Conference on Computer Vision. pp. 104–120. Online (2020)
[7] Das, A., Wahi, J.S., Li, S.: Detecting hate speech in multi-modal memes. arXiv:2012.14891 (2020)
[8] Deng, J., Dong, W., Socher, R., Li, L.J., Li, K., Fei-Fei, L.: ImageNet: A Large-Scale Hierarchical Image Database. In: IEEE Conference on Computer Vision and Pattern Recognition. pp. 248–255. Miami Beach, FL, USA (2009)
[9] Devlin, J., Chang, M.W., Lee, K., Toutanova, K.: BERT: Pre-training of deep bidirectional transformers for language understanding. In: Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers). pp. 4171–4186. Minneapolis, Minnesota, USA (2019)
[10] Dimitrov, D., Bin Ali, B., Shaar, S., Alam, F., Silvestri, F., Firooz, H., Nakov, P., Da San Martino, G.: Detecting propaganda techniques in memes. In: Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing (Volume 1: Long Papers). pp. 6603–6617. Online (2021). https://doi.org/10.18653/v1/2021.acl-long.516
[11] Fersini, E., Gasparini, F., Rizzi, G., Saibene, A., Chulvi, B., Rosso, P., Lees, A., Sorensen, J.: SemEval-2022 task 5: Multimedia automatic misogyny identification. In: Proceedings of the 16th International Workshop on Semantic Evaluation (SemEval-2022). pp. 533–549. Seattle, United States (2022)
[12] Gan, Z., Chen, Y.C., Li, L., Zhu, C., Cheng, Y., Liu, J.: Large-scale adversarial training for vision-and-language representation learning. arXiv:2006.06195 (2020)
[13] Kiela, D., Firooz, H., Mohan, A., Goswami, V., Singh, A., Ringshia, P., Testuggine, D.: The hateful memes challenge: Detecting hate speech in multimodal memes. Advances in Neural Information Processing Systems 33, 2611–2624 (2020)
[14] Kougia, V., Pavlopoulos, J.: Multimodal or text? retrieval or BERT? benchmarking classifiers for the shared task on hateful memes. In: Proceedings of the 5th Workshop on Online Abuse and Harms (WOAH 2021). pp. 220–225. Online (2021). https://doi.org/10.18653/v1/2021.woah-1.24
[15] Krishna, R., Zhu, Y., Groth, O., Johnson, J., Hata, K., Kravitz, J., Chen, S., Kalantidis, Y., Li, L.J., Shamma, D.A., et al.: Visual genome: Connecting language and vision using crowdsourced dense image annotations. International journal of computer vision 123(1), 32–73 (2017)
[16] Lee, R.K.W., Cao, R., Fan, Z., Jiang, J., Chong, W.H.: Disentangling hate in online memes. In: Proceedings of the 29th ACM International Conference on Multimedia. pp. 5138–5147 (2021)
[17] Li, J., Ataman, D., Sennrich, R.: Vision matters when it should: Sanity checking multimodal machine translation models. In: Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing. pp. 8556–8562. Online and Punta Cana, Dominican Republic (2021)
[18] Li, L.H., Yatskar, M., Yin, D., Hsieh, C.J., Chang, K.W.: VisualBERT: A simple and performant baseline for vision and language. arXiv:1908.03557 (2019)
[19] Li, X., Yin, X., Li, C., Zhang, P., Hu, X., Zhang, L., Wang, L., Hu, H., Dong, L., Wei, F., et al.: Oscar: Object-semantics aligned pre-training for vision-language tasks. In: European Conference on Computer Vision. pp. 121–137. Online (2020)
[20] Lu, J., Batra, D., Parikh, D., Lee, S.: ViLBERT: Pretraining task-agnostic visiolinguistic representations for vision-and-language tasks. arXiv:1908.02265 (2019)
[21] Mathias, L., Nie, S., Mostafazadeh Davani, A., Kiela, D., Prabhakaran, V., Vidgen, B., Waseem, Z.: Findings of the WOAH 5 shared task on fine grained hateful memes detection. In: Proceedings of the 5th Workshop on Online Abuse and Harms (WOAH 2021). pp. 201–206. Online (2021). https://doi.org/10.18653/v1/2021.woah-1.21
[22] Mozes, M., Schmitt, M., Golkov, V., Schütze, H., Cremers, D.: Scene graph generation for better image captioning? arXiv:2109.11398 (2021)
[23] Pramanick, S., Sharma, S., Dimitrov, D., Akhtar, M.S., Nakov, P., Chakraborty, T.: MOMENTA: A multimodal framework for detecting harmful memes and their targets. In: Findings of the Association for Computational Linguistics: EMNLP 2021. pp. 4439–4455. Punta Cana, Dominican Republic (2021). https://doi.org/10.18653/v1/2021.findings-emnlp.379
[24] Radford, A., Kim, J.W., Hallacy, C., Ramesh, A., Goh, G., Agarwal, S., Sastry, G., Askell, A., Mishkin, P., Clark, J., et al.: Learning transferable visual models from natural language supervision. In: International conference on machine learning. pp. 8748–8763 (2021)
[25] Sharifzadeh, S., Baharlou, S.M., Schmitt, M., Schütze, H., Tresp, V.: Improving scene graph classification by exploiting knowledge from texts. Proceedings of the AAAI Conference on Artificial Intelligence 36(2), 2189–2197 (2022)
[26] Sharifzadeh, S., Baharlou, S.M., Tresp, V.: Classification by attention: Scene graph classification with prior knowledge. In: Proceedings of the Thirty-Fifth AAAI Conference on Artificial Intelligence (AAAI-21). pp. 5025–5033. Online (2021)
[27] Su, W., Zhu, X., Cao, Y., Li, B., Lu, L., Wei, F., Dai, J.: VL-BERT: Pre-training of generic visual-linguistic representations. arXiv:1908.08530 (2019)
[28] Suryawanshi, S., Chakravarthi, B.R., Arcan, M., Buitelaar, P.: Multimodal meme dataset (MultiOFF) for identifying offensive content in image and text. In: Proceedings of the Second Workshop on Trolling, Aggression and Cyberbullying. pp. 32–41. Marseille, France (2020)
[29] Yang, J., Lu, J., Lee, S., Batra, D., Parikh, D.: Graph R-CNN for scene graph generation. In: Proceedings of the European Conference on Computer Vision (ECCV). pp. 670–685 (2018)
[30] Yang, X., Tang, K., Zhang, H., Cai, J.: Auto-encoding scene graphs for image captioning. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. pp. 10685–10694. Long Beach, CA, USA (2019)
[31] Yin, Y., Meng, F., Su, J., Zhou, C., Yang, Z., Zhou, J., Luo, J.: A novel graph-based multi-modal fusion encoder for neural machine translation. In: Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics (ACL). pp. 3025–3035. Online (2020)
[32] Zellers, R., Yatskar, M., Thomson, S., Choi, Y.: Neural motifs: Scene graph parsing with global context. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. pp. 5831–5840 (2018)
[33] Zhu, R.: Enhance multimodal transformer with external label and in-domain pretrain: Hateful meme challenge winning solution. arXiv:2012.08290 (2020)
This paper is available on arXiv under a CC 4.0 license.