Overview of Memotion 3: Sentiment & Emotion Analysis of Codemixed Hinglish - Participating Systems

This paper is available on arXiv under a CC 4.0 license.

Authors:
(1) Shreyash Mishra, IIIT Sri City, India (equal contribution);
(2) S Suryavardan, IIIT Sri City, India (equal contribution);
(3) Megha Chakraborty, University of South Carolina, USA;
(4) Parth Patwa, UCLA, USA;
(5) Anku Rani, University of South Carolina, USA;
(6) Aman Chadha, Stanford University, USA, and Amazon AI, USA (work does not relate to position at Amazon);
(7) Aishwarya Reganti, CMU, USA;
(8) Amitava Das, University of South Carolina, USA;
(9) Amit Sheth, University of South Carolina, USA;
(10) Manoj Chinnakotla, Microsoft, USA;
(11) Asif Ekbal, IIT Patna, India;
(12) Srijan Kumar, Georgia Tech, USA.

Table of Links

Abstract & Introduction
Related Work
Task Details
Participating systems
Results
Conclusion and Future Work and References

4. Participating systems

There were 47 team registrations for the task on the Memotion 3.0 task page, of which 5 teams made submissions on the final test set. The results for all three tasks are given in the following section, and an overview of the 4 teams that presented description papers is provided below.

Table 1: Leaderboard of teams on Task A: Sentiment Analysis.

wentaorub [68] use CLIP [69] to obtain separate text and image embeddings, which are concatenated and passed through multi-headed attention layers for classification. They also use the OSCAR [70] model, and an ensemble of their models is used for the final submission. Datasets such as Facebook Hateful Memes [51] and MMHS150k [49] are used for pre-training. This architecture helped them achieve the best performance in Tasks B and C.

NYCU_TWO [71] propose two model pipelines, namely the Cooperative Teaching Model (CTM) for Task A and the Cascaded Emotion Classifier (CEC) for Tasks B and C. A fusion of multimodal embeddings from pre-trained Swin-Transformer [72] and CLIP is passed to the CTM and CEC pipelines. CEC leverages Task C predictions for Task B by jointly training the model. This team attained the best results in 3 out of the 4 labels in Task B.

NUAA-QMUL-AIIT [73] refer to their approach as Squeeze-and-Excitation Fusion, or SEFusion. Textual features from pre-trained RoBERTa [74] and visual features from CLIP-ViT [75] are fused to obtain multi-modal embeddings. The fusion is performed by the SEFusion module, which uses a learned activation of the squeezed features, allowing for weighted fusion of the multi-modal embeddings. This approach led to NUAA-QMUL-AIIT being the 1st ranked team in Task A.

CUFE used a LightGBM [76] classifier for every individual emotion or label in all tasks. The inputs to the classifier were text embeddings from pre-trained Hinglish-DistilBERT and image embeddings from ResNet18 [77]. Other features, such as character occurrences and word count from the text and the number of faces in the memes (extracted using Facenet’s [78] multitask cascaded CNN), were also used. CUFE obtained the highest score in 2 labels in Task B and 1 label in Task C.
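
To make the SEFusion description above more concrete, here is a minimal NumPy sketch of a squeeze-and-excitation style gate applied to concatenated text and image features. It is an illustration under stated assumptions, not the participating team's implementation: the function name `se_fusion`, the two-layer bottleneck, the feature sizes, and the random weights standing in for learned parameters are all hypothetical.

```python
import numpy as np


def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))


def se_fusion(text_feat, image_feat, w_squeeze, w_excite):
    """Gate concatenated text/image features with SE-style channel weights.

    text_feat:  (batch, d_text) text embeddings
    image_feat: (batch, d_image) image embeddings
    w_squeeze:  (d_text + d_image, d_hidden) bottleneck projection
    w_excite:   (d_hidden, d_text + d_image) expansion back to channel count
    """
    # Concatenate both modalities into one joint feature vector.
    joint = np.concatenate([text_feat, image_feat], axis=-1)
    # "Squeeze": project to a small bottleneck and apply ReLU.
    hidden = np.maximum(joint @ w_squeeze, 0.0)
    # "Excitation": one gate per joint feature, produced by a sigmoid.
    gates = sigmoid(hidden @ w_excite)
    # Weighted fusion: rescale every joint feature by its gate.
    return joint * gates, gates


# Hypothetical sizes and random stand-ins for learned weights (assumptions).
rng = np.random.default_rng(0)
batch, d_text, d_image, d_hidden = 2, 8, 6, 4
text_feat = rng.normal(size=(batch, d_text))
image_feat = rng.normal(size=(batch, d_image))
w_squeeze = rng.normal(scale=0.1, size=(d_text + d_image, d_hidden))
w_excite = rng.normal(scale=0.1, size=(d_hidden, d_text + d_image))

fused, gates = se_fusion(text_feat, image_feat, w_squeeze, w_excite)
print(fused.shape)
```

In a trained model the two weight matrices would be learned jointly with the classifier, so the gates can emphasize whichever modality is more informative for a given meme; the fused vector would then be passed to a classification head.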