This paper is available on arXiv under a CC 4.0 license.
Authors:
(1) Shreyash Mishra, IIIT Sri City, India (equal contribution);
(2) S Suryavardan, IIIT Sri City, India (equal contribution);
(3) Megha Chakraborty, University of South Carolina, USA;
(4) Parth Patwa, UCLA, USA;
(5) Anku Rani, University of South Carolina, USA;
(6) Aman Chadha, Stanford University, USA and Amazon AI, USA (work does not relate to his position at Amazon);
(7) Aishwarya Reganti, CMU, USA;
(8) Amitava Das, University of South Carolina, USA;
(9) Amit Sheth, University of South Carolina, USA;
(10) Manoj Chinnakotla, Microsoft, USA;
(11) Asif Ekbal, IIT Patna, India;
(12) Srijan Kumar, Georgia Tech, USA.
There were 47 team registrations on the Memotion 3.0 task page, of which 5 teams submitted predictions on the final test set. The results for all three tasks are given in the following section, and an overview of the 4 teams that submitted system description papers is provided below.
Table 1: Leaderboard of teams on Task A: Sentiment Analysis.
wentaorub [68] use CLIP [69] to obtain separate text and image embeddings, which are concatenated and passed through multi-headed attention layers for classification. They also use the OSCAR [70] model in this approach, and an ensemble of their models is used for the final submission. Datasets such as Facebook Hateful Memes [51] and MMHS150k [49] are used for pre-training. This architecture helped them achieve the best performance in Tasks B and C.
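A minimal sketch of this kind of fusion is given below, assuming a PyTorch implementation with Hugging Face CLIP; the checkpoint name, embedding dimension, attention layout, and classification head are illustrative assumptions, not the team's exact configuration.

```python
# Sketch: concatenate CLIP text/image embeddings as a two-token sequence,
# refine with multi-head self-attention, then classify (assumed setup).
import torch
import torch.nn as nn
from transformers import CLIPModel

class ClipAttentionClassifier(nn.Module):
    def __init__(self, num_classes: int = 3, dim: int = 512, heads: int = 8):
        super().__init__()
        self.clip = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
        self.attn = nn.MultiheadAttention(embed_dim=dim, num_heads=heads, batch_first=True)
        self.head = nn.Linear(dim, num_classes)

    def forward(self, pixel_values, input_ids, attention_mask):
        img = self.clip.get_image_features(pixel_values=pixel_values)        # (B, 512)
        txt = self.clip.get_text_features(input_ids=input_ids,
                                          attention_mask=attention_mask)     # (B, 512)
        tokens = torch.stack([img, txt], dim=1)                              # (B, 2, 512)
        fused, _ = self.attn(tokens, tokens, tokens)                         # attention over modalities
        return self.head(fused.mean(dim=1))                                  # (B, num_classes)
```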
NYCU_TWO [71] propose two model pipelines, namely the Cooperative Teaching Model (CTM) for Task A and the Cascaded Emotion Classifier (CEC) for Tasks B and C. A fusion of multimodal embeddings from a pre-trained Swin-Transformer [72] and CLIP is passed to the CTM and CEC pipelines. CEC leverages Task C predictions for Task B by jointly training the model. This team attained the best results in 3 out of the 4 labels in Task B.
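The sketch below illustrates one way such a cascade can feed Task C intensity predictions into the Task B presence head under joint training; the feature dimension, head sizes, and wiring are assumptions for illustration rather than the CEC architecture as published.

```python
# Sketch: a cascaded head where Task C (intensity) logits condition the
# Task B (emotion presence) head; both are trained jointly (assumed layout).
import torch
import torch.nn as nn

class CascadedEmotionClassifier(nn.Module):
    def __init__(self, feat_dim: int = 1024 + 512, num_emotions: int = 4, num_intensities: int = 4):
        super().__init__()
        self.num_emotions = num_emotions
        self.num_intensities = num_intensities
        # Task C: one intensity distribution per emotion.
        self.task_c_head = nn.Linear(feat_dim, num_emotions * num_intensities)
        # Task B: presence score per emotion, conditioned on Task C logits.
        self.task_b_head = nn.Linear(feat_dim + num_emotions * num_intensities, num_emotions)

    def forward(self, fused_features):
        # fused_features: concatenated Swin-Transformer + CLIP embeddings, shape (B, feat_dim)
        c_logits = self.task_c_head(fused_features)                                   # (B, E * I)
        b_logits = self.task_b_head(torch.cat([fused_features, c_logits], dim=-1))    # (B, E)
        return b_logits, c_logits.view(-1, self.num_emotions, self.num_intensities)
```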
NUAA-QMUL-AIIT [73] refer to their approach as Squeeze-and-Excitation Fusion, or SEFusion. Textual features from a pre-trained RoBERTa [74] and visual features from CLIP-ViT [75] are fused to obtain multimodal embeddings. The fusion is performed by the SEFusion module, which uses a learned activation of the squeezed features to achieve a weighted fusion of the multimodal embeddings. This approach led to NUAA-QMUL-AIIT being the 1st ranked team in Task A.
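A minimal sketch of a squeeze-and-excitation style fusion in this spirit is shown below; the feature dimensions, reduction ratio, and classification head are assumptions, not the team's reported settings.

```python
# Sketch: squeeze the concatenated text/vision features through a bottleneck,
# use its sigmoid activation as per-channel weights, then classify (assumed setup).
import torch
import torch.nn as nn

class SEFusion(nn.Module):
    def __init__(self, text_dim: int = 768, vis_dim: int = 768, reduction: int = 16, num_classes: int = 3):
        super().__init__()
        fused_dim = text_dim + vis_dim
        self.excite = nn.Sequential(
            nn.Linear(fused_dim, fused_dim // reduction),  # squeeze
            nn.ReLU(inplace=True),
            nn.Linear(fused_dim // reduction, fused_dim),  # excite
            nn.Sigmoid(),                                  # channel weights in (0, 1)
        )
        self.head = nn.Linear(fused_dim, num_classes)

    def forward(self, text_feat, vis_feat):
        # text_feat: RoBERTa sentence embedding; vis_feat: CLIP-ViT image embedding
        fused = torch.cat([text_feat, vis_feat], dim=-1)
        weights = self.excite(fused)           # learned channel-wise gates
        return self.head(fused * weights)      # weighted fusion -> classifier
```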
CUFE used a LightGBM [76] classifier for each individual emotion or label in all tasks. The inputs to the classifier were text embeddings from a pre-trained Hinglish-DistilBERT and image embeddings from ResNet18 [77]. Additional features, such as character occurrence counts and word count from the text, and the number of faces in the meme (extracted using FaceNet's [78] multi-task cascaded CNN), were also used. CUFE obtained the highest score in 2 labels in Task B and 1 label in Task C.
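The following sketch shows how such a per-label LightGBM pipeline could be assembled from precomputed embeddings and handcrafted features; the helper names, hyperparameters, and label keys are hypothetical and only illustrate the general recipe.

```python
# Sketch: one LightGBM classifier per label, trained on concatenated text
# embeddings, image embeddings, and simple handcrafted features (assumed setup).
import numpy as np
import lightgbm as lgb

def build_features(text_emb: np.ndarray, img_emb: np.ndarray,
                   word_counts: np.ndarray, num_faces: np.ndarray) -> np.ndarray:
    # text_emb: (N, d_t) DistilBERT embeddings; img_emb: (N, d_i) ResNet18 embeddings;
    # word_counts, num_faces: (N,) handcrafted scalar features.
    return np.hstack([text_emb, img_emb,
                      word_counts.reshape(-1, 1),
                      num_faces.reshape(-1, 1)])

def train_per_label(X: np.ndarray, y_per_label: dict) -> dict:
    # y_per_label maps a label name (e.g. "humour") to its (N,) class-id array.
    models = {}
    for label, y in y_per_label.items():
        clf = lgb.LGBMClassifier(n_estimators=300, learning_rate=0.05)
        clf.fit(X, y)
        models[label] = clf
    return models
```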