Is GPT Powerful Enough to Analyze the Emotions of Memes?: Experiment Results

Written by escholar | Published 2024/02/14
Tech Story Tags: chatgpt | memes | emotion-recognition | artificial-intelligence | data-analysis | machine-learning | meme-analysis | nlp


This paper is available on arXiv under the CC 4.0 license.

Authors:

(1) Jingjing Wang, School of Computing;

(2) Joshua Luo, The Westminster Schools;

(3) Grace Yang, South Windsor High School;

(4) Allen Hong, D.W. Daniel High School;

(5) Feng Luo, School of Computing.

Table of Links

Abstract & Introduction

Related Work

Methodology

Experiment Results

Discussion

References

IV. EXPERIMENT RESULTS

In this section, we investigate the effectiveness of GPT in analyzing sentiments in memes using the two datasets described in Section 3. After carefully evaluating each meme with the designed prompts, we collated and analyzed the results. A comprehensive analysis of our findings follows.

Case 1: Results on the Facebook Hateful Memes dataset

Our initial investigation evaluated GPT’s proficiency in identifying hateful content within memes using Facebook’s Hate/Non-Hate Classification Dataset. We repeated the experiment four times, each time with a different prompt.

Hateful Content Identification: The task of identifying hateful memes posed a significant challenge. It required the AI to understand not only the content of the image and the accompanying text but also their nuanced interplay. The results showed an accuracy rate of 39% in detecting hateful content, with a relatively high standard deviation of 6.87%. GPT is thus far less accurate at classifying hateful memes, and the choice of prompt considerably affects its decisions. To accurately flag hateful memes, a model needs to grasp not only the overt message but also the undertones, context, and subtext. The 39% accuracy rate therefore provides a baseline for improvement and future model refinements.
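The evaluation protocol above (repeating the same classification task with several prompt variants, then reporting the mean accuracy and its standard deviation across runs) can be sketched as follows. The labels and per-prompt predictions below are made up for illustration; they are not the paper's data or code:

```python
# Sketch of the repeated-prompt evaluation protocol: score each prompt
# variant's predictions against the gold labels, then summarize with the
# mean accuracy and its standard deviation across runs.

def accuracy(preds, labels):
    """Fraction of predictions that match the gold labels."""
    correct = sum(p == y for p, y in zip(preds, labels))
    return correct / len(labels)

def summarize_runs(accuracies):
    """Mean and population standard deviation over per-prompt runs."""
    n = len(accuracies)
    mean = sum(accuracies) / n
    std = (sum((a - mean) ** 2 for a in accuracies) / n) ** 0.5
    return mean, std

# Four hypothetical prompt variants scored on a tiny, made-up label set
# (1 = hateful, 0 = non-hateful).
labels = [1, 1, 0, 0, 1]
runs = [
    [1, 0, 0, 0, 0],  # prompt A
    [1, 1, 0, 1, 0],  # prompt B
    [0, 1, 0, 0, 1],  # prompt C
    [1, 1, 1, 0, 0],  # prompt D
]
per_prompt = [accuracy(r, labels) for r in runs]
mean_acc, std_acc = summarize_runs(per_prompt)
```

A high `std_acc` relative to `mean_acc`, as with the hateful class above, signals that the model's decisions are sensitive to prompt wording.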

Non-Hateful Content Identification: On a brighter note, GPT’s performance in recognizing non-hateful memes was much stronger, with an impressive accuracy rate of 80% and a much lower standard deviation of 2.69%. The high accuracy suggests that GPT is quite adept at discerning memes that are not hateful, and the different prompts had minimal impact on its decisions.

Case 2: Results on the Memotion Analysis dataset

Overall Sentiment Classification: The first task in this series was classifying the overall sentiment of a meme as positive, negative, or neutral. Here, GPT achieved an accuracy rate of 79% for positive memes and 35% for negative ones. This result is consistent with its performance on hateful-meme classification in the Facebook dataset: GPT reliably recognizes positive sentiment in a meme from both the visual content and the accompanying text, while the low accuracy on negative content (for example, the offensive content discussed below) again underscores the challenge of detecting it.
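Reporting accuracy separately for positive and negative memes, as in the 79% and 35% figures above, amounts to computing per-class accuracy (the fraction of each gold class that was labeled correctly, i.e., per-class recall). A minimal sketch with made-up predictions, not the paper's data:

```python
# Per-class accuracy: for each gold class, the fraction of its memes that
# the model labeled correctly. Averaging over all memes at once would hide
# the positive/negative imbalance the text describes.
from collections import defaultdict

def per_class_accuracy(preds, labels):
    """Map each gold class to the fraction of its examples predicted correctly."""
    correct = defaultdict(int)
    total = defaultdict(int)
    for p, y in zip(preds, labels):
        total[y] += 1
        if p == y:
            correct[y] += 1
    return {c: correct[c] / total[c] for c in total}

# Tiny illustrative example with three sentiment classes.
labels = ["pos", "pos", "pos", "neg", "neg", "neu"]
preds  = ["pos", "pos", "neg", "pos", "neg", "neu"]
scores = per_class_accuracy(preds, labels)
```

Here `scores["pos"]` and `scores["neg"]` would differ even though the two classes share one set of predictions, mirroring how a single model can score 79% on positive memes and 35% on negative ones.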

Humor Recognition: GPT achieved a 60% accuracy rate in recognizing humor. This outcome is remarkable, considering the intricate nature of humor as a multifaceted human emotion, often subject to cultural, social, and individual perspectives. The considerable accuracy indicates that GPT can comprehend the humorous aspects of a diverse range of memes, and implies that it has learned to recognize various indicators of humor to a certain degree. These indicators may be overt, such as punchlines or visual gags, or implicit, such as irony or absurdity.

Sarcasm Recognition: The model’s performance in sarcasm recognition was notably lower, registering an accuracy of 45%, as illustrated in Table 2. Though lower, this accuracy is acceptable given the inherent difficulties of sarcasm detection: sarcasm often involves expressing the opposite of what is meant, requiring an understanding of context, tone, and frequently shared cultural knowledge. This observation underscores the limits of the approach in detecting subtler emotional cues and contextual nuances, and highlights the need for further study and model enhancement in this area.

Offensive Recognition: As seen in Table 2, GPT’s accuracy stood at 37% in recognizing offensive content, its weakest performance. As noted in the overall sentiment classification above, GPT struggles considerably with offensive content, and the consistency of this finding illustrates the difficulty of the task. As with hateful-meme classification, offensive content in memes may be subtle, concealed within humor or sarcasm, or expressed through a complex interplay of textual and visual elements that is hard for an AI to interpret.


Published by HackerNoon on 2024/02/14