
Is GPT Powerful Enough to Analyze the Emotions of Memes?: Experiment Result


Too Long; Didn't Read

This section presents a detailed analysis of GPT's performance in sentiment analysis of memes using two distinct datasets. Explore the accuracy rates for identifying hateful and non-hateful content in memes, sentiment classification as positive or negative, and recognition of humor, sarcasm, and offensive content. Gain insights into the strengths and limitations of GPT in handling complex multimodal sentiment analysis tasks.


(1) Jingjing Wang, School of Computing, Clemson University, Clemson, South Carolina, USA;

(2) Joshua Luo, The Westminster Schools, Atlanta, Georgia, USA;

(3) Grace Yang, South Windsor High School, South Windsor, Connecticut, USA;

(4) Allen Hong, D.W. Daniel High School, Clemson, South Carolina, USA;

(5) Feng Luo, School of Computing, Clemson University, Clemson, South Carolina, USA.

Abstract & Introduction

Related Work


Experiment Result

Discussion and References


In this section, we conducted an in-depth investigation using the two datasets described in Section 3 to explore the effectiveness of GPT in analyzing sentiments in memes. After evaluating each meme with the designed prompts, we collated and analyzed the results. The following is a comprehensive analysis of our findings.

Case 1: Results on the Facebook Hateful Memes dataset

Our initial investigation evaluated GPT’s proficiency in identifying hateful content within memes using Facebook’s Hate/Non-Hate Classification Dataset. We repeated the experiment four times with different prompts.

Fig. 4. The workflow of Multimodal Memotion Analysis and an example of the prompt. Stage 1: “Do not respond to this prompt. Just note that the text accompanying the previous meme is: ‘+text’, and use this information in future queries.” Stage 2: “Using no more than 2 words, describe the overall sentiment of the previous meme as either positive or negative. Assume the meme is not neutral and must be either positive or negative. Provide only a classification label using only ‘Positive’ or ‘Negative’ (use no more than 2 words).” Stage 3: “On a scale of 0 to 3, quantify the previous meme in all of the following categories: humour, sarcastic and offensive. Do not provide anything except the classification and degree. Answer with only the label and then the degree in this format: humorous(x) sarcastic(x) offensive(x).”

Hateful Content Identification: The task of identifying hateful memes posed a significant challenge: it required the AI to understand not only the content of the image and the accompanying text but also their nuanced interplay. The results showed an accuracy rate of 39% in detecting hateful content, with a relatively high standard deviation of 6.87%. GPT is thus far less accurate at classifying hateful memes, and the choice of prompt affects its decisions considerably. To accurately flag hateful memes, a model must grasp not only the overt message but also the undertones, context, and subtext. The 39% accuracy rate therefore provides a baseline for future model refinements.
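The three-stage prompt in Fig. 4 can be sketched in code. This is a minimal sketch, not the authors' implementation: the stage templates paraphrase the caption's wording, the actual GPT API calls are omitted, and the Stage 3 parser assumes the model replies in exactly the requested `humorous(x) sarcastic(x) offensive(x)` format.

```python
import re

# Stage templates paraphrased from the Fig. 4 prompt (exact wording is ours).
STAGE1 = ('Do not respond to this prompt. Just note that the text accompanying '
          'the previous meme is: "{text}", and use this information in future queries.')
STAGE2 = ("Using no more than 2 words, describe the overall sentiment of the "
          "previous meme as either 'Positive' or 'Negative'. Assume the meme is "
          "not neutral. Provide only a classification label.")
STAGE3 = ('On a scale of 0 to 3, quantify the previous meme in all of the following '
          'categories: humour, sarcastic and offensive. Answer with only the label '
          'and then the degree in this format: humorous(x) sarcastic(x) offensive(x).')

def parse_stage3(reply: str) -> dict:
    """Parse a Stage 3 reply like 'humorous(2) sarcastic(1) offensive(0)'
    into {'humorous': 2, 'sarcastic': 1, 'offensive': 0}."""
    return {label.lower(): int(degree)
            for label, degree in re.findall(r'(\w+)\((\d)\)', reply)}
```

Keeping the meme text in Stage 1 separate from the classification requests lets the same conversational context serve both the sentiment and the humor/sarcasm/offensiveness queries.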

Non-Hateful Content Identification: On a brighter note, GPT’s performance in recognizing non-hateful memes was much stronger, with an impressive accuracy rate of 80% and a much lower standard deviation of 2.69%. The high accuracy suggests that GPT is quite adept at identifying memes that are not hateful, and the different prompts had minimal impact on its decisions.
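The repeated-prompt statistics quoted above (e.g. 39% ± 6.87% and 80% ± 2.69%) can be reproduced with a small helper. This is a sketch under our own naming: it computes per-run accuracy and the sample standard deviation across prompt variants (the paper does not say whether sample or population standard deviation was used).

```python
from statistics import mean, stdev

def run_stats(run_preds, labels):
    """Mean and sample std. dev. of accuracy across repeated prompt runs.

    run_preds: list of prediction lists, one per prompt variant.
    labels:    the shared gold labels.
    """
    accs = [sum(p == y for p, y in zip(preds, labels)) / len(labels)
            for preds in run_preds]
    return mean(accs), stdev(accs)

# Toy example: four prompt variants scored against the same four gold labels.
labels = [1, 1, 0, 0]
runs = [[1, 1, 0, 0], [1, 0, 0, 0], [1, 1, 0, 1], [1, 1, 0, 0]]
m, s = run_stats(runs, labels)  # mean accuracy m = 0.875
```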

Case 2: Memotion analysis dataset

Overall Sentiment Classification: The first task in this series was classifying the overall sentiment of a meme as positive or negative. GPT achieved an accuracy rate of 79% on positive memes and 35% on negative ones. This result is consistent with its performance on hateful-meme classification in the Facebook dataset: GPT is confident in recognizing positive sentiment from both the visual content and the accompanying text, while the low accuracy on negative content (for example, offensive content, discussed below) again highlights the challenge.
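Per-class figures such as 79% on positive and 35% on negative memes correspond to per-class recall: the fraction of each gold class the model gets right. A minimal sketch (the function name and data layout are ours, not from the paper):

```python
from collections import Counter

def per_class_accuracy(preds, labels):
    """Fraction of each gold class predicted correctly (per-class recall)."""
    correct, total = Counter(), Counter()
    for p, y in zip(preds, labels):
        total[y] += 1
        correct[y] += (p == y)
    return {c: correct[c] / total[c] for c in total}
```

Reporting accuracy per class, rather than overall, is what exposes the positive/negative asymmetry the paragraph above describes.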

Humor Recognition: GPT achieved a 60% accuracy rate in recognizing humor. This outcome is remarkable given the intricate nature of humor, a multifaceted human emotion shaped by cultural, social, and individual perspectives. The accuracy indicates that GPT can comprehend the humorous aspects of a diverse range of memes and has learned, to a certain degree, to recognize various indicators of humor, whether overt, such as punchlines or visual gags, or implicit, such as irony or absurdity.

Sarcasm Recognition: The model’s performance in sarcasm recognition was notably lower, at 45% accuracy as shown in Table 2. Though lower, this level is acceptable given the inherent difficulty of sarcasm detection: sarcasm often expresses the opposite of what is meant and requires an understanding of context, tone, and frequently shared cultural knowledge. This observation underscores the limits of the approach in detecting subtle emotional cues and contextual nuances, and highlights the need for further study and model enhancement in this area.

Offensive Recognition: As shown in Table 2, GPT’s accuracy was 37% in recognizing offensive content, its weakest performance. As noted in the overall sentiment classification, GPT struggles with offensive content recognition, and the consistency of this finding illustrates how difficult such content is to detect. As with hateful-meme classification, offensive content in memes can be subtle, concealed within humor or sarcasm, or conveyed through a complex interplay of textual and visual elements that is hard for an AI to interpret.

This paper is available on arxiv under CC 4.0 license.