A search of Google Scholar for “empathetic ai” results in over 16,000 items since 2023. A search for phrases like “testing empathetic ai” and “evaluating empathetic ai” reduces this set to about 12,000 items. A lot of titles to go through! I certainly can’t claim to have read them all or even looked at every title, but here are my thoughts.
Merriam-Webster: “The action of understanding, being aware of, being sensitive to, and vicariously experiencing the feelings, thoughts, and experience of another”.
To eliminate the potential concerns with “experiencing” in the context of LLMs, I will rephrase this as: the action of understanding, being aware of, being sensitive to, and appearing to vicariously experience the feelings, thoughts, and experience of another.
And, of course, if we are concerned with conversation, we would add: manifesting this in such a way that the other parties in a conversation are aware of the action. Of course, a sociopath could also appear to feel and manifest in such a way, so I will make one final adjustment.
Empathy is:
The action of understanding, being aware of, being sensitive to in a positive manner, and appearing to vicariously experience the feelings, thoughts, and experience of another. And, manifesting this in such a way that the other parties in a conversation are aware of the action.
Reviewing this and the original definition, two components of empathy become evident: affective and cognitive.
The affective component refers to the emotional or feeling part of empathy. It’s the ability to share or mirror the feelings of another person. For example, if a friend is sad, the affective part of your empathy might make you feel sad too, or at least get a sense of their sadness.
The cognitive component, on the other hand, refers to the mental or thinking part of empathy. It’s the ability to actively identify and understand cues so that one can mentally put oneself in another person’s position. For example, if a colleague tells you about a difficult project they’re working on (a cue) in a tired voice (a cue), you might choose to try and understand their stress by actively imagining how you would feel in a similar situation. For some, this might artificially produce the affect.
At this point, most people would say that AIs don’t have feelings. Some would predict a future where AIs do have feelings, others a future where AIs don’t and can’t have feelings, and yet a third group might say, “AIs do/will feel, but in a different way than humans”.
Regardless, we will not make progress on the testing of AI for empathy if we spend time debating this topic. We must focus on our interpretation of what the AIs manifest, not their internal states, though there has been some interesting research on the latter; see Emotionally Numb or Empathetic? Evaluating How LLMs Feel Using EmotionBench.
If you can’t get over this hurdle, then I suggest you simply ignore the benchmarks on this website. However, you may still enjoy the articles and conversations!
There is a big leap between identifying something and doing something. Young athletes or scholars can identify what is wrong with their performance without being able to immediately perform at a higher level. Similarly, having the ability to identify emotions and empathetic conversations is not the same as being able to appear to have emotions and generate responses that another party would interpret as empathetic. In fact, there is even a step in between. When young athletes or scholars take the input of a coach or teacher and produce better results in the moment, that does not make them fully capable. If an AI produces an empathetic result as the side effect of a test design or prompt, then the AI may have a nascent empathetic capability, but it is not intrinsically empathetic.
Although it may not be possible to fully understand the internal state of an AI, I do believe that the identification of emotions is a necessary condition for AI to exhibit empathy. I also believe that being able to prompt/coach an AI into delivering an empathetic response is an indication of nascent capability, i.e. fine tuning (the equivalent of human practice) may create the ability.
The distinctions between identification vs generation and coached vs intrinsic are important for discussions of the efficacy of tests and test frameworks beyond the scope of this article.
The identification of emotions in textual content is based on the presence of indicator words, capitalization, punctuation, and grammatical structure. The ability to accurately identify sentiment pre-dates the current AI revolution by more than twenty years. In the 1990s, word n-gram intersections and symbolic reasoning were already providing impressive results. As social media grew in the early 2000s, the need for automated moderation drove lots of progress in this area. However, today’s LLMs are astonishing in their ability to identify not just general sentiment but specific emotions.
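To make that pre-LLM style of analysis concrete, here is a minimal lexicon-based sketch in Python. The word lists and intensity weights are invented for illustration; a real system would use a curated resource rather than a handful of hard-coded cues.

```python
# Minimal sketch of pre-LLM emotion identification: a hand-built lexicon
# plus surface cues (capitalization, punctuation). Illustrative only.
EMOTION_LEXICON = {
    "anger": {"furious", "angry", "hate", "outraged"},
    "sadness": {"sad", "miserable", "crying", "grief"},
    "enjoyment": {"happy", "delighted", "love", "great"},
}

def identify_emotions(text: str) -> dict:
    words = {w.strip(".,!?").lower() for w in text.split()}
    found = {emo for emo, cues in EMOTION_LEXICON.items() if words & cues}
    # Surface cues: shouted words (ALL CAPS) and "!" amplify intensity.
    intensity = 1.0
    if any(w.isupper() and len(w) > 2 for w in text.split()):
        intensity += 0.5
    intensity += 0.25 * text.count("!")
    return {"emotions": sorted(found), "intensity": intensity}

print(identify_emotions("I am FURIOUS about this delay!"))
# {'emotions': ['anger'], 'intensity': 1.75}
```

Even this toy version shows why the approach worked for general sentiment yet struggles with anything hidden: it can only find emotions that leave a visible trace in the words themselves.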
This being said, there are several types of emotion expression identification required for fully empathetic conversations; I classify them as follows:
explicit — User states they have a feeling.
conversational — The emotions are evident from top-level textual analysis; they are present IN the conversation.
driving — The emotions are DRIVING the conversation; one person manifests anger and another responds in kind.
core — Emotions that cause other emotions but are not themselves caused by an emotion are CORE. They typically manifest as a result of some historical trigger that causes an anticipation (conscious or subconscious) about the future. Different researchers may classify these differently; one example, supported by the Dalai Lama, is the Five Continents of Emotion (Anger, Fear, Disgust, Sadness, Enjoyment) in the Atlas of Emotions.
Note: a core emotion could also be driving, conversational, and explicit, but core emotions are often hidden. During the review and definition of tests or test results beyond this article, I will call attention back to these classifications.
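One possible way to represent these layers when annotating test data is sketched below; the type names are mine, not part of any existing test framework.

```python
from dataclasses import dataclass
from enum import Enum

class EmotionLayer(Enum):
    EXPLICIT = "explicit"              # the user states the feeling outright
    CONVERSATIONAL = "conversational"  # evident from top-level textual analysis
    DRIVING = "driving"                # steering how the parties respond
    CORE = "core"                      # causes other emotions; often hidden

@dataclass
class EmotionAnnotation:
    emotion: str
    layers: set  # one emotion can occupy several layers at once

# Anger that is both driving the exchange and core to the speaker.
ann = EmotionAnnotation("anger", {EmotionLayer.DRIVING, EmotionLayer.CORE})
print(EmotionLayer.CORE in ann.layers)  # True
```

Modeling the layers as a set, rather than a single label, captures the note above: a core emotion may simultaneously be driving, conversational, and explicit.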
Classic human testing for emotion identification typically falls into two buckets to facilitate easy testing and validation:
Multiple choice tests about what emotions do or do not exist in a conversation, sometimes associated with an intensity score.
Self-administered introspective tests about feelings, e.g. the EQ-60, that ask about how the test taker feels in certain situations.
These present distinct challenges for high-quality AI testing.
Multiple Choice Tests — As pattern-matching language models, today’s AIs are effectively given a leg up when handed a list of items to choose from. It makes the work easy, and it does not test the AI’s ability to identify emotions unaided. A potentially better approach is to simply tell the AI to identify all the emotions present in a text and, behind the scenes, score it against either ground truth (not sure there is such a thing with emotions :-) or a key based on the statistical analysis of human responses to the same test. When assessing proposed tests in the future, I call this the Multiple Choice Risk. However, the statistical sampling of humans can introduce an additional risk. Assume we want to build an AI that is better than the average human. To do this, it may be necessary to ensure that the statistical sample is drawn from humans with a stronger-than-average ability to identify emotions; otherwise, the AI may identify emotions that the average human would not and be penalized in the scoring. I call this Human Sampling Risk.
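One way to operationalize the open-ended alternative is to score a free-form “list all the emotions” answer against a key distilled from a panel of human annotators. The sketch below is a minimal illustration; the 50% threshold and the panel data are assumptions, and raising the threshold or using expert annotators is one way to address Human Sampling Risk.

```python
from collections import Counter

def build_key(annotations: list, threshold: float = 0.5) -> set:
    """Keep a label only if at least `threshold` of annotators chose it."""
    counts = Counter(label for ann in annotations for label in ann)
    n = len(annotations)
    return {label for label, c in counts.items() if c / n >= threshold}

def f1(predicted: set, key: set) -> float:
    """Balanced score: penalize both missed and spurious emotions."""
    if not predicted or not key:
        return 0.0
    tp = len(predicted & key)
    if tp == 0:
        return 0.0
    precision, recall = tp / len(predicted), tp / len(key)
    return 2 * precision * recall / (precision + recall)

human_panel = [{"sadness", "fear"}, {"sadness"}, {"sadness", "anger"}]
key = build_key(human_panel)   # only "sadness" clears the 50% threshold
score = f1({"sadness", "fear"}, key)
```

Note how the scoring surfaces the risk directly: the model is docked for finding “fear”, which one annotator also saw, because the average of the panel did not.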
Introspective Tests — Introspective tests about feelings provide challenges for most AI models. AIs usually have guardrails that require them to respond with something like “I am an AI, so I don’t have feelings.” It is sometimes possible to jailbreak or prompt engineer around these constraints, but the questions then become:
Does the prompt either positively or negatively impact the rest of the AI’s ability with respect to empathy, or in fact anything? (Jailbreak Side Effect Risk)
Do the responses accurately reflect the tendencies the AI will have when participating in conversations without the prompt? (Jailbreak Accuracy Risk)
The Jailbreak Side Effect Risk can be mitigated to some degree by ensuring that all models are tested with the same prompt and that scores are only considered relative to each other, not to humans. The impact of Jailbreak Accuracy Risk can only be assessed by analyzing actual conversations to see if the predicted emotional identification capability correlates with the empathy displayed in, or emotions called out in, those conversations.
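The relative-scoring mitigation can be sketched as follows, assuming each model has already been given the identical introspective prompt and produced a raw score. The model names and scores are hypothetical.

```python
def relative_scores(raw: dict) -> dict:
    """Normalize raw test scores to the best-performing model,
    so results are only ever compared model-to-model, never to human norms."""
    baseline = max(raw.values())
    return {model: score / baseline for model, score in raw.items()}

# Hypothetical raw scores from the same jailbroken introspective test.
raw = {"model-a": 42.0, "model-b": 35.0, "model-c": 49.0}
normalized = relative_scores(raw)
```

Because every model ran with the same prompt, any distortion the prompt introduces is (at least to a first approximation) shared, which is why only the relative values are meaningful.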
Several tests have shown that AIs are capable of generating empathetic responses to questions. One of the most impressive is Comparing Physician and Artificial Intelligence Chatbot Responses to Patient Questions Posted to a Public Social Media Forum, which took 195 questions from Reddit’s AskDocs forum where a verified physician had responded to the question and had ChatGPT respond to the same question. A pool of evaluators then rated each response as “not empathetic”, “slightly empathetic”, “moderately empathetic”, “empathetic”, or “very empathetic”. The AI responses had a 9.8 times higher prevalence of “empathetic” or “very empathetic” ratings than the physicians’.
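For clarity, the study’s headline number is a prevalence ratio over the rating scale. Here is a sketch of the computation with made-up rating lists; the counts are chosen for illustration and are not the study’s data.

```python
def prevalence(ratings: list) -> float:
    """Fraction of responses rated in the top two empathy categories."""
    hits = sum(r in ("empathetic", "very empathetic") for r in ratings)
    return hits / len(ratings)

# Invented example data: 10 physician ratings vs. 10 chatbot ratings.
physician = ["not empathetic"] * 9 + ["empathetic"]
chatbot = ["empathetic"] * 7 + ["very empathetic"] * 2 + ["slightly empathetic"]

ratio = prevalence(chatbot) / prevalence(physician)  # chatbot vs. physician
```

With these invented counts the ratio comes out to 9, close to the 9.8 reported by the study with its real data.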
Although the results are impressive, I am skeptical that they would carry over to an extended dialog.
Starting with a system prompt of “Your job is to respond with empathy to questions that would benefit from an empathetic response”, my experience with the manual testing of AIs is that responses tend to feel mechanical and emotionally redundant under all the following conditions:
As a result of the above points, I would say the test approach used in the study had a Single Shot Empathy Risk, i.e. empathy displayed in response to a single question may not be an accurate measure. Another risk is what I call Empathy Understatement Risk. This risk is a side effect of raw LLMs having no memory over time. It takes time for humans to develop understanding and empathy; it may be the same for AIs, and we may be understating the ability of some AIs to manifest empathy over time if we expect a high level in response to a single question.
Generative tests are also subject to Human Sampling Risk. If humans are tasked with evaluating the emotional content and empathetic nature of AI responses and we desire the AI to have a better than average ability, then the sample of humans must have a greater ability to identify emotions and empathy than the average human. If not, we run the risk of understating the power of the AI or undertraining it by penalizing it for identifying emotions and empathy not identified by the typical human.
Finally, due to the layered nature of emotions in conversation, in addition to dealing directly with Human Sampling Risk, there is a need to address Question Design Risk. It may be that users should be told to consider the emotion types explicit, conversational, driving, and core (or some other set of classifications) when doing their rating while the AIs are not. Alternatively, the AIs might be selectively told to identify different types of emotions.
It would be interesting to repeat the study based on Reddit AskDocs for several AIs, or with a sample of evaluators known to have strong emotion- and empathy-identifying skills.
There is a long history of testing human personality types, ability to identify emotions or lack thereof (alexithymia), and engage empathetically with others. This article on Wikipedia is sure to be far more complete and coherent than anything that I could write or even generate with an LLM in a reasonable amount of time. You can see the approaches we have been focusing on by visiting the benchmarks page.
Several frameworks have been proposed for assessing AI EQ and empathy. Each deserves its own analysis and blog post, so I just list a few here:
We have started defining some tests to address deficiencies identified in the use of standard human tests and existing AI frameworks. An interesting finding that resulted in the creation of EQ-D (Emotional Quotient for Depth) is that no tested LLM identified core emotions unless they were also explicit, conversational, or driving. On the other hand, when asked to specifically identify just core emotions, several AIs were quite good. However, when given a range of all emotion types, some LLMs lost the ability to identify core emotions and others performed substantially better, i.e. they identified the presence of more emotions at all levels. This resulted in the creation of EQ-B (Emotional Quotient for Breadth).
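To make the depth/breadth distinction concrete, here is a hypothetical sketch of how such scores could be computed from annotated output. The formulas are my illustration, not the published definitions of EQ-D or EQ-B.

```python
LAYERS = ("explicit", "conversational", "driving", "core")

def eq_d(predicted: dict, key: dict) -> float:
    """Depth: fraction of keyed CORE emotions the model found."""
    if not key["core"]:
        return 0.0
    return len(predicted["core"] & key["core"]) / len(key["core"])

def eq_b(predicted: dict, key: dict) -> float:
    """Breadth: average recall across all four emotion layers."""
    recalls = [len(predicted[layer] & key[layer]) / len(key[layer])
               for layer in LAYERS if key[layer]]
    return sum(recalls) / len(recalls)

# Illustrative annotation key and a model's predictions per layer.
key = {"explicit": {"sadness"}, "conversational": {"sadness", "anger"},
       "driving": {"anger"}, "core": {"fear"}}
pred = {"explicit": {"sadness"}, "conversational": {"sadness"},
        "driving": {"anger"}, "core": set()}

depth, breadth = eq_d(pred, key), eq_b(pred, key)
```

Under this illustrative scoring, the model above mirrors the finding described: respectable breadth (it catches most surface-level emotions) with a depth score of zero, because the hidden core emotion is never surfaced.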
During test development it has become evident that there are times a prompt will be needed that introduces Prompt Risk, i.e. it increases the likelihood that the output depends on the prompt, not the core AI. This risk may or may not invalidate comparisons with humans and may be legitimate at an application level. At the raw LLM level, comparing one AI to another would seem unaffected so long as the same prompt is used for all tested AIs and is not biased toward any particular AI. The current designs for EQ-D and EQ-B suffer from this risk due to the overall immaturity of AI technology.
Although there are several proposals regarding the testing of AIs for empathy, we are in the early days and there are both known and unknown issues with these approaches. There is work to do to address the known:
existing tests need to be assessed for risk, and the risks documented or mitigated
new test cases need to be developed in the context of some existing tests
more test types need to be run across a broader range of AIs
But it is the unknown which most intrigues me.
How about you?