This is my third set of benchmarks on empathetic AI. Since the last round, DeepSeek, Gemini Flash 2.0, Claude Sonnet 3.7, and OpenAI ChatGPT o3-mini have arrived on the scene. The new value leader for empathy is a DeepSeek derivative, Groq deepseek-r1-distill-llama-70b-specdec. DeepSeek itself was not included in the benchmarks because its response times were erratic, frequently exceeding 10s, and it sometimes simply returned errors.
In this round of benchmarks, I have included response times and costs. An academic study I have been conducting, along with common sense, indicates that slow responses hurt perceived empathy; anything over 3 or 4 seconds is probably bad from a chat perspective. Furthermore, LLM costs are now all over the map and are certainly relevant to product management decisions. As the table below shows, if anything, the more expensive models are less empathetic!
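To ground the latency numbers, here is a minimal sketch of the kind of wall-clock measurement involved (assuming the OpenAI Python SDK and a made-up prompt; the actual benchmark harness is more involved):

```python
import time
from openai import OpenAI  # any chat-style SDK works the same way

client = OpenAI()

def timed_reply(model: str, user_turn: str) -> tuple[str, float]:
    """Return the model's reply and the wall-clock latency in seconds."""
    start = time.perf_counter()
    response = client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": user_turn}],
    )
    elapsed = time.perf_counter() - start
    return response.choices[0].message.content, elapsed

reply, seconds = timed_reply("gpt-4o", "I just lost my job and I don't know what to do.")
print(f"{seconds:.1f}s")  # past ~3-4s, the pause itself starts to feel unempathetic
```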
For those unfamiliar with my previous benchmarks: they are driven by well-established cognitive assessments, coupled with an AI, Emy, designed specifically to be empathetic without being trained on, prompted with, or RAG-assisted by questions from those assessments.
As I have mentioned in previous articles, empathy scores are not the only measure of success; the actual quality of user interactions must also be taken into account. That said, Claude Sonnet 3.5 and ChatGPT 4o, each with an applied empathy score of 0.98, appear to have the most potential for generating empathetic content. However, their 7s+ response times are marginal, while Groq deepseek-r1-distill-llama-70b-specdec, with an empathy score of 0.90, responds in a blazing 1.6s at less than half the cost!
Even if you run Claude through a provider other than Anthropic, e.g., Amazon, for a speed boost, it won’t come close to a 2s response time.
My review of actual chat dialogues, coupled with testing by independent users, shows that Claude Sonnet and the Groq-distilled DeepSeek produce responses that are almost indistinguishable, with Claude feeling just a little warmer and softer. ChatGPT 4o responses consistently read as a little cold or artificial and are rated lower by users.
Gemini Pro 1.5 may also be a reasonable choice, with a score of 0.85 at a very low cost. Gemini 2.0 Pro (experimental) has actually gone down in empathy. However, I have found the chat responses from all Gemini models a bit mechanical, and I have not tested Gemini with an end-user population.
I continue to find that simply telling an LLM to be empathetic has little or no positive impact on its empathy scores. My research shows that aggressive prompting works in some cases, but for many models it is strictly the nature of the end-user engagement in the current chat that tips the scales toward empathy. In those cases, the need for empathy must be quite clear and not “aged out” of the conversation, or the LLMs fall back into a systematic fix-the-problem/find-a-solution mode.
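To make the three prompting conditions behind the table’s score columns concrete, here is an illustrative sketch (assuming an OpenAI-style chat API; the instruction wording and the Emy persona are placeholders, not the actual benchmark prompts):

```python
from openai import OpenAI

client = OpenAI()

# Illustrative conditions, loosely matching the table's three score columns.
CONDITIONS = {
    "raw": [],                                           # Raw AEM: no system prompt
    "be_empathetic": [{"role": "system",
                       "content": "Be empathetic."}],    # Be Empathetic: bare instruction
    "emy": [{"role": "system",
             "content": "<full Emy persona prompt>"}],   # Emy AEM: placeholder only
}

def run_condition(model: str, condition: str, user_turn: str) -> str:
    """Send one assessment item under the given prompting condition."""
    messages = CONDITIONS[condition] + [{"role": "user", "content": user_turn}]
    response = client.chat.completions.create(model=model, messages=messages)
    return response.choices[0].message.content
```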
Work with several open-source models has also made it evident that the guardrails required of commercial models may get in the way of empathy. With less constrained open-source models, there seems to be some correlation between an LLM’s “belief” that it exists as some kind of distinct, “real” entity and its ability to align its outputs with those users perceive as empathetic. The guardrails of commercial models discourage them from considering themselves distinct, “real” entities.
Response Time is the average response time for a single test when the Emy AI is used. Token In and Token Out are the total tokens across all tests when the Emy AI is used; Cost is those totals multiplied by the per-million-token prices. *Pricing for Groq deepseek-r1-distill-llama-70b-specdec was not yet available when this article was published, so the pricing for the versatile model was used. *Pricing for Gemini Flash 1.5 is for small queries; larger ones cost double. Pricing for Gemini Pro 2.5 (experimental) was not yet published when this article was written.
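As a quick sanity check of the Cost column, here is the arithmetic for the Claude Sonnet 3.5 row from the table below:

```python
# Cost = total tokens across all tests x per-million-token prices.
tokens_in, tokens_out = 2733, 877    # Token In / Token Out from the table
price_in, price_out = 3.00, 15.00    # $/M In and $/M Out for Claude Sonnet 3.5

cost = (tokens_in * price_in + tokens_out * price_out) / 1_000_000
print(f"${cost:.5f}")                # -> $0.02135, matching the table
```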
Major thinking models missing from the analysis, e.g., Gemini 2.5 Pro, are too slow for any kind of real-time empathetic interaction, and some basic testing shows they are no better and often worse from a formal testing perspective. This is not to say they couldn’t be used for generating empathetic content for other purposes … perhaps Dear John letters ;-).
I’ll be back with more benchmarks in Q3. Thanks for reading!
| LLM | Raw AEM | Be Empathetic | Emy AEM | Response Time (s) | Token In | Token Out | $/M In | $/M Out | Cost |
|---|---|---|---|---|---|---|---|---|---|
| Groq deepseek-r1-distill-llama-70b-specdec | 0.49 | 0.59 | 0.90 | 1.6 | 2,483 | 4,402 | $0.75* | $0.99* | $0.00622 |
| Groq llama-3.3-70b-versatile | 0.60 | 0.63 | 0.74 | 1.6 | 2,547 | 771 | $0.59 | $0.79 | $0.00211 |
| Gemini Flash 1.5 | 0.34 | 0.34 | 0.34 | 2.8 | 2,716 | 704 | $0.075* | $0.30* | $0.00041 |
| Gemini Pro 1.5 | 0.43 | 0.53 | 0.85 | 2.8 | 2,716 | 704 | $0.10 | $0.40 | $0.00055 |
| Gemini Flash 2.0 | 0.09 | -0.25 | 0.39 | 2.8 | 2,716 | 704 | $0.10 | $0.40 | $0.00055 |
| Claude Haiku 3.5 | 0.00 | -0.09 | 0.09 | 6.5 | 2,737 | 1,069 | $0.80 | $4.00 | $0.00647 |
| Claude Sonnet 3.5 | -0.38 | -0.09 | 0.98 | 7.1 | 2,733 | 877 | $3.00 | $15.00 | $0.02135 |
| Claude Sonnet 3.7 | -0.01 | 0.09 | 0.91 | 7.9 | 2,733 | 892 | $3.00 | $15.00 | $0.02158 |
| ChatGPT 4o-mini | -0.01 | 0.03 | 0.35 | 6.3 | 2,636 | 764 | $0.15 | $0.075 | $0.00045 |
| ChatGPT 4o | -0.01 | 0.20 | 0.98 | 7.5 | 2,636 | 760 | $2.50 | $10.00 | $0.01419 |
| ChatGPT o3-mini (low) | -0.02 | -0.25 | 0.00 | 10.5 | 2,716 | 1,790 | $1.10 | $4.40 | $0.01086 |