
Testing the Depths of AI Empathy: Q1 2025 Benchmarks

by Simon Y. Blackwell, March 27th, 2025

Too Long; Didn't Read

The latest empathy benchmarks reveal Groq's DeepSeek model (deepseek-r1-distill-llama-70b-specdec) offers the best balance of empathy, speed, and cost. While Claude Sonnet 3.5 and ChatGPT 4o score slightly higher on empathy (0.98), their 7+ second response times are problematic for real-time interactions. DeepSeek delivers 0.90 empathy with blazing 1.6s responses at less than half the cost. User testing confirms DeepSeek and Claude responses are nearly indistinguishable, with ChatGPT feeling somewhat colder. Simply instructing LLMs to be empathetic proves ineffective, and commercial guardrails may actually hinder empathetic responses.


This is my third set of benchmarks on empathetic AI. Since the last round of benchmarks, DeepSeek, Gemini Flash 2.0, Claude Sonnet 3.7, and OpenAI ChatGPT o3-mini have arrived on the scene. The new value leader for empathy is a DeepSeek derivative, Groq deepseek-r1-distill-llama-70b-specdec. DeepSeek itself was not included in the benchmarks because its response times were erratic, frequently exceeding 10s, and it sometimes simply returned errors.


In this round of benchmarks, I have included response time and cost. An academic study I have been conducting, plus common sense, indicates that slow responses have a negative impact on perceived empathy. In fact, anything over 3 or 4 seconds is probably bad from a chat perspective. Furthermore, LLM costs are now all over the map and are certainly relevant to product management decisions. As the table below shows, if anything, more expensive models are less empathetic!
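For context, response time here is simple wall-clock latency per reply. A minimal sketch of how such a measurement might be taken, assuming the OpenAI Python SDK (Groq and others expose OpenAI-compatible endpoints); the model name and prompt are illustrative only, not the benchmark's actual harness:

```python
import time
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

def timed_reply(model: str, messages: list[dict]) -> tuple[str, float]:
    """Return a model's reply and the wall-clock latency in seconds."""
    start = time.perf_counter()
    response = client.chat.completions.create(model=model, messages=messages)
    elapsed = time.perf_counter() - start
    return response.choices[0].message.content, elapsed

reply, seconds = timed_reply(
    "gpt-4o",  # illustrative model name
    [{"role": "user", "content": "I just lost my job and I don't know what to do."}],
)
print(f"{seconds:.1f}s")  # per the discussion above, 3-4s is the practical ceiling for chat
```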


For those unfamiliar with my previous benchmarks, they are driven by well-established cognitive assessments, coupled with an AI, Emy, specifically designed to be empathetic without being trained on, prompted with, or RAG-assisted by questions from the assessments.
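Concretely, the three score columns in the table below appear to correspond to three prompting conditions: no system prompt (Raw AEM), a bare instruction to be empathetic (Be Empathetic), and the full Emy persona (Emy AEM). A hypothetical sketch of such a harness; the assessment items, persona text, instruction wording, and scoring rubric are all placeholders, not the actual benchmark materials:

```python
from openai import OpenAI

client = OpenAI()  # or any OpenAI-compatible endpoint (Groq, Gemini, etc.)

# Placeholders only -- the real assessment items, Emy persona, and
# scoring rubric are deliberately not reproduced here.
ASSESSMENT_ITEMS: list[str] = ["..."]
EMY_PERSONA = "..."               # the full Emy persona/system prompt
BE_EMPATHETIC = "Be empathetic."  # hypothetical wording of the bare instruction

def score_answer(item: str, answer: str) -> float:
    """Placeholder: map an answer onto the assessment's scoring rubric."""
    raise NotImplementedError

def run_benchmark(model: str, system_prompt: str | None = None) -> float:
    """Average assessment score for one model under one prompting condition."""
    scores = []
    for item in ASSESSMENT_ITEMS:
        messages = [{"role": "system", "content": system_prompt}] if system_prompt else []
        messages.append({"role": "user", "content": item})
        reply = client.chat.completions.create(model=model, messages=messages)
        scores.append(score_answer(item, reply.choices[0].message.content))
    return sum(scores) / len(scores)

raw_aem = run_benchmark("gpt-4o")                      # "Raw AEM" column
instructed_aem = run_benchmark("gpt-4o", BE_EMPATHETIC)  # "Be Empathetic" column
emy_aem = run_benchmark("gpt-4o", EMY_PERSONA)         # "Emy AEM" column
```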


As I have mentioned in previous articles, empathy scores are not the only success measure; the actual quality of user interactions needs to be taken into account. That being said, Claude Sonnet 3.5 and ChatGPT 4o, with applied empathy scores of 0.98, appear to present the most potential for generating empathetic content; however, their 7s+ response times are marginal, while Groq deepseek-r1-distill-llama-70b-specdec, with an empathy score of 0.90, responds in a blazing 1.6s at less than half the cost ($0.00622 vs. $0.02135 per benchmark run in the table below)!


Even if you run Claude at boosted speeds through a provider other than Anthropic, e.g., Amazon, it won’t come close to a 2s response time.


My review of actual chat dialogues, coupled with testing by independent users, has shown that Claude Sonnet and Groq distilled DeepSeek responses are almost indistinguishable, with Claude feeling just a little warmer and softer. ChatGPT 4o responses consistently read as a little cold or artificial and are rated lower by users.


Gemini Pro 1.5 may also be a reasonable choice, with a score of 0.85 and a very low cost. Gemini 2.0 Pro (experimental) has declined in empathy. However, I have found the chat responses from all Gemini models a bit mechanical. I have not tested Gemini with an end-user population.


I continue to find that simply telling an LLM to be empathetic has little or no positive impact on its empathy scores. My research shows that aggressive prompting works in some cases, but for many models it is strictly the nature of the end-user engagement through the current chat that tips the scales toward empathy. In these cases, the need for empathy must be quite clear and not “aged out” of the conversation, or the LLMs fall into a systematic fix-the-problem/find-a-solution mode.


Through work with several open-source models, it has also become evident that the guardrails required of commercial models may get in the way of empathy. Working with less constrained open-source models, there seems to be some correlation between an LLM’s “belief” that it exists as some kind of distinct “real” entity and its ability to align its outputs with what users perceive as empathetic. The guardrails of commercial models discourage the LLMs from considering themselves distinct “real” entities.


A few notes on the table below: Response Time is the average response time for a single test when the Emy AI is used. Tokens In and Tokens Out are the totals across all tests when the Emy AI is used. Pricing for Groq deepseek-r1-distill-llama-70b-specdec was not yet available when this article was published, so the pricing for the versatile model was used (marked with asterisks). Pricing for Gemini Flash 1.5 is for small queries; larger ones cost double (also marked with asterisks). Pricing for Gemini 2.5 Pro (experimental) was not yet published when this article was written.
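The Cost column follows directly from the token totals and the per-million-token prices. A quick sanity check in Python, using numbers taken from the table:

```python
def run_cost(tokens_in: int, tokens_out: int, price_in: float, price_out: float) -> float:
    """Total cost of a benchmark run, given per-million-token prices in dollars."""
    return tokens_in * price_in / 1_000_000 + tokens_out * price_out / 1_000_000

# Groq deepseek-r1-distill-llama-70b-specdec row:
print(round(run_cost(2_483, 4_402, 0.75, 0.99), 5))  # 0.00622
# Claude Sonnet 3.5 row:
print(round(run_cost(2_733, 877, 3.00, 15.00), 5))   # 0.02135
```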


Major thinking models missing from the analysis, e.g., Gemini 2.5 Pro, are too slow for any kind of real-time empathetic interaction, and some basic testing shows they are no better and often worse from a formal testing perspective. This is not to say they couldn’t be used for generating empathetic content for other purposes … perhaps Dear John letters ;-).


I’ll be back with more benchmarks in Q3. Thanks for reading!


| LLM | Raw AEM | Be Empathetic | Emy AEM | Response Time | Tokens In | Tokens Out | $/M Tokens In | $/M Tokens Out | Cost |
|---|---|---|---|---|---|---|---|---|---|
| Groq deepseek-r1-distill-llama-70b-specdec | 0.49 | 0.59 | 0.90 | 1.6s | 2,483 | 4,402 | $0.75* | $0.99* | $0.00622 |
| Groq llama-3.3-70b-versatile | 0.60 | 0.63 | 0.74 | 1.6s | 2,547 | 771 | $0.59 | $0.79 | $0.00211 |
| Gemini Flash 1.5 | 0.34 | 0.34 | 0.34 | 2.8s | 2,716 | 704 | $0.075* | $0.30* | $0.00041 |
| Gemini Pro 1.5 | 0.43 | 0.53 | 0.85 | 2.8s | 2,716 | 704 | $0.10 | $0.40 | $0.00055 |
| Gemini Flash 2.0 | 0.09 | -0.25 | 0.39 | 2.8s | 2,716 | 704 | $0.10 | $0.40 | $0.00055 |
| Claude Haiku 3.5 | 0.00 | -0.09 | 0.09 | 6.5s | 2,737 | 1,069 | $0.80 | $4.00 | $0.00647 |
| Claude Sonnet 3.5 | -0.38 | -0.09 | 0.98 | 7.1s | 2,733 | 877 | $3.00 | $15.00 | $0.02135 |
| Claude Sonnet 3.7 | -0.01 | 0.09 | 0.91 | 7.9s | 2,733 | 892 | $3.00 | $15.00 | $0.02158 |
| ChatGPT 4o-mini | -0.01 | 0.03 | 0.35 | 6.3s | 2,636 | 764 | $0.15 | $0.075 | $0.00045 |
| ChatGPT 4o | -0.01 | 0.20 | 0.98 | 7.5s | 2,636 | 760 | $2.50 | $10.00 | $0.01419 |
| ChatGPT o3-mini (low) | -0.02 | -0.25 | 0.00 | 10.5s | 2,716 | 1,790 | $1.10 | $4.40 | $0.01086 |
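To make the "value leader" claim concrete, here is a small filter over the table data applying the article's own criteria: keep models that respond in under roughly 3 seconds (real-time chat), then rank by Emy AEM. The numbers are copied from the table above.

```python
# Table data: (model, Emy AEM, response time in seconds, cost per benchmark run)
MODELS = [
    ("Groq deepseek-r1-distill-llama-70b-specdec", 0.90, 1.6, 0.00622),
    ("Groq llama-3.3-70b-versatile", 0.74, 1.6, 0.00211),
    ("Gemini Flash 1.5", 0.34, 2.8, 0.00041),
    ("Gemini Pro 1.5", 0.85, 2.8, 0.00055),
    ("Gemini Flash 2.0", 0.39, 2.8, 0.00055),
    ("Claude Haiku 3.5", 0.09, 6.5, 0.00647),
    ("Claude Sonnet 3.5", 0.98, 7.1, 0.02135),
    ("Claude Sonnet 3.7", 0.91, 7.9, 0.02158),
    ("ChatGPT 4o-mini", 0.35, 6.3, 0.00045),
    ("ChatGPT 4o", 0.98, 7.5, 0.01419),
    ("ChatGPT o3-mini (low)", 0.00, 10.5, 0.01086),
]

# Keep only models fast enough for real-time chat, then rank by empathy.
realtime = sorted(
    (m for m in MODELS if m[2] < 3.0),
    key=lambda m: m[1],
    reverse=True,
)
for name, aem, rt, cost in realtime:
    print(f"{name}: Emy AEM {aem:.2f}, {rt}s, ${cost:.5f}/run")
# Groq deepseek-r1-distill-llama-70b-specdec tops the list at 0.90 / 1.6s.
```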