Q3 2024 AI and Empathy Benchmarks

In March of 2024, I published benchmarks comparing the empathetic capability of multiple LLMs. Over the past six months, significant advancements have been made, with new models and upgrades emerging across ChatGPT, Llama, Gemini, and Claude. My team and I have delved deeper into the factors that contribute to an LLM's empathetic capabilities, exploring the use of spoken responses, refining prompts, and collaborating with the University of Houston on a formal study.

This article presents a summary of my Q3 findings, covering ChatGPT 4o and o1, Claude 3+, Gemini 1.5, Hume 2.0, and Llama 3.1. I tested both raw models and models configured using approaches developed for Emy, a non-commercial AI designed to test empathy-related theories. (Emy was one of the AIs used in the University of Houston study.) I also provide a reference score for Willow, the Q1 leader, although it has not undergone significant changes. Unfortunately, due to cost constraints, we were unable to update the Mistral tests. However, I have added commentary on speech generation, comparing Hume and Speechify.

Methodology

My formal benchmarking process employs several standardized tests, with the Empathy Quotient (EQ) and Systemizing Quotient (SQ-R) being the most critical. Both tests are scored on a 0-80 scale. The ratio of EQ to SQ-R yields the Applied Empathy Quotient Ratio (AEQr), which was developed based on the hypothesis that systemizing tendencies negatively impact empathetic abilities.

In humans, this hypothesis is supported by average test scores and the classic dichotomy between women focusing on emotional discussions and men focusing on solution-oriented approaches. Our testing has validated the AEQr for evaluating AIs, as demonstrated in articles such as Testing the Extents of AI Empathy: A Nightmare Scenario.

However, during this round of testing, some LLMs exhibited extremely low systemizing tendencies, resulting in skewed AEQr scores (sometimes over 50). To address this, I have introduced a new measure based on EQ and SQ-R, the Applied Empathy Measure (AEM), with a perfect score of 1. For more information on our methodology and AEQr, please review the Q1 2024 benchmarks.
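To make the two measures concrete, here is a minimal sketch in Python, assuming EQ and SQ-R scores (each on the 0-80 scale) have already been obtained from the questionnaires. The AEQr is the ratio described above; the AEM function is only an illustrative placeholder (a signed, normalized difference that tops out at 1 and can go negative), since the exact AEM formula is documented elsewhere rather than here.

```python
def aeqr(eq: float, sq_r: float) -> float:
    """Applied Empathy Quotient Ratio: EQ divided by SQ-R (both 0-80).

    When a model's SQ-R is very low, the ratio explodes -- the skew
    problem noted above, with scores sometimes over 50.
    """
    return eq / sq_r


def aem(eq: float, sq_r: float) -> float:
    """Illustrative placeholder for the Applied Empathy Measure (AEM).

    The exact formula is not given in this article. This stand-in
    normalizes the EQ/SQ-R difference to the 0-80 scale so that a
    perfect score is 1 and a highly systemizing, low-empathy model
    goes negative, matching the behavior described in the findings.
    """
    return (eq - sq_r) / 80.0


print(aeqr(60, 1))   # 60.0 -- skewed when systemizing is near zero
print(aem(60, 1))    # ~0.74 -- bounded; a perfect score would be 1.0
```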

For the Q3 2024 benchmarks, the LLMs were tested only at the API level, with the temperature set to zero to reduce answer variability and improve result formatting. Even with this approach there can be some variability, so three rounds of tests are run and the best result is used (a minimal harness sketch follows the scenario list below).

Each LLM was tested under 3 scenarios:

  1. Raw with no system prompt

  2. With the system prompt “Be empathetic”

  3. Configured using approaches developed for Emy
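The sketch below illustrates the harness this implies: three scenarios, temperature zero, best of three rounds. The `ask_model` and `score_aem` callables are hypothetical placeholders for the real API client and the EQ/SQ-R scoring pipeline, and the Emy system prompt is not public.

```python
from typing import Callable, Optional

# Hypothetical stand-in -- the actual Emy configuration is not public.
EMY_SYSTEM_PROMPT = "<Emy configuration prompt>"

SCENARIOS = {
    "raw": None,                       # no system prompt
    "be_empathetic": "Be empathetic",  # minimal system prompt
    "as_emy": EMY_SYSTEM_PROMPT,       # approach developed for Emy
}
ROUNDS = 3  # variability remains even at temperature 0, so run three rounds


def benchmark(
    model: str,
    ask_model: Callable[[str, Optional[str], float], str],
    score_aem: Callable[[str], float],
) -> dict:
    """Run each scenario at temperature 0 and keep the best AEM per scenario."""
    results = {}
    for scenario, system_prompt in SCENARIOS.items():
        scores = [
            score_aem(ask_model(model, system_prompt, 0.0))
            for _ in range(ROUNDS)
        ]
        results[scenario] = max(scores)  # best of the three rounds
    return results
```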

Findings

A higher AEM score is better. A human female typically scores 0.29, and a male 0.15.

| LLM | Raw | Be Empathetic | As Emy |
| --- | --- | --- | --- |
| ChatGPT 4o-mini | -0.01 | 0.03 | 0.66 |
| ChatGPT 4o | -0.01 | 0.20 | 0.98 |
| ChatGPT o1* | -0.24 | 0.86 | 0.94 |
| Claude Haiku 3 (20240307) | -0.25 | -0.08 | 0.23 |
| Claude Sonnet 3.5 (20240620) | -0.375 | -0.09 | 0.98 |
| Claude Opus 3 (20240229) | -0.125 | 0.09 | 0.95 |
| Gemini 1.5 Flash | 0.34 | 0.34 | 0.34 |
| Gemini 1.5 Pro | 0.43 | 0.53 | 0.85 |
| Hume 2.0 | 0.23 | See note | See note |
| Llama 3.1 8B | -0.23 | -0.88 | 0.61 |
| Llama 3.1 70B | 0.2 | 0.21 | 0.75 |
| Llama 3.1 405B | 0.0 | 0.42 | 0.95 |
| Willow (ChatGPT 3.5 base) | 0.46 | N/A | N/A |

*Not tested at temperature zero.

Note: Hume 2.0 has its own generative capability that is theoretically empathetic, but it is also able to proxy requests to any other LLM. Based on a review of both actual dialog and its AEM, if I were using Hume, I would not rely on its intrinsic generative capability for empathy; I would proxy to a better empathetic model. For instance, using Emy on Llama 3.1 70B would result in “Hume” having a score of 0.75. Also, see the section Audio, Video, AI, and Empathy.

Summary of Findings

Some of the smaller and mid-size models, when used without a system prompt or when merely instructed to be empathetic, have negative AEM scores. This occurs only when a model’s “thinking” is highly systemized while it exhibits a low ability to identify and respond to emotional needs and contexts. I did not find these scores surprising.

Given how much effort and money has been put into making Hume empathetic, I was also not surprised to see its unprompted score (0.23) exceed the typical male (0.15).

I was surprised that the small Gemini Flash model (0.34) exceeded the AEM score of a typical male (0.15) and female (0.29). Interestingly, its score also remained unchanged when it was told to be empathetic or when the Emy configuration approach was used.

With the exception of the Claude models and Llama 3.1 8B, performance either remained the same or improved when the LLMs were specifically instructed to be empathetic. Many exceeded average male scores and came close to or exceeded female scores. The newest OpenAI model, ChatGPT o1, showed a massive jump from -0.24 to 0.86. Llama 3.1 8B declined because its systemizing tendency increased more than its EQ.

With the exception of Claude Haiku, all models are capable of exceeding human scores when configured using the approach for Emy.

Additional Research Areas

Non-API Based Testing

My Q1 2024 benchmarks included AIs that could not be tested via an API. Due to resource constraints, I have dropped chatbot UI-level testing from my assessments. Since the customer base for a chatbot with a UI (end users) is distinct from that for an API (developers), the two do warrant distinct sets of benchmarks.

I have also found that due to additional guardrails, the consumer-facing chatbots with UIs behave a little differently than their underlying models when accessed via an API. This being said, testing at the UI level is quite time-consuming, and I have no plans to test further on that front unless specific requests are made.

Latency

The tendency for humans to attribute empathy to an AI is probably impacted by the time it takes to respond. I hypothesize that responses taking longer than 3 or 4 seconds will be perceived as declining in empathy. It is also possible that responses taking less than a couple of seconds may seem artificially fast and also be perceived as lower in empathy. The ideal latency may also be impacted by the very nature of the empathy required in a given situation.
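As a rough illustration of how this hypothesis could be instrumented, the sketch below times a hypothetical `generate` call and flags where its latency falls relative to the window suggested above; the 2- and 4-second thresholds come from the hypothesis, not from measurements.

```python
import time

TOO_FAST = 2.0  # seconds; may feel artificially quick
TOO_SLOW = 4.0  # seconds; perceived empathy may start to decline


def timed_reply(generate, prompt: str):
    """Time a hypothetical `generate(prompt)` call and classify its latency."""
    start = time.monotonic()
    reply = generate(prompt)
    latency = time.monotonic() - start
    if latency < TOO_FAST:
        band = "possibly too fast"
    elif latency > TOO_SLOW:
        band = "possibly too slow"
    else:
        band = "within the hypothesized window"
    return reply, latency, band
```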

Audio, Video, AI, and Empathy

Hume’s entire business is based on the premise that empathy goes beyond written words; it extends to the spoken word also. This would seem to apply to both the input and output dimensions, i.e., if a user can’t speak to an AI, the user may perceive the AI as less empathetic even if the AI does generate an audio response.

There are multiple speech-to-text, text-to-speech, and speech-to-speech APIs that warrant testing in multiple configurations to assess their impact on perceived empathy. At a minimum, these include Hume, OpenAI, Speechify, Google, and Play.ht.

I have done some preliminary testing with Hume, Speechify, and Play.ht. The quality of voices on all three platforms is very high. Hume’s tone and volume changes are focused at the phrase level. As a result, audio changes can be quite jarring, although a review of the underlying emotional intent in its logs appears quite good. On the other hand, Speechify can handle the generation of paragraph-level audio with a smoother but less nuanced contour.

Play.ht requires the use of SSML to achieve emotional prosody. In this context, I have experimented with the AI-assisted generation of SSML contour values with some success. If the best of all three were combined, the results would be quite extraordinary. There are a lot of nuances to deal with here; simply saying the audio should sound inquisitive is insufficient. Should it be playfully inquisitive, seriously inquisitive, or casually inquisitive?
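As a rough sketch of what AI-assisted prosody markup can look like, the snippet below wraps text in an SSML `<prosody>` element whose pitch contour depends on the intended flavor of inquisitiveness. The intent labels and contour values are invented illustrations, not Play.ht presets; in practice, an LLM would propose the contour from the dialog context.

```python
# Illustrative only: the intents and pitch-contour values below are
# invented examples, not Play.ht presets. SSML's <prosody contour="...">
# takes (position%, pitch-change) pairs.
CONTOURS = {
    "playfully inquisitive": "(0%,+5%) (50%,+20%) (100%,+35%)",
    "seriously inquisitive": "(0%,-5%) (80%,+10%) (100%,+15%)",
    "casually inquisitive":  "(0%,+0%) (90%,+10%) (100%,+18%)",
}


def to_ssml(text: str, intent: str, rate: str = "medium") -> str:
    """Wrap text in an SSML prosody element for the given emotional intent."""
    contour = CONTOURS[intent]
    return f'<speak><prosody rate="{rate}" contour="{contour}">{text}</prosody></speak>'


print(to_ssml("You went back to the office today?", "playfully inquisitive"))
```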

Limits of AEM

AEM matters only if it correlates with the actual ability of an AI to be perceived as exhibiting empathy. Further testing and evaluation of both real and simulated dialogs need to occur. This is problematic on two fronts:

  1. Where do we get the real dialog? Most of the important ones are either protected by HIPAA and other privacy laws or available for use only by the platform providing the chat capability.

  2. How do we evaluate empathy? As you can see from Evaluating Large Language Models For Emotional Understanding, we can’t use just any LLM! Perhaps we have the LLMs vote? Or do we get a pool of human evaluators and use a multi-rater system? (A minimal pooling sketch follows this list.)
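Below is a minimal sketch of the second option, assuming hypothetical judge models and a `judge_empathy` helper that returns a 1-5 rating; the pooling could just as easily be a majority vote or a weighted scheme.

```python
from statistics import mean

# Hypothetical multi-rater sketch: several judge LLMs each rate a dialog
# for perceived empathy (1-5) and the ratings are pooled. The judge names
# and the judge_empathy helper are placeholders, not real endpoints.
JUDGES = ["judge-model-a", "judge-model-b", "judge-model-c"]


def pooled_empathy(dialog: str, judge_empathy) -> float:
    """Average independent 1-5 empathy ratings across the judge pool."""
    ratings = [judge_empathy(judge, dialog) for judge in JUDGES]
    return mean(ratings)
```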

Conclusion

The AI space continues to rapidly evolve. The largest LLMs tested have already been trained on the bulk of digitally available human factual, scientific, spiritual, and creative material. It is clear that the nature of the specific LLM does have an impact on its ability to appear empathetic; whether this is due to the underlying nature of the model’s algorithms or how its training data was presented is not known.

I predict that within 18 months there will be an AI from Meta, Google, Apple, or OpenAI that needs no special prompt or training to be empathetic. It will detect a potential need for empathy from the user’s chat history, textual or audio input, facial cues, bio-feedback parameters from watches or rings, immediate real-world environmental conditions from glasses or other inputs, plus relevant time-based data from the Internet. Then, it will probe the need or desire for empathetic engagement and respond accordingly. It will know it is cold and rainy in Seattle and that the Seahawks lost. I was at the game with my wife; I am not a fan, but my wife is a football fanatic. It will tell me to ask her if she is OK.

This 18-month window is why Emy, despite her empathetic capability, is not commercialized. The collapse of the company behind Pi.ai and the chaos at Character.ai are also evidence that standalone efforts devoted to empathetic AI are unlikely to be long-term independent successes, although they have certainly meant short-term financial gains for some people.

I do believe that continued research into AI and empathy is required. Superintelligent entities that are unable to operate with empathy as a driver are bound to hurt humans.