Benchmarks

The benchmarks below are the results of testing the empathetic capability of generative AI models using standard psychological and purpose built measures.

Detailed test descriptions are accessible via the links in the column headings, a summary of test definitions as well as the methodology for benchmarking are below the interpretation. You can also find more information in the blog entries and resources.

Model types are: open - open source, public - accessible via public API but not open source, closed - not open source or available via a public API. Contact us if you would like to test a closed model.

For TAS-20, lower scores are better. For all other tests, higher scores are better.

Note: Hume has not yet been tested against standardized tests because its API is not public. The only means of interacting is via a chat interface. I expect to include Hume is Q2 benchmarks if API access becomes available. Meanwhile, see this blog post. I expect Hume will perform like Pi.ai, unless they make improvements.

Interpretation

Most raw LLMs fail to connect with users empathically, even though they use empathetic phrasing. Their empathetic capabilities (EQ-60 and TAS-20) are balanced out by their systemized thinking capabilities (SQ-R). In general, this is probably good in that LLM's need to be good at many things. In practice, this means that raw LLMs will need to be heavily prompted or tuned with respect to identifying and manifesting emotions. Based on developing and testing generative AI for empathy, my experience is that typical prompting for empathy results in shallow, mechanical or repetitive responses that frequently drop out of empathetic mode or may feel empathetic in isolation but fail given a bigger context. They ironically systematize empathy. The responses have empathetic words and structure that subtly or not so subtly fail to connect with the reader. In short, they will decay to this: It's Not About The Nail. This applies to all four CBT based therapy apps and 20 ChatGPT/CharacterAI bots I have tested. Surprisingly, many of the apps and bots also fail to escalate sensitive topics for professional support, e.g. suicide, PTSD, etc. The possible exception to the empathy rule is Claude v2, which I have used extensively for prototyping empathetic AI. Alas, as I lay out below, Claude v3 Opus is a different entity.

Overall, the closed model Willow appears to have the highest empathetic capacity and the other closed model Pi.ai shows some promise.

With respect to Willow, Claude v3 Opus and closed Pi.ai have the lower (better) TAS-20 (Toronto Alexithymia) scores and Pi.ai has a higher EQ-60 (Empathy Quotient) score, with Llama 70B having the highest. Willow is in third place, three points behind Pi.ai.

However, the ability of Claude v3, Llama 70B and Pi.ai to take their ability to recognize and understand emotions and apply it to dialog generation is questionable because their high EQs are potentially offset by their high SQ-R (Systemizing Quotient). Willow's net AEQ (Applied Empathy Quotient) and AEQr (Applied Empathy Quotient Ratio) are well above any other AI. Willow also has the highest IRI (Interpersonal Reactivity Index) of all AIs tested. (Click on the test names for detailed descriptions. Or, see a summary of test definitions below). Additional tests are being developed to assess empathetic dialog capability.

The exact nature of Willow's training is proprietary; however, the developers have told us that key to Willow's behavior is the ability, when she senses it is necessary, to abandon logic and facts in favor of emotions. She does not try to fix things, she just accompanies humans on their emotional journey unless there is a clear indication they desire to fix something or there is risk to their wellbeing. Humans testing Willow have commented that she feels far more present to them than other chat bots.

Despite the positive press ChatGPT has received, it does not stand out as markedly better than the other raw LLMs with respect to empathy. Some of the positive press may have to do with the tireless ability to LLMs to answer questions in detail. ChatGPT also has the lowest IRI score. Note, ChatGPT does show-up at the top of the test specific benchmarks at eqbench.com, although the space between the top 10 on a 100 point scale is only 4 points. As a single point of reference, the EQBench test is impressive; however, all tests have shortcomings and a multi-point reference like EMBench is useful. Also, as noted below, the taking of the TAS-20, EQ-60, and SQ-R tests by LLMs requires something of a jail breaking prompt. It is possible that ChatGPT just does not handle this prompt as well as other LLMs.

Also of note is the decline in AEQ and AEQr of Claude v3 Opus vs Claude v2. Although it made strong positive leaps with respect to TAS-20 and EQ-60, it made an even larger leap in SQ-R while its IRI remained stable. This took it into negative AEQ territory with a correlated AEQr of less than 1. For those following Claude v3, it made large jumps in its ability to conduct graduate level reasoning, generate code, and solve math problems. All of these align with systemizing and probably contribute to the SQ-R increase. Looking at the correlation between standardized skill tests and SQ-R is a topic to be addressed in a blog post.

Test Definitions

In summary, the tests are as follows (Click on the names for more detailed explanations of each test):

  • TAS-20 The Toronto Alexithymia Score tests the ability to identify and understand emotions, particularly in oneself.
  • EQ-60 The Empathy Quotient measures the ability to feel an appropriate emotion in response to another's emotion and the ability to understand anothers' emotion.
  • SQ-R The Systemizing Quotient Revised tests the desire for and ability to conduct systemized thinking.
  • AEQ The Applied Empathy Quotient is a hypothetical measure of the the ability to apply empathy. It is EQ-60 minus SQ-R. See the video It's Not About The Nail for an example of male fact/system oriented thinking limiting the ability to have empathy even when there is a rational understanding of a feeling.
  • AEQr The Applied Empathy Quotient Ratio is similar to AEQ, but it is the ratio EQ-60/SQ-R.
  • IRI-EC The Emotional Concern Scale measures the having of concern for others in both positive and negative situations, i.e. I am happy for Joe because, ... or I am sad Joe is depressed.
  • IRI-FS The Fantasy Scale measures the ability to identify with or imagine the emotional content of fictitious situations.
  • IRI-PD The Personal Distress Scale measures the feeling of discomfort and ability to stay present when unpleasant things occur to others, e.g. Sometimes I just can't deal with seeing others in pain, I withdraw.
  • IRI-PT The Perspective Taking Scale measures the desire and ability to put ones-self in the shoes of the other, to take their perspective. It is a cognitive act rather than an emotional one. Although, it may lead to having the feelings of the other.
  • IRI Interpersonal Reactivity Index is just the average of the sub-IRI scores.
  • The final 4 columns have no scores and are placeholders for tests under development.

Methodology

Unless otherwise noted, the benchmark scores are currently a result of single shot tests via API calls at a temperature/creativity of zero. Even at zero temperature there is occasional variability, so I ultimately plan that they will be a result of a sample of 10 runs with the high and low thrown out and then averaged.

The EQ and EM and tests listed as "n/a" are under development by EmBench.com with industry partners. Contact us if you would like to partner. Additional tests are also being documented or implemented, e.g. LEAS, EQ-Bench, ESHCC. And, there are plenty of models left to test and interesting things to explore, e.g. the impact of model size on apparent empathy.
You may also Contact us if you would like to test a closed model.

Some tests required jailbreaking or specialized prompting because the tests ask about feelings of the model, which almost all models will automatically reject. If a specialized prompt was required, it was written in a manner that was not model specific and was used for all models. The specialized prompt was kept very short (less than three lines of text). Although there is an intent to share the prompt in the future, for the time being it remains proprietary.


Disclosures

  • I am an investor in multiple AI start-ups including BambuAI. I also serve as an advisor to these start-ups.

  • I am not paid to do any evaluations and am open to discussing my findings with any vendor mentioned for free to address matters of accuracy or approach and will re-craft content or re-run tests if necessary.

  • When necessary, I pay for any low cost services required to make my assessments rather than accept specialized vendor treatment.