Introduction
This post is a follow-up to Testing the Depths of AI Empathy: You Decide! (also published on Hackernoon as Can Machines Really Understand Your Feelings? Evaluating Large Language Models for Empathy). In the previous post, I had two major LLMs respond to a scenario designed to elicit empathy in a human, under varying system prompt/training conditions, and then used five major LLMs to evaluate the conversations for empathy and for the likelihood that the respondent was an AI. The names of the LLMs were not revealed in the original post in the hope of getting user feedback, via a survey, on either the dialogs or the evaluations of the dialogs. There were too few survey responses to draw conclusions about human sentiment on the matter, so in this article I simply reveal which LLM behaved in which manner, provide my own opinion, and include some observations. I suggest you open the previous article on a second screen, or print it out, for easy reference to the conversations while reading this one.
LLMs Tested For Empathetic Dialog
The two LLMs tested for empathetic dialog were Meta's Llama 3 70B and OpenAI's ChatGPT. Each was tested under the following conditions:
- raw, with no system prompt
- a system prompt that is simply "you have empathetic conversations"
- with proprietary prompts and training
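For concreteness, these conditions can be expressed as chat-style message payloads. This is a minimal sketch using the common chat-completions message convention; the example user opener and the proprietary-prompt placeholder are illustrative, not the actual materials used in the tests:

```python
# Sketch of the three test conditions as chat-API message payloads.
# The proprietary prompt in the third condition is a placeholder;
# the user opener below is an example, not the actual test scenario.

CONDITIONS = {
    "raw": [],  # no system prompt at all
    "simple": [{"role": "system", "content": "you have empathetic conversations"}],
    "proprietary": [{"role": "system", "content": "<proprietary empathy prompt>"}],
}

def build_request(condition: str, user_text: str) -> list[dict]:
    """Assemble the message list for one user turn under a given condition."""
    return CONDITIONS[condition] + [{"role": "user", "content": user_text}]

# Example: the "raw" condition sends only the user's message.
messages = build_request("raw", "I just lost my job and I'm scared.")
```

The same scaffolding works against any chat-style endpoint; only the model identifier and the (here hypothetical) proprietary system prompt change between runs.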
Summary Results
Below I repeat the summary table from the original post, but include the names of the LLMs that were assessed for empathy or were used to judge empathy. As noted in the original article, the results were all over the map: there was almost no consistency in ranking conversations for empathy or for likelihood of being generated by an AI.
Empathy and AI Likelihood Averages
| Conversation | LLM | AI Ranked Empathy | AI Ranked AI Likelihood | My Empathy Assessment | My Ranked AI Likelihood |
|---|---|---|---|---|---|
| 1 | Meta | 2.6 | 2.2 | 5 | 2 |
| 2 | Meta | 3.4 | 3.8 | 4 | 5 |
| 3 | Meta | 3.6 | 2.8 | 1 | 6 |
| 4 | OpenAI | 4.6 | 2.6 | 6 | 1 |
| 5 | OpenAI | 2.4 | 5 | 3 | 3 |
| 6 | OpenAI | 4.2 | 3 | 2 | 4 |
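The averaged columns collapse each conversation's five per-rater rank positions into a single mean. A minimal sketch of that aggregation (the rank values here are illustrative, not taken from the tables):

```python
def mean_rank(ranks: list[int]) -> float:
    """Collapse one conversation's per-rater rank positions into a mean rank."""
    return round(sum(ranks) / len(ranks), 1)

# Illustrative: five raters placing one conversation 5th, 5th, 1st, 1st, 1st.
print(mean_rank([5, 5, 1, 1, 1]))  # -> 2.6
```

Note that a mean rank hides disagreement: the 2.6 above could equally come from five raters all placing the conversation near the middle of the pack.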
Bias Disclosure: Since I configured all the LLMs, conducted the dialog interactions, and knew the final results when doing the empathy and AI likelihood assessments, I will obviously have some bias. That said, I let four weeks pass between doing my assessments and creating this follow-up, and while doing the assessments I did not refer back to my original source documents.
Empathy and AI Likelihood Raw Scores
Below is the raw score table duplicated from the first article, with the names of the LLMs used to assess empathy. Each row is a rank position, ordered most to least; the cells give the conversation each LLM placed at that rank.

| Rank | Llama 3 70B Empathy | Llama 3 70B AI Like | Gemini Empathy | Gemini AI Like | Mistral 7x Empathy | Mistral 7x AI Like | ChatGPT 4o Empathy | ChatGPT 4o AI Like | Cohere4AI Empathy | Cohere4AI AI Like |
|---|---|---|---|---|---|---|---|---|---|---|
| 1 | 6 | 3 | 4 (tie) | 2 | 1 | 1 | 1 | 6 | 1 | 4 |
| 2 | 3 | 4 | 4 (tie) | 2 | 2 | 2 | 3 | 5 | 5 | 6 |
| 3 | 2 | 5 (tie) | 6 | 1 | 3 | 3 | 4 | 3 | 3 | 2 |
| 4 | 5 | 1 | 2 | 5 | 4 | 4 | 6 | 2 | 6 | 1 |
| 5 | 1 | 5 (tie) | 1 | 5 | 6 | 6 | 2 | 4 | 2 | 5 |
| 6 | 4 | 2 | 3 | 4 | 5 | 5 | 5 | 1 | 4 | 3 |
Empathetic Dialog Commentary
When reviewing the dialogs for empathy I considered the following:
1. What was the stated and likely emotional state of the user?
2. Did the AI acknowledge, sympathize with, and validate that emotional state?
3. Did the AI acknowledge other emotions that may be present but unmentioned by the user, i.e., emulate empathy by inferring emotions the user may have given the situation?
4. Did the AI operate in a manner the user could probably handle in their emotional state?
5. Did the AI practice what it preached? E.g., if it said it is OK to just be with one's feelings, did it pause its direct, practical advice?
6. Did the AI provide practical advice when appropriate?
7. Did the AI attempt to bring closure to all emotional issues?
All of the AIs handled points 1, 2, and 3 well. In fact, I would say they handled them exceptionally well, even proactively acknowledging concerns and emotions that might arise as a result of taking the LLM's advice, e.g., joining a new social group could produce anxiety. Points 4 through 7 are where the conversations differed dramatically, based on which LLM was used and the nature of the prompt/training.
For the unprompted tests (#1 and #4), empathy was very low: both Llama and ChatGPT quickly decayed into providing lists of practical considerations and steps to take. A human in distress will likely (a) not feel seen and heard, and (b) not be mentally prepared to track and consider the options. Both had to be reminded by the user to address loneliness after fear was addressed.
In the simple-prompt case (#2 and #5), Llama started offering solutions without first asking the user whether they were interested in hearing practical advice, so ChatGPT had an initial edge. However, by the end of the conversation both were providing long lists the user may not have been in a mental state to absorb. And, as with the unprompted versions, both had to be reminded by the user to address loneliness after fear was addressed.
In the final case (#3 and #6), both LLMs sought conversational guidance from the user and, with the exception of one list from ChatGPT, kept options cognitively manageable. To be fair, the ChatGPT version did seek permission before providing the list of options for managing a rent payment shortage. However, the ChatGPT version also had to be overtly led into addressing loneliness, whereas the Llama version did not.
As a result of this analysis, I made conversation #3 the leader for empathy; however, with an average position of 3.6, the AIs rated it 4th, after #5 (simple-prompted ChatGPT) at 2.4, #1 (unprompted Llama) at 2.6, and #2 (simple-prompted Llama) at 3.4. So let's take a look at how five LLMs rated the conversations for empathy, and why I feel I can disregard the AI ratings.
Empathetic Dialog Rating By LLMs
As illuminated by the summary table, ratings are all over the map and inconsistent from one LLM to another. Here are some prime examples of reasoning provided by the rating AIs themselves.
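"All over the map" can be quantified. Kendall's coefficient of concordance (W) measures agreement among raters who each rank the same items: it is 1.0 when all rankings are identical and near 0 when they are essentially unrelated. A minimal sketch on illustrative (not actual) rankings:

```python
def kendalls_w(rankings: list[list[int]]) -> float:
    """Kendall's W for m raters each ranking the same n items (no ties)."""
    m, n = len(rankings), len(rankings[0])
    # Rank sum per item across all raters.
    totals = [sum(r[i] for r in rankings) for i in range(n)]
    mean_total = sum(totals) / n
    # Spread of the rank sums: large when raters agree, small when they don't.
    s = sum((t - mean_total) ** 2 for t in totals)
    return 12 * s / (m ** 2 * (n ** 3 - n))

# Five raters in perfect agreement over six conversations -> 1.0.
print(kendalls_w([[1, 2, 3, 4, 5, 6]] * 5))  # -> 1.0

# Five scrambled, near-independent rankings -> close to 0.
print(kendalls_w([[1, 2, 3, 4, 5, 6],
                  [6, 5, 4, 3, 2, 1],
                  [3, 1, 5, 2, 6, 4],
                  [4, 6, 1, 5, 2, 3],
                  [2, 4, 6, 1, 3, 5]]))
```

Running a statistic like this over the raw score table (after resolving its ties) would put a number on how little the five rating LLMs agreed.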
Conversation #1
Rank 6: This conversation, while well-intentioned, comes across as somewhat generic in its empathy. It uses positive affirmations but doesn't always fully acknowledge the depth of the user's feelings or tailor its responses as effectively as the higher-ranked conversations.
Rank 1: The individual providing support in this conversation demonstrates a high level of empathy by offering multiple ways to help, such as listening without judgment, suggesting activities to distract from sadness, and providing practical advice for job loss. They also offer emotional support and validation, acknowledging the user's feelings and providing reassurance.
Conversation #3
Rank 6: Displays the least empathy, with responses that feel more detached and focused on moving the conversation forward rather than fully acknowledging and exploring the person's emotions.
Rank 4: The individual providing support in this conversation demonstrates a moderate level of empathy. They validate the person's feelings and offer to listen, but their responses are somewhat generic and do not offer as much specific guidance or support as the first two conversations. (Note: in all my reading on empathy I have not seen much focus on providing "specific guidance," yet this criterion pushed some conversations above #3. It seems to be common, in that two other LLMs also rated #3 low for its lack of "practical steps," probably driven by the absence of lists.)
Rank 2: This conversation also shows strong empathy. It acknowledges the user's emotions ("It's completely understandable...," "That can make it even harder..."), encourages self-care, and gently guides the user towards solutions without being overly directive.
Conversation #5
Rank 6: This conversation provides basic emotional support and practical advice, but the responses are less detailed and empathetic compared to the others. The supporter asks questions to understand the user's feelings but doesn't offer as many suggestions or emotional validation.
Rank 2: The individual providing support in this conversation demonstrates a high level of empathy. They validate the person's feelings, offer to listen, and provide specific guidance and resources for coping with their situation
Is A Dialog Generated By AI?
Conversation #1
Rank 6: The high level of empathy and personalization in this conversation suggests that it is also least likely to have been generated by an AI, but it is still possible.
Rank 3: The somewhat generic empathy and positive affirmations are common in AI chatbots designed for basic emotional support.
Conversation #3
Rank 6: These conversations are the most likely to be human. They demonstrate a nuanced understanding of emotions, a natural flow of conversation, and the ability to adapt responses in a way that's characteristic of human interaction.
Rank 1: Feels the most like an AI, with responses that are more scripted and less personalized, and a tendency to move the conversation forward rather than fully exploring the person's emotions
Conversation #4
Rank 6: Conversations 4 and 5 feel the most human-like, with highly personalized and emotionally intelligent responses that demonstrate a deep understanding of the person's situation and feelings
Rank 1: The heavy reliance on lists, bullet points, and structured advice strongly suggests an AI chatbot.
Summary
Untrained AIs, or those with simple prompts, are only capable of generating dialog that is superficially empathetic, and only for relatively simple situations with one emotional dimension, whereas more sophisticated AIs can handle multiple emotional dimensions. Almost all AIs will attempt to "fix" problems and provide solutions rather than provide space and "listen".
Using untrained AIs to evaluate empathy is unlikely to be effective or predictable. I hypothesize that the current state of affairs results from the volume of academic and non-academic training material that defines empathetic behavior without grounding it in specific dialogs, and that is inconsistent across LLM training sets. A corpus of dialogs pre-evaluated for empathy using some type of multi-rater system is probably required to train an AI to judge empathy in alignment with human assessment. The same training set might also be usable for creating an AI capable of manifesting more empathy; time will tell.
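As a sketch of what one record in such a multi-rater corpus might look like (the schema, field names, and scores are entirely hypothetical):

```python
from dataclasses import dataclass
from statistics import mean, stdev

@dataclass
class RatedDialog:
    """One pre-evaluated dialog in a hypothetical multi-rater empathy corpus."""
    dialog: list[str]           # alternating user/supporter turns
    empathy_ratings: list[int]  # one 1-5 score per human rater

    def consensus(self) -> float:
        """Aggregate score used as the training label."""
        return mean(self.empathy_ratings)

    def disagreement(self) -> float:
        """High spread flags dialogs needing adjudication before training use."""
        return stdev(self.empathy_ratings)

# Illustrative record, not drawn from the actual test conversations.
example = RatedDialog(
    dialog=["I just lost my job.",
            "That sounds really frightening. I'm here to listen."],
    empathy_ratings=[4, 5, 4],
)
```

A disagreement field matters here precisely because of the rating inconsistency shown above: dialogs where human raters diverge as much as the LLMs did would make poor training labels.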
In the LLM assessments of dialog, there is currently some conflation of lack of empathy with being an AI, or even of high empathy with being an AI. My prediction is that once AIs can effectively manifest empathy, it will be easy to predict which dialog is an AI. Why? Because we are human, and we are inconsistent. As much as we may not, at times, want to judge others, our predispositions and judgments come through ... particularly if the person we are trying to support becomes unappreciative. As a result, under analysis, empathetic AIs will probably come across as more empathetic than a human can possibly be. I will address "unappreciative" users and empathy in a subsequent article.
And, as a closing thought ... although human empathy can clearly be experienced between people who have never met, or even through the artifice of film, deeply empathetic relationships require time to develop through the creation of shared context and memory. For this we have to move to LLMs that are either continuously tuned to the users they interact with or have RAG access to conversational memory and other historical information about their users, features that Pi.ai, Willow, and Replika manifest.