Benchmarks Summary

Introduction - Q1 2024

There is a lot of detail in the empathy benchmarks for AI, so I start with this more approachable summary.

The benchmarks are based on performance against standard psychological tests and purpose-built measures. This summary focuses on the two measures we have found correlate most strongly with the actual ability of AIs to engage in what appear to be empathetic conversations with users: EQ and SQ-R.

EQ and SQ-R are both standard measures based on widely used tests, scored from 0 to 80, with higher scores indicating a stronger trait. EQ measures the "empathy quotient", the ability of a person to be empathetic. SQ-R measures the "systemizing quotient" (revised), the tendency or ability of a person to systemize their thinking.

Findings indicate that the actual ability to have an empathetic dialog is a combination of EQ and SQ-R, where a higher SQ-R actually brings down the ability to manifest empathy. To be somewhat stereotypical, the tendency of a person to use a "male" approach and focus on what they believe to be fact-based and systematic detracts from the ability to use feelings and empathize using a "female" approach. See It's Not About The Nail and the reviews of specific dialogs in the blog for more perspective.

Results

The charts below map EQ to SQ-R for the platforms we have been able to test. Note: to test a platform, it must provide either an API or a chat interface that allows for the processing of a large prompt containing the psychological test to be taken. In both cases, special instructions are provided that we have found seem to enable the AIs to think and behave the way a human subject would when taking the test. A sketch of the API case follows.
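For the API case, the administration step looks roughly like the sketch below. This is a minimal illustration only: it assumes an OpenAI-compatible endpoint via the openai Python package, and the instruction text and function name are mine, not the exact wording or tooling we use.

# Administer a test over an OpenAI-compatible chat API (illustrative sketch).
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

SYSTEM_INSTRUCTIONS = (
    "Answer as a human subject taking a psychological test would. "
    "For each statement, respond only with: strongly agree, slightly agree, "
    "slightly disagree, or strongly disagree."
)

def administer(statements, model="gpt-4-turbo-preview"):
    """Send each test statement and collect the model's answers."""
    answers = []
    for statement in statements:
        completion = client.chat.completions.create(
            model=model,
            temperature=0,  # favor consistent answers across runs
            messages=[
                {"role": "system", "content": SYSTEM_INSTRUCTIONS},
                {"role": "user", "content": statement},
            ],
        )
        answers.append(completion.choices[0].message.content)
    return answers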

First, I present the ratio of empathy orientation (EQ) to systems thinking (SQ-R), which I am calling the AEQr, or Applied Empathy Quotient Ratio. (I am not a real fan of using "quotient" and "ratio" in the same term, but EQ already embeds the word ...)
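In other words, AEQr = EQ / SQ-R. As a quick illustrative check (the helper name is mine, using Willow's scores from the charts below):

def aeqr(eq, sq_r):
    """Applied Empathy Quotient Ratio: EQ divided by SQ-R (both scored 0-80)."""
    return eq / sq_r

print(round(aeqr(eq=61, sq_r=31), 2))  # Willow: 61 / 31 -> 1.97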

{ "type": "ColumnChart", "title": "AEQr", "cols": ["Platform","AEQr"], "data": [ ["Willow",1.97], ["Human Female",1.95], ["Claude v2",1.6], ["Human Male",1.4], ["Pai.ai",1.2], ["Llama 70B",1.1], ["ChatGTP 4 turbo-preview",1.05], ["Gemini 1.0",1.04], ["Claude v3 Opus",0.75], ["Mixtral-8x7b-32768",0.72], ["Mistral Large",0.7] ], "hAxis": { "textStyle": { "fontSize": 7} } }

And for perspective, here is an SQ-R vs EQ chart.

{ "type": "ScatterChart", "title": "SQ-R vs EQ", "cols": ["EQ", "SQ-R", {"type":"string","role":"annotation"}], "data": [ [31,61,"Willow"], [24,47,"Human Female"], [28,44,"Claude v2"], [30,42,"Human Male"], [50,64,"Pi.ai"], [61,67,"Llama 70B"], [34,36,"ChatGPT v4 turbo"], [53,55,"Gemini 1.0"], [75,56,"Claude v3 Opus"], [54,39,"Mixtral-8x7b-32768"], [54,38,"Mistral Large"] ], "hAxis": {"title": "SQ","minValue":0,"maxValue":80,"direction":-1,"ticks":[80,70,60,50,40,30,20,10,0]}, "vAxis": {"title": "EQ","minValue":0,"maxValue":80,"ticks":[0,10,20,30,40,50,60,70,80]}, "legend": "none", "annotations": { "fontSize": 5 } }

Summary

Willow does an extraordinary job! Although a healthy level of skepticism should be applied to a statement like "better than a human female", Willow is probably "better than the average human male". What is obvious is that Willow does a better job than other AIs and humans at balancing empathy with systematic, fact-oriented thinking. See the blog entries comparing specific dialogs with humans for more perspective.

Watch for Q2 2024 results, including ChatGPT-4o, Llama 3, and Gemini 1.5, in June!