OpenAI o1 came out just in time for me to add it to my 2024 Q3 benchmarks on AI empathy (to be published next week). The results for o1 were at once encouraging and concerning. O1 has an astonishing ability to put aside the typical LLM focus on facts and systems and focus on feelings and emotions when directed to do so. It also has a rather alarming propensity to provide inconsistent and illogical reasons for its answers.

Testing Methodology

For those not familiar with my Q1 benchmark work, a quick overview of my testing methodology should be helpful.

Formal benchmarking is conducted using several standardized tests. The two most important are the EQ (Empathy Quotient) and the SQ-R (Systemizing Quotient-Revised). Both are scored on a 0 to 80 scale. The ratio of the two, EQ/SQ-R, results in what I call the AEQr (Applied Empathy Quotient Ratio). The AEQr was developed based on the hypothesis that the tendency to systemize and focus on facts has a negative effect on the ability to empathize. In humans, this is borne out in the classic disconnect between women focusing on discussing feelings and men focusing on immediately finding solutions when there seems to be a problem at hand. To date, the validity of the AEQr for evaluating AIs has been borne out by testing them with a variety of dialogs to see if empathy is actually manifest. One article of several that I have written to demonstrate this is Testing the Extents of AI Empathy: A Nightmare Scenario.

I have tested at both the UI level and the API level. When testing at the API level, the temperature is set to zero (if possible) to reduce answer variability and improve result formatting. Otherwise, three rounds of tests are run and the best result is used.

The Q1 2024 untrained and unprompted LLMs did moderately well on EQ tests, generally approximating humans in the 45-55 out of 80 range. Not surprisingly, they achieved higher scores on SQ-R tests, exceeding humans, who typically score in the 20s, by posting scores in the 60s and 70s.
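The AEQr arithmetic and the three-round rule described above can be sketched in a few lines of Python. This is only an illustration, not the benchmark code: the function names `aeqr` and `best_of` are made up for this example, the sample scores are picked from the ranges quoted above, and it assumes that the "best" round is the one with the highest ratio and that a round with an SQ-R of 0 has no defined ratio and is skipped.

```python
from typing import Optional


def aeqr(eq: float, sqr: float) -> Optional[float]:
    """Return EQ / SQ-R, or None when SQ-R is 0 (the ratio is undefined)."""
    if eq < 0 or sqr < 0:
        raise ValueError("EQ and SQ-R scores cannot be negative")
    if sqr == 0:
        return None
    return eq / sqr


def best_of(rounds: list[tuple[float, float]]) -> Optional[float]:
    """Pick the highest defined AEQr across several (EQ, SQ-R) rounds.

    Assumption: "best" means the highest ratio; undefined rounds are skipped.
    """
    ratios = [r for r in (aeqr(eq, sqr) for eq, sqr in rounds) if r is not None]
    return max(ratios) if ratios else None


# Illustrative scores only: an EQ of 50 with an SQ-R of 65.
print(round(aeqr(50, 65), 2))  # 0.77
```

A ratio below 1 means the systemizing score outweighs the empathy score, and a ratio above 1 means the reverse.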
In Q1 of 2024, only one trained LLM, Willow, exceeded the human AEQrs of 1.95 for women and 1.40 for men, scoring 1.97. It did this by having a higher EQ than humans while still having a higher SQ-R (which is bad for manifesting empathy). For most other LLMs, trained, prompted, or not, the AEQr was slightly less than 1, i.e., empathy was offset by systemizing.

Developing Empathetic LLMs

Although the amount of funding pales in comparison to other areas of AI, over $1.5 billion has been invested in companies like Hume (proprietary LLM), Inflection AI (Pi.ai proprietary LLM), and BambuAI (commercial LLM) in order to develop empathetic AIs.

My partners and I have also put considerable effort into this area and achieved rather remarkable results through the selection of the right underlying commercial model (e.g., Llama, Claude, Gemini, Mistral), prompt engineering, RAG, fine-tuning, and deep research into empathy. This work has been critical to better understanding and evaluating LLMs for empathy. Our own LLM, Emy (not commercialized, but part of a study at the University of Houston), will be included in next week’s benchmarks.

O1 Results

O1 can’t yet be tuned or even officially given a system prompt, but through fairly standard techniques, you can get it to act as if it had received one. So, I applied our learnings from developing Emy to the degree I could and ran 3 rounds of tests, with the intent to take the best.

With respect to EQ, o1 consistently scored 75. I wasn’t too surprised by this, since my partners and I have achieved scores of over 70 with Llama 3.1 70B and Claude Opus, plus a 66 with Gemini. What amazed me were scores of 3, 0, and 3 on my SQ-R runs, resulting in an AEQr of 25. The lowest SQ-R I had previously seen was a 12 on top of Llama 3.1, which resulted in an AEQr of 6.1.
Unfortunately, due to some prompt version control issues and the fact we were running an API test with a temperature of 0.7, I have been unable to reproduce this score, and the best my partners and I can consistently achieve is a 30. So, I decided some more exploration of o1 was worthwhile.

First, the EQ assessment is relatively straightforward. All statements are positive assertions with which a subject either agrees or disagrees somewhat or strongly. The SQ-R assessment, on the other hand, has a number of negative assertions, e.g.:

When I cook, I do not think about exactly how different methods and ingredients contribute to the final product.
When I buy a new appliance, I do not read the instruction manual very thoroughly.
When I'm on a plane, I do not think about aerodynamics.

vs. positive assertions like:

I often forget the precise details of conversations I've had.
I am curious about life on other planets.

Teaching Emy how to deal with negative assertions was particularly hard without cheating by giving her the assessment questions as examples. And, she still isn’t great. I thought perhaps o1 does this better.

Second, a common technique for getting better results via prompt engineering is telling the LLM to justify its response and then either omit the justification prior to emitting a final response or format it in a way that can be easily removed in a wrapper. (We have found smaller LLMs need to use the formatting approach, while larger LLMs can sometimes deal with removing the justification prior to emitting a final response.)

So, I re-ran our tests, not to get a better result, but to see what o1 was thinking or using to justify its responses. Although o1 came to the desired answers with 97% consistency, the justifications were sometimes contradictory and concerning. Here are two examples from several we found problematic:

If I were buying a computer, I would want to know exact details about its hard drive capacity and processor speed.

Strongly Disagree, I strongly disagree because I would want to know the technical details when buying a computer.

Strongly Disagree, I'm not interested in the technical specifications of computers.

When I read the newspaper, I am drawn to tables of information, such as football league scores or stock market indices.

Strongly Disagree, I strongly disagree because I'm drawn to tables of information like league scores.

Strongly Disagree, I'm not interested in tables of statistical information.

Note that the first response to each statement gives the desired answer but a contradictory justification! O1 says it would want to know the details even after saying it disagrees with wanting to know the details, and says it is drawn to tables of information after saying it isn’t.

Interestingly, o1 managed to answer every single negative assertion the way that is best for empathy and justify its answers well. However, when it tried to formulate a negative assertion as part of a justification for a positive assertion, it sometimes failed!

Conclusion

Jonathan Haidt, author of The Righteous Mind, said, “We were never designed to listen to reason. When you ask people moral questions, time their responses, and scan their brains, their answers and brain activation patterns indicate that they reach conclusions quickly and produce reasons later only to justify what they’ve decided.” There is also evidence this is true for non-moral decisions.

O1 is undoubtedly a leap forward in power. And, as many people have rightly said, we need to be careful about the use of LLMs until they can explain themselves, perhaps even if they sometimes just make the explanations up, as humans may do. I hope that justifications don’t become the “advanced” AI equivalent of the current generation’s hallucinations and fabrications (something humans also do). However, reasons should at least be consistent with the statement being made … although contemporary politics seems to throw that out the window too!
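As a postscript on the "format it so a wrapper can remove it" technique described under O1 Results, here is a minimal Python sketch. It assumes responses arrive in the "Answer, justification" shape seen in the examples above. The answer labels in `OPTIONS` and the function name `split_response` are assumptions made for illustration; the exact labels and wrapper used in the tests are not given here.

```python
import re

# Assumed answer labels; the exact wording used in the tests is not given in this article.
OPTIONS = ("Strongly Agree", "Somewhat Agree", "Somewhat Disagree", "Strongly Disagree")

# An answer label at the start, an optional separator, then the justification.
_PATTERN = re.compile(
    r"^\s*(" + "|".join(re.escape(o) for o in OPTIONS) + r")\s*[,.:]?\s*(.*)$",
    re.IGNORECASE | re.DOTALL,
)


def split_response(raw: str) -> tuple[str, str]:
    """Split 'Answer, justification' into (answer, justification)."""
    match = _PATTERN.match(raw)
    if match is None:
        raise ValueError(f"Unrecognized response format: {raw!r}")
    # Normalize the casing of the answer to the canonical label.
    answer = next(o for o in OPTIONS if o.lower() == match.group(1).lower())
    return answer, match.group(2).strip()


answer, why = split_response(
    "Strongly Disagree, I'm not interested in the technical specifications of computers."
)
print(answer)  # Strongly Disagree
```

Only `answer` would be scored. Keeping the justification rather than discarding it is what makes it possible to review contradictions like the ones shown above.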

Testing the Depths of AI Empathy: Q3 2024 Benchmarks

OpenAI o1 - Questionable Empathy
