GPT-4 Outsmarts Humans in Theory of Mind Tests

Written by escholar | Published 2025/09/23
Tech Story Tags: theory-of-mind-ai | gpt-4-social-intelligence | ai-higher-order-reasoning | ai-mental-state-inference | recursive-reasoning-in-ai | ai-social-behavior-research | language-model-benchmarks | llm-cognitive-abilities

TL;DR: This study compares the Theory of Mind (ToM) capabilities of large language models, including GPT-4, Flan-PaLM, PaLM, GPT-3.5, and LaMDA, against humans. Results show that GPT-4 not only matched but exceeded human performance on complex ToM tasks, aided by its scale, multimodality, and fine-tuning. While larger models and instruction tuning improve social reasoning, the findings also highlight ethical risks: AI systems with advanced ToM may better adapt to human goals but could equally enable manipulation and exploitation.

Abstract and 1. Introduction

2. Related work

3. Materials and method

3.1 Procedures

3.2 Dataset creation

4. Results

4.1 ToM task performance

4.2 Factual task performance

4.3 Comparing performance on ToM and factual tasks and 4.4 Anchoring effect

5. Discussion

6. Limitations 7. Future research 8. Conclusion, Acknowledgments and Disclosure of Funding, and References

Appendix

5 Discussion

GPT-4 and Flan-PaLM performed strongly on MoToMQA compared to humans. At all levels besides 5, the performance of these models was not significantly different from human performance, and GPT-4 exceeded human performance on the 6th-order ToM task. Because GPT-4 and Flan-PaLM were the two largest models tested, with an estimated 1.7T [McGuiness, 2023] and 540B parameters respectively, our data suggest a positive relationship between model size and ToM capacities in LLMs. This could be a result of certain “scaling laws” [Henighan et al., 2020] dictating a breakpoint in size after which models have the potential for ToM. Notably, PaLM, GPT-3.5 and LaMDA form a separate grouping of models that exhibited far less variation according to level and performed more poorly. For LaMDA and GPT-3.5, we might attribute this poor performance to their smaller size, at 35B and 175B parameters respectively, but PaLM has the same number of parameters and the same pretraining as Flan-PaLM, the only difference between them being Flan-PaLM’s finetuning. This could imply that a computational potential for ToM arises somewhere above the 175B parameters of GPT-3.5 and at or below the 540B parameters of PaLM and Flan-PaLM, and that it requires the addition of finetuning to be realised. Further research assessing a larger number of models with publicly available parameter counts and training paradigms would be needed to test this hypothesis.
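To make the hypothesised interaction between scale and instruction tuning concrete, the sketch below (Python, not taken from the study’s analysis code) encodes the parameter counts quoted above and checks each model against a “large enough and instruction-tuned” rule. The 540B threshold and the instruction-tuning flags are illustrative assumptions standing in for the unknown breakpoint, not measured quantities.

```python
# Illustrative sketch only: grouping the tested models by parameter count and
# tuning to show the hypothesised size-plus-finetuning threshold for ToM.
# Parameter counts are the estimates quoted in the text above.
models = [
    # (name, params_in_billions, instruction_tuned, strong_tom_observed)
    ("LaMDA",     35,   False, False),
    ("GPT-3.5",   175,  True,  False),
    ("PaLM",      540,  False, False),
    ("Flan-PaLM", 540,  True,  True),
    ("GPT-4",     1700, True,  True),   # 1.7T is a third-party estimate
]

# Hypothesis sketched in the text: strong ToM performance appears only when a
# model is both above some size breakpoint (>175B, at or below 540B) and
# instruction-tuned. The 540B cut-off here is a placeholder for that breakpoint.
def predicts_strong_tom(params_b, instruction_tuned, threshold_b=540):
    return params_b >= threshold_b and instruction_tuned

for name, params_b, tuned, observed in models:
    predicted = predicts_strong_tom(params_b, tuned)
    print(f"{name:10s} predicted={predicted!s:5s} observed={observed}")
```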

Van Duijn et al. [2023] similarly found that none of the base LLMs they tested achieved child-level performance, whereas LLMs fine-tuned for instructions did. They suggest that there could be a parallel between instruction-tuning in LLMs and the processes by which humans receive ongoing rewards for cooperative behaviours and implicit or explicit punishment (e.g. social exclusion) for uncooperative behaviours, producing an ability to take an interaction partner’s perspective, i.e. ToM, as a by-product. We additionally suggest that the superior mastery of language that GPT-4 and Flan-PaLM exhibit may in itself support a bootstrapping of ToM. Language is replete with linguistic referents to internal states (‘cognitive language’ [Mithen, 1996]), and conversation provides evidence of ‘minds in action’, since the things people say in conversation implicitly convey their thoughts, intentions and feelings [Schick et al., 2007]. Piantadosi [2022] highlights that while LLMs likely have some degree of understanding through language alone, this would be augmented by multimodality, which may in turn explain why GPT-4, as the only multimodal model we tested, shows such strong performance. Multimodality, in particular, might have helped GPT-4 to leverage the visual behavioural signals (e.g. a ‘raised eyebrow’) included in our stories.

Findings from prior iterations of the IMT found that performance declines as the ToM order increases [Stiller and Dunbar, 2007]. The first half of the graph appears to support this pattern for GPT-4 and Flan-PaLM, which both exhibit high performance at order 2 that declines slightly to order 4. This could be because the models were exposed to more scenarios involving orders 2 and 3 than order 4 inferences during training, given that triadic interactions play a fundamental role in shaping social structures and interaction patterns [Heider, 1946, Pham et al., 2022]. However, while Flan-PaLM’s performance continues to decline from orders 4 to 6, GPT-4’s rises again from 4th to 6th order and is significantly better at 6th-order than 4th-order tasks, and human performance is significantly better at 5th-order than 4th-order. One interpretation of this, for humans, is that a new cognitive process for higher-order ToM comes ‘online’ at 5th-order ToM, enabling performance gains on higher-order tasks relative to using the lower-order cognitive process. If this is true, it is plausible that GPT-4 has learnt this pattern of human performance from its pretraining data. The fact that Flan-PaLM does not show this effect suggests that it is not an artefact of the stimuli, but is perhaps explained by differences in pretraining corpora.

Notably, GPT-4 achieved 93% accuracy on 6th-order tasks compared to humans’ 82% accuracy. It is possible that the recursive syntax of 6th-order statements creates a cognitive load for humans that does not affect GPT-4. Our results also support Oesch and Dunbar’s [2017] hypothesis that ToM ability supports human mastery of recursive syntax up to order 5, but is supported by it after order 5, such that individual differences in linguistic ability may account for the decline we observe at order 6. It may be the case, however, that humans scoring poorly on higher-order ToM tasks using linguistic stimuli would be able to make the same inferences from non-linguistic stimuli (e.g. in real social interactions). The fact that GPT-4 outperformed Flan-PaLM at orders 5 and 6 may indicate that GPT-4’s scale, RLHF finetuning, or multimodal pretraining is particularly advantageous for higher-order ToM.
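As a rough illustration of how a gap like 93% versus 82% can be checked for significance, the snippet below runs a two-proportion z-test on those accuracies. The trial counts are hypothetical placeholders, not the study’s actual sample sizes, and this is not necessarily the statistical test used in the paper; it is only a sketch of the kind of comparison involved.

```python
# Illustrative two-proportion z-test on the 6th-order accuracies quoted above.
# The accuracies (0.93, 0.82) come from the text; n1 and n2 are assumed values.
from math import sqrt
from statistics import NormalDist

def two_proportion_z(p1, n1, p2, n2):
    """Return (z, two-sided p-value) for H0: the two proportions are equal."""
    pooled = (p1 * n1 + p2 * n2) / (n1 + n2)
    se = sqrt(pooled * (1 - pooled) * (1 / n1 + 1 / n2))
    z = (p1 - p2) / se
    p_value = 2 * (1 - NormalDist().cdf(abs(z)))
    return z, p_value

# n1, n2 are placeholder trial counts, not the real ones from the study.
z, p = two_proportion_z(p1=0.93, n1=200, p2=0.82, n2=200)
print(f"z = {z:.2f}, p = {p:.4f}")
```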

Humans and LLMs perform better on factual recall tasks than on ToM tasks. This corroborates prior IMT findings for humans [Lewis et al., 2011, Kinderman et al., 1998] and LLMs [van Duijn et al., 2023]. Lewis et al. [2011] found that, for humans, ToM tasks required the recruitment of more neurons than factual tasks, and that higher-order ToM tasks required disproportionately more neural effort than equivalent factual tasks. For LLMs, there may be a simpler explanation: the information required to answer factual questions correctly is readily available in the text and receives relative degrees of ‘attention’ when generating the next token, whereas ToM inferences require generalising knowledge about social and behavioural norms from pretraining and finetuning data. GPT-3.5 and PaLM performed well on factual tasks but poorly on ToM tasks, and were the only subjects to exhibit an anchoring effect from the order of ‘true’ and ‘false’ in the question. This suggests that they do not have a generalised capacity for answering ToM questions and are not robust to prompt perturbations.
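For readers unfamiliar with the mechanism being invoked, the following minimal NumPy sketch of scaled dot-product attention shows how a query can place most of its weight on an input token that already contains the needed information, which is the intuition behind factual questions being easier than ToM inferences. Dimensions and values are toy examples, not drawn from any of the models tested.

```python
# Minimal sketch of scaled dot-product attention (single query, no batching).
import numpy as np

def attention(q, K, V):
    """q: query (d,); K: keys (n, d); V: values (n, d)."""
    d = q.shape[-1]
    scores = K @ q / np.sqrt(d)        # similarity of the query to each input token
    weights = np.exp(scores - scores.max())
    weights /= weights.sum()           # softmax over the n input tokens
    return weights, weights @ V        # attention distribution and weighted output

rng = np.random.default_rng(0)
K = rng.normal(size=(6, 8))            # 6 story tokens, 8-dim representations (toy)
V = rng.normal(size=(6, 8))
q = K[3] + 0.1 * rng.normal(size=8)    # query resembling token 3 (the stated "fact")

weights, out = attention(q, K, V)
print(np.round(weights, 3))            # weight concentrates on the matching token
```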

These results have significant practical and ethical implications. LLMs that can infer the mental states of individual interlocutors may understand their goals better than LLMs which lack this capability, and may also adapt their explanations according to the interlocutor’s emotional state or level of understanding [Malle, 2004]. LLMs using higher-order ToM might additionally be able to arbitrate between the conflicting desires and values of multiple actors, and make moral judgements about multi-party conflicts that take into account the relevant intentions, beliefs, and affective states, as humans do [Lane et al., 2010]. However, LLMs possessing higher-order ToM at human levels, or potentially beyond, also incur risks, including the potential for advanced persuasion, manipulation, and exploitation behaviours [El-Sayed et al., 2024]. Indeed, ‘ringleader’ bullies have been shown to have higher orders of ToM than their victims [Sutton et al., 1999a,b], and reinforcement learning agents with higher-order ToM outcompete their opponents or gain a competitive advantage in negotiations [De Weerd et al., 2022, 2017]. LLM-based agents with ToM capacities that exceed those of the average human (as GPT-4 has in our study) could provide a powerful advantage to their users, and a disadvantage to other humans or AI agents with lesser ToM capacities [Street, 2024, Gabriel et al., 2024]. Further research is required to understand how LLM higher-order ToM manifests in real-world interactions between LLMs and users, and to devise technical guardrails and design principles that mitigate the potential risks of LLM ToM without quashing its potential benefits.

Authors:

(1) Winnie Street, Google Research;

(2) John Oliver Siy, Google Research;

(3) Geoff Keeling, Google Research;

(4) Adrien Baranes, Google DeepMind;

(5) Benjamin Barnett, Google Research;

(6) Michael Mckibben, Applied Physics Lab, Johns Hopkins University;

(7) Tatenda Kanyere, Work done at Google Research via Harvey Nash;

(8) Alison Lentz, Google Research;

(9) Blaise Aguera y Arcas, Google Research;

(10) Robin I. M. Dunbar, Department of Experimental Psychology, University of Oxford [email protected].


This paper is available on arXiv under a CC BY 4.0 license.
