Can AI Think About Thinking?

Written by escholar | Published 2025/09/23
Tech Story Tags: theory-of-mind-ai | gpt-4-social-intelligence | ai-higher-order-reasoning | ai-mental-state-inference | recursive-reasoning-in-ai | ai-social-behavior-research | language-model-benchmarks | llm-cognitive-abilities

TL;DR: This article explores whether large language models (LLMs) like GPT-4 and Flan-PaLM can demonstrate higher-order Theory of Mind (ToM), the ability to reason recursively about others’ beliefs and intentions. Using a new benchmark called Multi-Order Theory of Mind Q&A (MoToMQA), the study finds that GPT-4 achieves adult-level ToM performance and even surpasses humans on 6th-order reasoning tasks. These results suggest that LLMs are developing generalized ToM abilities, with important implications for AI’s role in human cooperation, competition, and user-facing applications.

Authors:

(1) Winnie Street, Google Research;

(2) John Oliver Siy, Google Research;

(3) Geoff Keeling, Google Research;

(4) Adrien Baranes, Google DeepMind;

(5) Benjamin Barnett, Google Research;

(6) Michael Mckibben, Applied Physics Lab, Johns Hopkins University;

(7) Tatenda Kanyere, Work done at Google Research via Harvey Nash;

(8) Alison Lentz, Google Research;

(9) Blaise Aguera y Arcas, Google Research;

(10) Robin I. M. Dunbar, Department of Experimental Psychology, University of Oxford.

Table of Links

Abstract and 1. Introduction

2. Related work

3. Materials and method

3.1 Procedures

3.2 Dataset creation

4. Results

4.1 ToM task performance

4.2 Factual task performance

4.3 Comparing performance on ToM and factual tasks and 4.4 Anchoring effect

5. Discussion

6. Limitations

7. Future research

8. Conclusion, Acknowledgments and Disclosure of Funding, and References

Appendix

Abstract

This paper examines the extent to which large language models (LLMs) have developed higher-order theory of mind (ToM): the human ability to reason about multiple mental and emotional states in a recursive manner (e.g. I think that you believe that she knows). This paper builds on prior work by introducing a handwritten test suite – Multi-Order Theory of Mind Q&A – and using it to compare the performance of five LLMs to a newly gathered adult human benchmark. We find that GPT-4 and Flan-PaLM reach adult-level and near adult-level performance on ToM tasks overall, and that GPT-4 exceeds adult performance on 6th-order inferences. Our results suggest that there is an interplay between model size and finetuning for the realisation of ToM abilities, and that the best-performing LLMs have developed a generalised capacity for ToM. Given the role that higher-order ToM plays in a wide range of cooperative and competitive human behaviours, these findings have significant implications for user-facing LLM applications.

1 Introduction

Theory of Mind (ToM) is the ability to infer and reason about the mental states of oneself and others [Premack and Woodruff, 1978, Wimmer and Perner, 1983, Wellman et al., 2001]. ToM is central to human social intelligence: it enables humans to predict and influence behaviour [Humphrey, 1976, Wellman and Bartsch, 1988, Hooker et al., 2008].

Large Language Models (LLMs) exhibit some ToM competency [Kosinski, 2023, Bubeck et al., 2023, Shapira et al., 2023]. Most of the literature on LLM ToM has focused on 2nd-order ToM [Sap et al., 2022, Kosinski, 2023, Gandhi et al., 2024, Shapira et al., 2023], where the ‘order of intentionality’ (hereafter, ‘order’) is the number of mental states involved in a ToM reasoning process (e.g. the third-order statement “I think you believe that she knows”). Yet LLMs are increasingly deployed in multi-party social interaction contexts that require them to engage in higher-order ToM reasoning [Wang et al., 2023, Park et al., 2023].
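To make the notion of ‘order’ concrete, here is a minimal sketch (our own illustration, not material from the paper) that builds an nth-order statement by nesting mental-state clauses around a base fact; the agents, verbs, and base fact are arbitrary placeholders.

```python
# Illustrative sketch: the order of intentionality equals the number of
# nested mental-state clauses in the statement.
MENTAL_STATE_VERBS = ["think", "believe", "know", "want", "hope"]
AGENTS = ["I", "you", "Sally", "Anne", "Tom", "Maria"]

def nth_order_statement(n: int, fact: str = "the keys are in the drawer") -> str:
    """Nest n mental-state clauses around a base fact."""
    clause = fact
    for i in range(n - 1, -1, -1):
        agent = AGENTS[i % len(AGENTS)]
        verb = MENTAL_STATE_VERBS[i % len(MENTAL_STATE_VERBS)]
        if agent not in ("I", "you"):
            verb += "s"  # third-person singular agreement
        clause = f"{agent} {verb} that {clause}"
    return clause

print(nth_order_statement(3))
# -> I think that you believe that Sally knows that the keys are in the drawer
```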

In this paper, we examine LLM ToM from orders 2-6. We introduce a novel benchmark: Multi-Order Theory of Mind Question & Answer (MoToMQA). MoToMQA is based on a ToM test designed for human adults [Kinderman et al., 1998], and involves answering true/false questions about characters in short-form stories. We assess how ToM order affects LLM performance, how LLM performance compares to human performance, and how LLM performance on ToM tasks compares to performance on factual tasks of equivalent syntactic complexity. We show that GPT-4 and Flan-PaLM reach adult-level and near adult-level performance on ToM tasks, respectively.
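As a rough illustration of the shape of such a benchmark item, the sketch below shows one way a MoToMQA-style story/statement pair could be represented and turned into a true/false prompt. The field names, story, and statement are hypothetical and are not drawn from the actual MoToMQA dataset.

```python
# Hypothetical representation of a MoToMQA-style item (our illustration,
# not the benchmark's actual schema or data).
from dataclasses import dataclass
from typing import Literal

@dataclass
class MoToMQAItem:
    story: str                                 # short-form story shown to the model
    statement: str                             # statement to be judged true or false
    statement_type: Literal["tom", "factual"]  # ToM inference vs. factual statement
    order: int                                 # order of intentionality (2-6) for ToM items
    label: bool                                # ground-truth answer

example = MoToMQAItem(
    story=("Anna told Ben that the meeting had moved to Friday, "
           "but Ben had already heard from Clara that it was cancelled."),
    statement="Ben thinks that the meeting has been cancelled.",
    statement_type="tom",
    order=2,   # the respondent's inference about Ben's belief: 2nd order in the paper's convention
    label=True,
)

prompt = f"{example.story}\nTrue or false: {example.statement}"
print(prompt)
```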

2 Related work

2.1 Higher-order ToM

Human adults are generally able to make ToM inferences up to 5 orders of intentionality (e.g. I believe that you think that I imagine that you want me to believe) [Kinderman et al., 1998, Stiller and Dunbar, 2007, Oesch and Dunbar, 2017].[1] Higher-order ToM competency varies within the population, including by gender [Hyde and Linn, 1988, Stiller and Dunbar, 2007], and is not deployed reliably across all social contexts [Keysar et al., 2003]. ToM at higher orders is also positively correlated with social complexity. Tracking the beliefs and desires of multiple individuals at once facilitates group negotiations, group bonding, and distinctly human behaviours and cultural institutions, including humour, religion and storytelling [Corballis, 2017, Dunbar, 2003, Fernández, 2013].

2.2 LLM ToM

Kosinski [2023] argued for spontaneous ToM emergence in LLMs based on GPT-4’s success on a suite of tasks inspired by the classic Sally-Anne task.[2] Ullman [2023] challenged this claim, demonstrating decreased performance with minor task perturbations. Further experiments involving benchmark suites like BigToM [Gandhi et al., 2024] and SocialIQa [Sap et al., 2022] show mixed results in LLM ToM capabilities. For example, Shapira et al. [2023] found success on some tasks but failure on others, suggesting that existing ToM capabilities in current state-of-the-art LLMs are not robust. To our knowledge, only two other studies have explored LLM ToM at higher orders. He et al. [2023] assessed orders 0-4 (equivalent to our orders 2-5) and van Duijn et al. [2023] compared LLM performance with that of children aged 7-10 on two stories adapted from unpublished Imposing Memory Task (IMT) stories. Our study adds to this work by testing one order higher than He et al. [2023], by utilising a larger and entirely new set of handwritten stories and statements that we are certain models were not exposed to during pretraining,[3] and by using log probabilities (logprobs) outputted for candidate tokens as the measure of the LLMs’ preferred responses. Using logprobs adds robustness to our data because it takes into account multiple ways in which the model could provide the correct response.
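A minimal sketch of this kind of scoring (our own illustration, not the authors’ code): given a dictionary of candidate next tokens and their log probabilities, however they are obtained from a particular model API, probability mass is summed over surface variants of “true” and “false”, which is the sense in which logprobs accommodate multiple ways of giving the correct response.

```python
# Illustrative sketch of logprob-based true/false scoring (not the authors' code).
import math

TRUE_TOKENS = {"True", "true", "TRUE", " True", " true"}
FALSE_TOKENS = {"False", "false", "FALSE", " False", " false"}

def score_true_false(logprobs: dict[str, float]) -> bool:
    """Sum probability mass over surface variants of each answer, so that
    any tokenisation of 'true' counts toward the 'true' response."""
    p_true = sum(math.exp(lp) for tok, lp in logprobs.items() if tok in TRUE_TOKENS)
    p_false = sum(math.exp(lp) for tok, lp in logprobs.items() if tok in FALSE_TOKENS)
    return p_true >= p_false

# Made-up log probabilities for the tokens a model might rank highest:
example_logprobs = {" True": -0.4, " true": -2.1, " False": -1.6, "Maybe": -5.0}
print(score_true_false(example_logprobs))  # -> True
```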

Finally, we calibrate the LLM results against a large newly-gathered adult human benchmark. We believe that comparing LLM performance to that of adults, rather than children, is the most relevant yardstick for LLM social intelligence given that LLMs’ primary interaction partners will be adults, and is a more concrete point of comparison because human higher-order ToM capacities continue to develop into early adulthood [Valle et al., 2015]. We do not, however, assume that the same cognitive processes underpin human and LLM performance on psychological tests.

This paper is available on arXiv under a CC BY 4.0 license.

[1] We follow the naming convention for orders developed for the IMT where the ‘1st-order’ is the mental state of the subject whose ToM ability is being assessed, the ‘2nd-order’ is the subject’s inference about what someone else thinks or feels, and so-on. By contrast, some scholars begin at ‘0-order’ for the subject’s mental state. Where our convention conflicts with others referenced, we make it explicit.

[2] The ‘Sally-Anne task’, originally devised by Baron-Cohen et al. [1985], measures false-belief understanding and follows a scenario where a character, Sally, places an object in a location and leaves the scene. While Sally is absent, Anne moves the object. Upon Sally’s return, the child is asked where Sally will search for the object, testing their ability to attribute a false belief to Sally despite themselves knowing the object’s true location.

[3] Pretraining datasets being contaminated with materials that LLMs are later tested on is a live issue in LLM research, with significant implications for LLM benchmark results. For example, OpenAI reported finding parts of the BigBench dataset in the GPT-4 pretraining corpora during a contamination check [Achiam et al., 2023].

