Do Large Language Models Have Theory of Mind? A Benchmark Study

Written by escholar | Published 2025/09/24
Tech Story Tags: theory-of-mind-ai | gpt-4-social-intelligence | ai-higher-order-reasoning | ai-mental-state-inference | recursive-reasoning-in-ai | ai-social-behavior-research | language-model-benchmarks | llm-cognitive-abilities

TLDR: This article evaluates whether advanced language models like GPT-4 and Flan-PaLM demonstrate Theory of Mind (ToM), the ability to reason about others' beliefs, intentions, and emotions. While results show GPT-4 sometimes matches or even exceeds adult human performance on 6th-order ToM tasks, limitations remain: the benchmark is small, English-only, and excludes the multimodal signals that shape real human cognition. Future research must expand across cultures, languages, and embodied interactions to truly test AI's capacity for mind-like reasoning.

Table of Links

Abstract and 1. Introduction

2. Related work

3. Materials and method

3.1 Procedures

3.2 Dataset creation

4. Results

4.1 ToM task performance

4.2 Factual task performance

4.3 Comparing performance on ToM and factual tasks and 4.4 Anchoring effect

5. Discussion

6. Limitations

7. Future research

8. Conclusion, Acknowledgments and Disclosure of Funding, and References

Appendix

6 Limitations

Our benchmark is limited in scope and size: it comprises 140 test statements, all written in English, extending to a maximum of 6 orders of ToM. Using only English obscures potential linguistic and cultural variation in human ToM, and prevents assessment of LLM ToM as exhibited in the other languages the models are able to produce. The small size of the test suite limits the generalisability of our findings, and stopping at 6th-order ToM does not appear to have exhausted either LLM or human capacities. We also did not control for the type of cognitive (e.g. thinking, knowing) or affective (e.g. feeling) states involved in the statements, which we would like to address in future work.

7 Future research

We propose three areas for future work. First, culturally diverse and comprehensive benchmarks should be developed that include multiple languages and parameterise cognitive and affective states, to capture potential differences in LLM ability to reason about each. Second, the test suite should be extended beyond 6th-order ToM to find the limits of both human and LLM orders of ToM. Finally, future work on LLM ToM should adopt multimodal paradigms (including signals such as facial expressions, gaze, and tone of voice) that reflect the embodied nature of human ToM.
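To make the notion of "orders of ToM" and the cognitive/affective parameterisation concrete, here is a minimal illustrative sketch (not the authors' actual statement-generation method; the verb lists, agent names, and `tom_statement` helper are all hypothetical): an nth-order ToM statement can be built by recursively nesting one mental-state clause per agent around a base fact, drawing verbs from a chosen state pool.

```python
import random

# Hypothetical sketch only -- not the paper's generation pipeline.
# Cognitive vs. affective mental-state verbs, as distinguished in the text.
COGNITIVE = ["thinks", "knows", "believes"]
AFFECTIVE = ["feels", "hopes", "fears"]

def tom_statement(agents, fact, state_pool, seed=0):
    """Nest one mental-state clause per agent around a base fact.

    The number of agents determines the ToM order of the statement.
    """
    rng = random.Random(seed)
    statement = fact
    # Wrap from the innermost agent outward.
    for agent in reversed(agents):
        verb = rng.choice(state_pool)
        statement = f"{agent} {verb} that {statement}"
    return statement

# A 3rd-order cognitive statement over three agents, e.g.
# "Anne ... that Bob ... that Carol ... that the shop is closed"
print(tom_statement(["Anne", "Bob", "Carol"], "the shop is closed", COGNITIVE))
```

Parameterising `state_pool` (cognitive vs. affective) and `len(agents)` (order) is one way a future benchmark could systematically vary the two factors the paper identifies as uncontrolled.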

8 Conclusion

We have shown that GPT-4 and Flan-PaLM exhibit higher-order ToM at or slightly below the level of adult humans, while smaller and non-finetuned models have limited to no capacity for higher-order ToM. We also find that GPT-4 has better-than-human performance on 6th-order ToM tasks. Given the novelty of the test suite, the fact that higher-order ToM is unlikely to be well represented in textual pretraining data, and evidence that these two models were not susceptible to perturbations of the prompt, we interpret these findings as evidence that GPT-4 and Flan-PaLM have developed ToM reasoning abilities that go beyond the manipulation of superficial statistical relationships. However, we refrain from drawing a strong conclusion about whether LLM performance on these tasks indicates the cognitive ability we call 'Theory of Mind'. LLM and human developmental processes differ greatly, and LLMs do not face the evolutionary pressure to model other minds that humans appear to face as a result of embodiment in a social world. Nevertheless, as others have noted [Mitchell and Krakauer, 2023, y Arcas, 2022], we may have to recognise LLM behaviours that are functionally equivalent to those of humans as evidence of a new kind of understanding that cannot be reduced to "spurious" correlation. This recognition may in turn lead to more parsimonious explanations of their performance on cognitive tasks and enhance our ability to assess the potential risks and benefits that advanced LLM capabilities present.

Acknowledgments and Disclosure of Funding

We thank Reed Enger (Google Research), Tong Wu (Google Research), Saige McVea (Google Research), Paulina Mustafa (Google Research) and Yeawon Choi (Google Research) for their help developing the stories and statements. This research was funded by Google.

References

Josh Achiam, Steven Adler, Sandhini Agarwal, Lama Ahmad, Ilge Akkaya, Florencia Leoni Aleman, Diogo Almeida, Janko Altenschmidt, Sam Altman, Shyamal Anadkat, et al. GPT-4 technical report. arXiv preprint arXiv:2303.08774, 2023.

R Michael Alvarez, Lonna Rae Atkeson, Ines Levin, and Yimeng Li. Paying attention to inattentive survey respondents. Political Analysis, 27(2):145–162, 2019.

Simon Baron-Cohen, Alan M Leslie, and Uta Frith. Does the autistic child have a “theory of mind”? Cognition, 21(1):37–46, 1985.

Ekaba Bisong. Google Colaboratory. In Building machine learning and deep learning models on Google Cloud Platform: a comprehensive guide for beginners, pages 59–64, 2019.

Tom Brown, Benjamin Mann, Nick Ryder, Melanie Subbiah, Jared D Kaplan, Prafulla Dhariwal, Arvind Neelakantan, Pranav Shyam, Girish Sastry, Amanda Askell, et al. Language models are few-shot learners. Advances in neural information processing systems, 33:1877–1901, 2020.

Sébastien Bubeck, Varun Chandrasekaran, Ronen Eldan, Johannes Gehrke, Eric Horvitz, Ece Kamar, Peter Lee, Yin Tat Lee, Yuanzhi Li, Scott Lundberg, et al. Sparks of artificial general intelligence: Early experiments with GPT-4. arXiv preprint arXiv:2303.12712, 2023.

Lauretta SP Cheng, Danielle Burgess, Natasha Vernooij, Cecilia Solís-Barroso, Ashley McDermott, and Savithry Namboodiripad. The problematic concept of native speaker in psycholinguistics: Replacing vague and harmful terminology with inclusive and accurate measures. Frontiers in psychology, 12:715843, 2021.

Aakanksha Chowdhery, Sharan Narang, Jacob Devlin, Maarten Bosma, Gaurav Mishra, Adam Roberts, Paul Barham, Hyung Won Chung, Charles Sutton, Sebastian Gehrmann, et al. Palm: Scaling language modeling with pathways. Journal of Machine Learning Research, 24(240):1–113, 2023.

Hyung Won Chung, Le Hou, Shayne Longpre, Barret Zoph, Yi Tay, William Fedus, Yunxuan Li, Xuezhi Wang, Mostafa Dehghani, Siddhartha Brahma, et al. Scaling instruction-finetuned language models. Journal of Machine Learning Research, 25(70):1–53, 2024.

Michael C Corballis. The evolution of language. 2017.

Harmen De Weerd, Rineke Verbrugge, and Bart Verheij. Negotiating with other minds: the role of recursive theory of mind in negotiation with incomplete information. Autonomous Agents and Multi-Agent Systems, 31:250–287, 2017.

Harmen De Weerd, Rineke Verbrugge, and Bart Verheij. Higher-order theory of mind is especially useful in unpredictable negotiations. Autonomous Agents and Multi-Agent Systems, 36(2):30, 2022.

Robin IM Dunbar. The social brain: mind, language, and society in evolutionary perspective. Annual review of Anthropology, 32(1):163–181, 2003.

Seliem El-Sayed, Canfer Akbulut, Amanda McCroskery, Geoff Keeling, Zachary Kenton, Zaria Jalan, Nahema Marchal, Arianna Manzini, Toby Shevlane, Shannon Vallor, et al. A mechanism-based approach to mitigating harms from persuasive generative ai. arXiv preprint arXiv:2404.15058, 2024.

Camila Fernández. Mindful storytellers: Emerging pragmatics and theory of mind development. First Language, 33(1):20–46, 2013.

Iason Gabriel, Arianna Manzini, Geoff Keeling, Lisa Anne Hendricks, Verena Rieser, Hasan Iqbal, Nenad Tomašev, Ira Ktena, Zachary Kenton, Mikel Rodriguez, et al. The ethics of advanced ai assistants. arXiv preprint arXiv:2404.16244, 2024.

Kanishk Gandhi, Jan-Philipp Fränken, Tobias Gerstenberg, and Noah Goodman. Understanding social reasoning in language models with language models. Advances in Neural Information Processing Systems, 36, 2024.

Yinghui He, Yufan Wu, Yilin Jia, Rada Mihalcea, Yulong Chen, and Naihao Deng. Hi-ToM: A benchmark for evaluating higher-order theory of mind reasoning in large language models. arXiv preprint arXiv:2310.16755, 2023.

Fritz Heider. Attitudes and cognitive organization. The Journal of psychology, 21(1):107–112, 1946.

Tom Henighan, Jared Kaplan, Mor Katz, Mark Chen, Christopher Hesse, Jacob Jackson, Heewoo Jun, Tom B Brown, Prafulla Dhariwal, Scott Gray, et al. Scaling laws for autoregressive generative modeling. arXiv preprint arXiv:2010.14701, 2020.

Christine I Hooker, Sara C Verosky, Laura T Germine, Robert T Knight, and Mark D’Esposito. Mentalizing about emotion and its relationship to empathy. Social cognitive and affective neuroscience, 3(3):204–217, 2008.

Nicholas K Humphrey. The social function of intellect. 1976.

Janet S Hyde and Marcia C Linn. Gender differences in verbal ability: A meta-analysis. Psychological bulletin, 104(1):53, 1988.

IBM Corp. Released 2021. IBM SPSS Statistics for Windows, Version 28.0.1.0. Armonk, NY: IBM Corp.

Boaz Keysar, Shuhong Lin, and Dale J Barr. Limits on theory of mind use in adults. Cognition, 89(1):25–41, 2003.

Peter Kinderman, Robin Dunbar, and Richard P Bentall. Theory-of-mind deficits and causal attributions. British journal of Psychology, 89(2):191–204, 1998.

Michal Kosinski. Theory of mind may have spontaneously emerged in large language models. arXiv preprint arXiv:2302.02083, 2023.

Jonathan D Lane, Henry M Wellman, Sheryl L Olson, Jennifer LaBounty, and David CR Kerr. Theory of mind and emotion understanding predict moral development in early childhood. British Journal of Developmental Psychology, 28(4):871–889, 2010.

Penelope A Lewis, Roozbeh Rezaie, Rachel Brown, Neil Roberts, and Robin IM Dunbar. Ventromedial prefrontal volume predicts understanding of others and social network size. Neuroimage, 57(4):1624–1629, 2011.

Yao Lu, Max Bartolo, Alastair Moore, Sebastian Riedel, and Pontus Stenetorp. Fantastically ordered prompts and where to find them: Overcoming few-shot prompt order sensitivity. arXiv preprint arXiv:2104.08786, 2021.

Bertram F Malle. How the mind explains behavior. Folk explanation, Meaning and social interaction. Massachusetts: MIT-Press, 2004.

Patrick McGuiness. GPT-4 details revealed. 12 July 2023. URL https://patmcguinness.substack.com/p/gpt-4-details-revealed.

Melanie Mitchell and David C Krakauer. The debate over understanding in ai’s large language models. Proceedings of the National Academy of Sciences, 120(13):e2215907120, 2023.

Steven Mithen. The prehistory of the mind: The cognitive origins of art and science. Thames & Hudson Ltd., 1996.

Nathan Oesch and Robin IM Dunbar. The emergence of recursion in human language: Mentalising predicts recursive syntax task performance. Journal of Neurolinguistics, 43:95–106, 2017.

Long Ouyang, Jeffrey Wu, Xu Jiang, Diogo Almeida, Carroll Wainwright, Pamela Mishkin, Chong Zhang, Sandhini Agarwal, Katarina Slama, Alex Ray, et al. Training language models to follow instructions with human feedback. Advances in neural information processing systems, 35:27730–27744, 2022.

Joon Sung Park, Joseph O’Brien, Carrie Jun Cai, Meredith Ringel Morris, Percy Liang, and Michael S Bernstein. Generative agents: Interactive simulacra of human behavior. In Proceedings of the 36th Annual ACM Symposium on User Interface Software and Technology, pages 1–22, 2023.

Tuan Minh Pham, Jan Korbel, Rudolf Hanel, and Stefan Thurner. Empirical social triad statistics can be explained with dyadic homophylic interactions. Proceedings of the National Academy of Sciences, 119(6):e2121103119, 2022.

Joanne L Powell, Penelope A Lewis, Robin IM Dunbar, Marta García-Fiñana, and Neil Roberts. Orbital prefrontal cortex volume correlates with social cognitive competence. Neuropsychologia, 48(12):3554–3562, 2010.

David Premack and Guy Woodruff. Does the chimpanzee have a theory of mind? Behavioral and brain sciences, 1(4):515–526, 1978.

Victor Sanh, Albert Webson, Colin Raffel, Stephen H Bach, Lintang Sutawika, Zaid Alyafeai, Antoine Chaffin, Arnaud Stiegler, Teven Le Scao, Arun Raja, et al. Multitask prompted training enables zero-shot task generalization. arXiv preprint arXiv:2110.08207, 2021.

Maarten Sap, Ronan LeBras, Daniel Fried, and Yejin Choi. Neural theory-of-mind? On the limits of social intelligence in large LMs. arXiv preprint arXiv:2210.13312, 2022.

Brenda Schick, Peter De Villiers, Jill De Villiers, and Robert Hoffmeister. Language and theory of mind: A study of deaf children. Child development, 78(2):376–396, 2007.

Natalie Shapira, Mosh Levy, Seyed Hossein Alavi, Xuhui Zhou, Yejin Choi, Yoav Goldberg, Maarten Sap, and Vered Shwartz. Clever hans or neural theory of mind? stress testing social reasoning in large language models. arXiv preprint arXiv:2305.14763, 2023.

Henning Silber, Joss Roßmann, and Tobias Gummer. The issue of noncompliance in attention check questions: False positives in instructed response items. Field Methods, 34(4):346–360, 2022.

James Stiller and Robin IM Dunbar. Perspective-taking and memory capacity predict social network size. Social Networks, 29(1):93–104, 2007.

Winnie Street. LLM theory of mind and alignment: Opportunities and risks. arXiv preprint arXiv:2405.08154, 2024.

Jon Sutton, Peter K Smith, and John Swettenham. Bullying and 'theory of mind': A critique of the 'social skills deficit' view of anti-social behaviour. Social development, 8(1):117–127, 1999a.

Jon Sutton, Peter K Smith, and John Swettenham. Social cognition and bullying: Social inadequacy or skilled manipulation? British journal of developmental psychology, 17(3):435–450, 1999b.

Romal Thoppilan, Daniel De Freitas, Jamie Hall, Noam Shazeer, Apoorv Kulshreshtha, Heng-Tze Cheng, Alicia Jin, Taylor Bos, Leslie Baker, Yu Du, et al. Lamda: Language models for dialog applications. arXiv preprint arXiv:2201.08239, 2022.

Amos Tversky and Daniel Kahneman. Judgment under uncertainty: Heuristics and biases: Biases in judgments reveal some heuristics of thinking under uncertainty. Science, 185(4157):1124–1131, 1974.

Tomer Ullman. Large language models fail on trivial alterations to theory-of-mind tasks. arXiv preprint arXiv:2302.08399, 2023.

Annalisa Valle, Davide Massaro, Ilaria Castelli, and Antonella Marchetti. Theory of mind development in adolescence and early adulthood: The growing complexity of recursive thinking ability. Europe’s journal of psychology, 11(1):112, 2015.

Max J van Duijn, Bram van Dijk, Tom Kouwenhoven, Werner de Valk, Marco R Spruit, and Peter van der Putten. Theory of mind in large language models: Examining performance of 11 state-of-the-art models vs. children aged 7-10 on advanced tests. arXiv preprint arXiv:2310.20320, 2023.

Lei Wang, Chen Ma, Xueyang Feng, Zeyu Zhang, Hao Yang, Jingsen Zhang, Zhiyuan Chen, Jiakai Tang, Xu Chen, Yankai Lin, Wayne Xin Zhao, Zhewei Wei, and Ji-Rong Wen. A survey on large language model based autonomous agents, 2023.

Henry M Wellman and Karen Bartsch. Young children's reasoning about beliefs. Cognition, 30(3):239–277, 1988.

Henry M Wellman, David Cross, and Julanne Watson. Meta-analysis of theory-of-mind development: The truth about false belief. Child development, 72(3):655–684, 2001.

Heinz Wimmer and Josef Perner. Beliefs about beliefs: Representation and constraining function of wrong beliefs in young children’s understanding of deception. Cognition, 13(1):103–128, 1983.

Blaise Agüera y Arcas. Do large language models understand us? Daedalus, 151(2):183–197, 2022.

Authors:

(1) Winnie Street, Google Research;

(2) John Oliver Siy, Google Research;

(3) Geoff Keeling, Google Research;

(4) Adrien Baranes, Google DeepMind;

(5) Benjamin Barnett, Google Research;

(6) Michael Mckibben, Applied Physics Lab, Johns Hopkins University;

(7) Tatenda Kanyere, Work done at Google Research via Harvey Nash;

(8) Alison Lentz, Google Research;

(9) Blaise Aguera y Arcas, Google Research;

(10) Robin I. M. Dunbar, Department of Experimental Psychology, University of Oxford [email protected].


This paper is available on arXiv under CC BY 4.0 license.

