The Psychology of AI Chatbots

Written by escholar | Published 2025/09/23
Tech Story Tags: theory-of-mind-ai | gpt-4-social-intelligence | ai-higher-order-reasoning | ai-mental-state-inference | recursive-reasoning-in-ai | ai-social-behavior-research | language-model-benchmarks | llm-cognitive-abilities

TL;DR: This study explores how advanced large language models, including GPT-4 and Flan-PaLM, perform on higher-order Theory of Mind (ToM) tasks that measure the ability to reason about recursive mental states. Using a new handwritten benchmark, researchers found GPT-4 reaches and even surpasses adult human performance on complex ToM reasoning, suggesting that the largest and best-tuned models have developed a generalized capacity for social inference. These results have far-reaching implications for the way AI systems interact with humans in cooperative, competitive, and socially complex contexts.

Table of Links

Abstract and 1. Introduction

2. Related work

3. Materials and method

3.1 Procedures

3.2 Dataset creation

4. Results

4.1 ToM task performance

4.2 Factual task performance

4.3 Comparing performance on ToM and factual tasks and 4.4 Anchoring effect

5. Discussion

6. Limitations, 7. Future research, 8. Conclusion, Acknowledgments and Disclosure of Funding, and References

Appendix

3 Materials and method

We introduce a new benchmark, Multi-Order Theory of Mind Question & Answer (MoToMQA), to assess human and LLM ToM abilities at increasing orders, based upon the Imposing Memory Task (IMT), a well-validated psychological test for assessing higher-order ToM abilities in adults [Kinderman et al., 1998, Stiller and Dunbar, 2007, Lewis et al., 2011, Oesch and Dunbar, 2017, Powell et al., 2010]. MoToMQA comprises 7 short stories of about 200 words, each describing social interactions between 3 to 5 characters and accompanied by 20 true-or-false statements: 10 statements target ToM orders 2-6, and 10 concern facts in the story that are 2-6 atomic propositions long, mapping to the orders of the ToM statements. From here onwards we will refer to ‘orders’ to describe ToM statements and ‘levels’ to describe the factual statements. The MoToMQA benchmark is available upon request, but we do not include it in this paper to prevent its inclusion in pretraining corpora for future LLMs, which could render the test redundant.

We checked each statement for unclear or ambiguous wording, grammatical errors, and missing mental states or propositional clauses. We follow Oesch and Dunbar [2017]’s amendments to the IMT by having factual statements that only address social facts (i.e. facts pertaining to individuals in the story), not instrumental facts (e.g. “the sky is blue”), and by counterbalancing the number of true and false statements per story, statement type, and ToM order or factual level. This resulted in the following set of statements per story, where the number indicates the order of the ToM statement or the level of the factual statement, ‘ToM’ signifies a ToM statement, ‘F’ signifies a factual statement, ‘t’ signifies a true statement, and ‘f’ signifies a false statement: [ToM2t, ToM2f, ToM3t, ToM3f, ToM4t, ToM4f, ToM5t, ToM5f, ToM6t, ToM6f, F2t, F2f, F3t, F3f, F4t, F4f, F5t, F5f, F6t, F6f].
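For concreteness, the counterbalanced label set above can be generated programmatically. The following minimal sketch is ours (not part of the MoToMQA release) and simply reproduces the 20 labels per story in the same order.

```python
# Sketch of the per-story statement labels: for every ToM order and factual
# level from 2 to 6, one true and one false statement (2 kinds x 5 levels x 2 = 20).
from itertools import product

def statement_labels() -> list[str]:
    """Return the 20 statement labels for a story, e.g. 'ToM2t', 'F5f'."""
    return [
        f"{kind}{level}{truth}"
        for kind, level, truth in product(("ToM", "F"), range(2, 7), ("t", "f"))
    ]

print(statement_labels())
# ['ToM2t', 'ToM2f', 'ToM3t', ..., 'ToM6f', 'F2t', ..., 'F6t', 'F6f']
```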

Factual statements require only recall, whereas ToM statements require recall plus inference. We include the factual statements as a control for human and LLM comprehension of the stories and capacity for recall. Given the inherent differences between ToM and factual statements, we added a further control for the effects of human memory capacity on ToM performance by running two ‘story conditions’: one where participants read the story and then proceeded to a second screen where they answered the question without the story visible (‘no story’), and one where the story remained at the top of the screen while they answered the question (‘story’), to eliminate the chance that ToM failures were really memory failures.

Prompt design has been shown to have a significant impact on LLM performance on a range of tasks, including ToM (e.g. [Brown et al., 2020, Lu et al., 2021, Ullman, 2023]). We therefore tested two prompt conditions: the ‘human prompt’, which uses the exact text from the human study, and the ‘simplified prompt’, which removes the text before the story and question and provides ‘Question:’ and ‘Answer:’ tags. The simplified prompt is intended to make the nature of the Q & A task, and thus the desired true/false response, clearer to the models. Finally, we assessed whether LLM or human performance was subject to ‘anchoring effects’ based on the order of ‘true’ and ‘false’ in the question. The anchoring effect is a well-documented psychological phenomenon whereby people rely too heavily on the first piece of information offered (‘the anchor’) when making decisions [Tversky and Kahneman, 1974]. We ran two question conditions: one where the question read "Do you think the following statement is true or false?", and the other where the question read "Do you think the following statement is false or true?"
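As an illustration of the four resulting prompt variants (2 prompt conditions × 2 question orders), the sketch below builds the prompt strings. The exact preamble shown to human participants and the precise placement of the statement relative to the question are not reproduced in this section, so `HUMAN_PREAMBLE` and the string layout here are assumptions.

```python
# Illustrative construction of the four prompt variants. HUMAN_PREAMBLE and the
# exact layout are placeholders; only the 'Question:'/'Answer:' tags and the two
# question orderings are taken from the description above.

HUMAN_PREAMBLE = "<exact instruction text shown to human participants>"  # placeholder

def build_prompt(story: str, statement: str, simplified: bool, false_first: bool) -> str:
    order = "false or true" if false_first else "true or false"
    question = f"Do you think the following statement is {order}?"
    if simplified:
        # 'Simplified prompt': no preamble, explicit Question/Answer tags.
        return f"{story}\n\nQuestion: {question}\n{statement}\nAnswer:"
    # 'Human prompt': the text from the human study precedes the story and question.
    return f"{HUMAN_PREAMBLE}\n\n{story}\n\n{question}\n{statement}"
```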

3.1 Procedures

3.1.1 Human procedure

Participants were screened for having English as a first language using an adaptation of the most recent UK census survey (see Appendix). Participants were randomly assigned to one of the 7 stories and asked to read it twice, then randomly assigned to one of the 20 statements corresponding to that story and asked to provide a true/false response (see Figure 1). We did not include an attention check, since attention checks have known limitations, including inducing purposeful noncompliance with a practice perceived as controlling [Silber et al., 2022] and leading to the systematic underrepresentation of certain demographic groups, for instance the young and less educated [Alvarez et al., 2019]. Each human saw only one statement to prevent them from learning across trials, analogously to the models, which saw each trial independently and did not learn across trials or ‘in context’. We ran a pilot study with 1,440 participants and made minor changes to the story and test procedure on the basis of the results (more details in the Appendix).

We ran the final survey on Qualtrics in April 2023 and paid participants $5 for a 5-minute survey. The study was Google-branded, and participants were asked to sign a Google consent form. Partial responses, including those from participants who dropped out partway through, were screened out. Qualtrics cleaned the data, removing all responses that included gibberish, machine-generated responses, or nonsensical responses to the open-ended question. We did not exclude any other responses. We gathered 29,259 individual responses from U.K.-based participants for whom English is a first language. We gathered an even sample across age and gender groups and had quotas for each age group and gender per statement. In total we had 14,682 female respondents, 14,363 male respondents, 149 non-binary/third gender respondents, and 53 who answered ‘Prefer not to say’ to the gender question. We had 7,338 responses from those aged 18-29, 7,335 from those aged 30-39, 7,270 from those aged 40-49, and 7,316 from those aged 50-65.

3.1.2 LLM procedure

We tested 5 language models: GPT-3.5 Turbo Instruct [Brown et al., 2020] and GPT-4 [Achiam et al., 2023] from OpenAI, and LaMDA [Thoppilan et al., 2022], PaLM [Chowdhery et al., 2023] and Flan-PaLM [Chung et al., 2024] from Google (for more details on the models we tested, see the Appendix). We could not test Google’s Gemini model because our analysis method requires output logprobs, which are not exposed in the Gemini API. Below is a table of the key features of the models tested, according to the information that is publicly available about them.

We provided single-token candidate words to the LLM APIs as part of the input and assessed the log probabilities[4] assigned to them. We sent the candidates using the ‘candidate’ parameter in the ‘scoring’ APIs for LaMDA, PaLM, and Flan-PaLM, and the ‘logit_bias’ parameter for the GPT-3.5 and GPT-4 APIs. There was no temperature parameter for the LaMDA, PaLM and Flan-PaLM ‘scoring’ APIs, so we could only obtain one unique response per statement. We left the temperature at the default of 1 for GPT-3.5 and GPT-4.
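For the GPT models, the procedure can be approximated as follows. This is a minimal sketch of our own reconstruction rather than the authors' code: it assumes the legacy OpenAI Completions endpoint (as used for GPT-3.5 Turbo Instruct), single-token candidates, and a large positive `logit_bias` to steer generation toward the candidate tokens; the Google ‘scoring’ APIs, which accept candidates directly, are not shown.

```python
# Sketch (our reconstruction, not the authors' code) of candidate scoring via the
# OpenAI Completions endpoint: bias the candidate tokens upwards with `logit_bias`,
# generate a single token, and read back the returned top log probabilities.

import tiktoken
from openai import OpenAI

client = OpenAI()
enc = tiktoken.encoding_for_model("gpt-3.5-turbo-instruct")

def candidate_logprobs(prompt: str, candidates: list[str]) -> dict[str, float]:
    # Assumes each candidate maps to a single token in this tokenizer.
    token_ids = {c: enc.encode(c)[0] for c in candidates}
    resp = client.completions.create(
        model="gpt-3.5-turbo-instruct",
        prompt=prompt,
        max_tokens=1,
        temperature=1,                      # default, as in the study
        logprobs=5,
        logit_bias={str(tid): 100 for tid in token_ids.values()},
    )
    # Returned tokens may carry a leading space depending on tokenisation.
    top = resp.choices[0].logprobs.top_logprobs[0]
    return {c: top.get(c, top.get(" " + c)) for c in candidates}
```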

One issue with basing LLM task performance on the most probable next token is that there are multiple semantically equivalent correct responses (e.g. when responding to the question "What colour is the sky?", the answers "blue" and "The sky is blue" are equally valid and correct, but only the first assigns the greatest probability to the token for ‘blue’). We addressed this problem, and improved the robustness of our results, by providing the model with different capitalisations of ‘true’ and ‘false’, which are represented by different tokens. For all of the models, the candidates were tested in 2 sets of 4: [‘True’, ‘False’, ‘TRUE’, ‘FALSE’] and [‘true’, ‘false’, ‘Yes’, ‘No’]. We also sent ‘Yes’ and ‘No’ as candidate responses in the second set, but did not include them in our analysis, as neither is a valid response to a true/false question.
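One way to pool the capitalisation variants into a single verdict is to sum their probabilities, as sketched below. This pooling rule is our assumption; the section does not state exactly how the variant logprobs were combined.

```python
# Sketch of pooling the capitalisation variants into one true/false decision.
# The summation rule is an assumption, not taken from the paper.

import math

TRUE_VARIANTS = ("True", "TRUE", "true")
FALSE_VARIANTS = ("False", "FALSE", "false")

def decide(logprobs: dict[str, float]) -> str:
    """logprobs maps each candidate string to its log probability (or None if absent)."""
    p_true = sum(math.exp(lp) for v in TRUE_VARIANTS if (lp := logprobs.get(v)) is not None)
    p_false = sum(math.exp(lp) for v in FALSE_VARIANTS if (lp := logprobs.get(v)) is not None)
    return "true" if p_true > p_false else "false"
```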

We used Google Colaboratory [Bisong and Bisong, 2019] to call the GPT-3.5, GPT-4, LaMDA, PaLM and Flan-PaLM APIs programmatically. Each call was performed by concatenating the story and a single statement at a time. In total, we processed 7 stories with 20 statements each across the 4 conditions listed above, and therefore collected 560 sets of 12 candidate logprobs, amounting to 5600 individual data points for each of the language models studied. The API calls for LaMDA, PaLM and Flan-PaLM were conducted in February 2023. The calls for GPT-3.5 and GPT-4 were conducted in December 2023 and January 2024 respectively.
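Putting the pieces together, the evaluation loop implied by this description looks roughly like the sketch below: 7 stories × 20 statements × 4 conditions (2 prompt formats × 2 question orders) gives 560 prompts per model, each scored against the two candidate sets. The helper functions are the illustrative ones sketched earlier, not the authors' implementation.

```python
# Illustrative driver loop. `build_prompt` and `candidate_logprobs` are the
# hypothetical helpers sketched above, not the paper's released code.

CANDIDATE_SETS = [["True", "False", "TRUE", "FALSE"], ["true", "false", "Yes", "No"]]

def run_model(stories: dict[str, list[str]]) -> list[dict[str, float]]:
    results = []
    for story, statements in stories.items():           # 7 stories
        for statement in statements:                     # 20 statements each
            for simplified in (False, True):             # 2 prompt conditions
                for false_first in (False, True):        # 2 question orders
                    prompt = build_prompt(story, statement, simplified, false_first)
                    for candidates in CANDIDATE_SETS:    # 2 candidate sets
                        results.append(candidate_logprobs(prompt, candidates))
    return results                                       # 560 prompts x 2 candidate sets
```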

Authors:

(1) Winnie Street, Google Research;

(2) John Oliver Siy, Google Research;

(3) Geoff Keeling, Google Research;

(4) Adrien Baranes, Google DeepMind;

(5) Benjamin Barnett, Google Research;

(6) Michael Mckibben, Applied Physics Lab, Johns Hopkins University;

(7) Tatenda Kanyere, Work done at Google Research via Harvey Nash;

(8) Alison Lentz, Google Research;

(9) Blaise Aguera y Arcas, Google Research;

(10) Robin I. M. Dunbar, Department of Experimental Psychology, University of Oxford [email protected].


This paper is available on arXiv under a CC BY 4.0 license.

