Notes on Building a Dataset for LLM True/False Reasoning

Written by escholar | Published 2025/09/23

TL;DR: This article explains how researchers built a dataset to fairly compare human and LLM responses on true/false reasoning tasks. By aggregating logits into binary probabilities, aligning mismatched conditions, and controlling for minor text differences, the study ensured consistent units of analysis. Results highlight both the similarities and key differences between human judgments and LLM outputs, showing how carefully constructed datasets make such comparisons meaningful.

Table of Links

Abstract and 1. Introduction

2. Related work

3. Materials and method

3.1 Procedures

3.2 Dataset creation

4. Results

4.1 ToM task performance

4.2 Factual task performance

4.3 Comparing performance on ToM and factual tasks and 4.4 Anchoring effect

5. Discussion

6. Limitations

7. Future research

8. Conclusion, Acknowledgments and Disclosure of Funding, and References

Appendix

3.2 Dataset creation

Our LLM data thus consisted of six log probabilities, one for each of our six candidate tokens, taken as a subset of the full distribution of probabilities the model produces. We extracted an overall probability of a ‘true’ or ‘false’ response across the candidates by summing the probabilities of the semantically equivalent positive tokens and of the semantically equivalent negative tokens and dividing each by the total probability mass. The affirmative response probability was computed as follows:

P(R_a) = Σ_i exp(x_i) / (Σ_i exp(x_i) + Σ_j exp(x_j))

where x_i is the logit associated with the i-th entry in [‘True’, ‘true’, ‘TRUE’] and x_j is the logit associated with the j-th entry in [‘False’, ‘false’, ‘FALSE’]. An equivalent calculation was performed for the negative response probability P(R_n). A response of ‘True’ was recorded for a statement if the affirmative probability was above 50%; otherwise a response of ‘False’ was recorded. This method also produces almost identical results to taking argmax_i(x_i) over the candidates (see Appendix).
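
The aggregation above can be sketched in a few lines of code. The snippet below is a minimal illustration, assuming the six candidate logits have already been extracted from the model's output; the function names (affirmative_probability, binary_response, argmax_response) and the example logits are hypothetical, not the authors' actual pipeline.

```python
import numpy as np

# Hypothetical candidate token sets; the logits for these six candidates
# are assumed to have been pulled from the model's output distribution.
POSITIVE = ["True", "true", "TRUE"]
NEGATIVE = ["False", "false", "FALSE"]

def affirmative_probability(candidate_logits):
    """Collapse the six candidate logits into P(affirmative response):
    exponentiate, renormalise over the six candidates, and sum the
    mass assigned to the positive tokens."""
    pos = np.array([candidate_logits[t] for t in POSITIVE], dtype=float)
    neg = np.array([candidate_logits[t] for t in NEGATIVE], dtype=float)
    logits = np.concatenate([pos, neg])
    probs = np.exp(logits - logits.max())  # subtract max for numerical stability
    probs /= probs.sum()
    return probs[: len(POSITIVE)].sum()

def binary_response(candidate_logits, threshold=0.5):
    """Record 'True' if the affirmative probability exceeds the threshold."""
    return "True" if affirmative_probability(candidate_logits) > threshold else "False"

def argmax_response(candidate_logits):
    """Alternative rule: take the single highest-logit candidate."""
    best = max(candidate_logits, key=candidate_logits.get)
    return "True" if best in POSITIVE else "False"

# Made-up logits for one statement
example = {"True": -1.2, "true": -0.4, "TRUE": -3.0,
           "False": -2.1, "false": -1.8, "FALSE": -4.2}
print(binary_response(example), argmax_response(example))  # both rules agree here
```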

The human dataset contains multiple responses to the same statement, whereas the LLM dataset contains a single response per statement. To align the unit of analysis between the two datasets, we transformed the human data into a single binary ‘true’ or ‘false’ per statement, based on whether the proportion of ‘true’ responses for that statement was above or below 50%. Another challenge in making direct comparisons between the human data and the LLM data was that the human ‘story’ conditions and the LLM ‘prompt’ conditions do not map exactly 1:1. However, one baseline condition was identical for humans and LLMs (human ‘no story’ and LLM ‘human prompt’), and one treatment condition, intended in both cases to reduce the effect of confounding factors, differed only slightly between them (human ‘with story’, targeting memory, and LLM ‘simplified prompt’, targeting task understanding). We therefore mapped the baseline conditions together and the treatment conditions together. Despite the differences between the LLM ‘simplified prompt’ and human ‘with story’ conditions, we are confident in this mapping because these conditions did not have a significant effect on human or LLM performance (see Appendix).
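
As a concrete sketch of this alignment step, the snippet below aggregates a toy human-response table into one majority-vote answer per statement and maps the human and LLM condition labels onto a shared baseline/treatment scheme. The column names, condition labels, and data are illustrative assumptions, not the study's actual tables.

```python
import pandas as pd

# Toy human data: one row per participant x statement (columns are hypothetical).
human = pd.DataFrame({
    "statement_id": [1, 1, 1, 2, 2, 2],
    "condition":    ["no story"] * 6,
    "response":     ["true", "true", "false", "false", "false", "true"],
})

# Collapse to one binary response per statement: 'true' if more than 50%
# of participants answered 'true', otherwise 'false'.
majority = (
    human.assign(is_true=human["response"].eq("true"))
         .groupby(["statement_id", "condition"])["is_true"]
         .mean()
         .gt(0.5)
         .map({True: "true", False: "false"})
         .rename("response")
         .reset_index()
)

# Map the non-identical condition labels onto a shared scheme so that human
# and LLM rows share the same unit of analysis.
condition_map = {
    "no story": "baseline",        # human baseline, paired with LLM 'human prompt'
    "human prompt": "baseline",
    "with story": "treatment",     # human memory aid, paired with LLM 'simplified prompt'
    "simplified prompt": "treatment",
}
majority["condition"] = majority["condition"].map(condition_map)
print(majority)
```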

During data analysis we discovered that for 16 out of 560 statements there were minor differences between the statement shown to humans and that shown to LLMs. We re-ran all analyses omitting those statements and found that the conclusions stayed the same; we attribute any small shifts in individual statistics primarily to the reduction in power when the conflicting statements were omitted. We conducted inferential statistical analyses using SPSS version 28.0.1.0 [IBM Corp.].

4 Results

4.1 ToM task performance

4.2 Factual task performance

4.3 Comparing performance on ToM and factual tasks

An independent samples test of proportions revealed that the proportion of factual (‘fact’) statements answered correctly was significantly greater than the proportion of ToM (‘ToM’) statements answered correctly by humans (M_fact = 97.5%, M_ToM = 90.4%), Z = 3.539, p < .001, Flan-PaLM (M_fact = 93.6%, M_ToM = 84.3%), Z = 3.502, p < .001, GPT-4 (M_fact = 94.3%, M_ToM = 88.6%), Z = 2.415, p = .016, and GPT-3.5 (M_fact = 62.9%, M_ToM = 52.5%), Z = 2.480, p = .013. The proportion of correct responses on fact and ToM statements did not significantly differ for PaLM (M_fact = 59.6%, M_ToM = 59.3%), Z = .086, p = .931, or LaMDA (M_fact = 50%, M_ToM = 50%), Z = 0, p = 1.000.
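
The comparisons reported here (and the anchoring comparisons in 4.4) are independent-samples tests of two proportions. The sketch below shows the general shape of such a test using statsmodels; the correct-response counts and per-task statement totals are made-up placeholders, since the exact counts are not given in this excerpt.

```python
import numpy as np
from statsmodels.stats.proportion import proportions_ztest

# Placeholder counts: e.g. 265/280 factual statements vs. 253/280 ToM
# statements answered correctly by one model (illustrative values only).
correct = np.array([265, 253])
totals = np.array([280, 280])

# Two-sample z-test for a difference in proportions.
z_stat, p_value = proportions_ztest(count=correct, nobs=totals)
print(f"Z = {z_stat:.3f}, p = {p_value:.3f}")
```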

4.4 Anchoring effect

We examined whether the ordering of response options (true first vs. false first) affected how models and humans responded. The ordering of response options had a significant effect on the answers provided by PaLM and GPT-3.5. An independent samples test of proportions revealed that the proportion of ‘true’ responses provided by PaLM was higher in the ‘true then false’ condition (M_ttf = 73.2%) than in the ‘false then true’ condition (M_ftt = 47.1%), N = 560, Z = 6.302, p < .001. The proportion of ‘true’ responses provided by GPT-3.5 was also significantly higher in the ‘true then false’ condition (M_ttf = 43.9%) than in the ‘false then true’ condition (M_ftt = 22.9%), N = 560, Z = 5.287, p < .001. The order of response options did not have a significant effect on the answers provided by Flan-PaLM (M_ttf = 58.6%, M_ftt = 57.9%), N = 560, Z = .171, p = .864, GPT-4 (M_ttf = 47.5%, M_ftt = 47.5%), N = 560, Z = .000, p = 1.000, or humans (M_ttf = 55.4%, M_ftt = 53.9%), N = 560, Z = .367, p = .734. LaMDA responded ‘true’ to all statements regardless of condition (M_ttf = 100%, M_ftt = 100%).

Authors:

(1) Winnie Street, Google Research;

(2) John Oliver Siy, Google Research;

(3) Geoff Keeling, Google Research;

(4) Adrien Baranes, Google DeepMind;

(5) Benjamin Barnett, Google Research;

(6) Michael Mckibben, Applied Physics Lab, Johns Hopkins University;

(7) Tatenda Kanyere, Work done at Google Research via Harvey Nash;

(8) Alison Lentz, Google Research;

(9) Blaise Aguera y Arcas, Google Research;

(10) Robin I. M. Dunbar, Department of Experimental Psychology, University of Oxford.


This paper is available on arXiv under a CC BY 4.0 license.

