A. Formalization of the SylloBio-NLI Resource Generation Process
B. Formalization of Tasks 1 and 2
C. Dictionary of gene and pathway membership
D. Domain-specific pipeline for creating NL instances and E. Accessing LLMs
H. Prompting LLMs - Zero-shot prompts
I. Prompting LLMs - Few-shot prompts
J. Results: Misaligned Instruction-Response
K. Results: Ambiguous Impact of Distractors on Reasoning
L. Results: Models Prioritize Contextual Knowledge Over Background Knowledge
M. Supplementary Figures and N. Supplementary Tables
Syllogistic reasoning, which involves deriving valid conclusions from given premises based on formal logical structures, has long been a central focus in the study of Natural Language Inference (NLI). Early transformer models like BERT Devlin et al. [2019] and RoBERTa Liu et al. [2019] were trained on general NLI datasets, such as SNLI Bowman et al. [2015] and MNLI Williams et al. [2018], to address reasoning tasks. However, these models were limited in their ability to generalize abstract reasoning patterns, particularly in more specialized domains. Recent work has explored whether larger, decoder-based language models are capable of capturing syllogistic reasoning without task-specific fine-tuning, as reasoning itself is content-independent Bertolazzi et al. [2024].
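To make the task concrete, the sketch below represents a single syllogistic NLI instance as a small Python structure, contrasting a valid scheme (generalized modus ponens) with an invalid one where nothing follows. The field names, label set, and premise wording are illustrative assumptions for exposition, not the actual SylloBio-NLI schema.

```python
from dataclasses import dataclass

# Illustrative representation of a syllogistic NLI instance.
# Field names and labels are assumptions, not the dataset's actual schema.
@dataclass
class SyllogisticInstance:
    premises: list[str]   # the premises of the syllogism
    hypothesis: str       # the candidate conclusion
    label: str            # "entailment" or "non-entailment"

# Valid syllogism (generalized modus ponens): the conclusion follows.
valid = SyllogisticInstance(
    premises=[
        "All members of group A are members of group B.",
        "X is a member of group A.",
    ],
    hypothesis="X is a member of group B.",
    label="entailment",
)

# Invalid variant (affirming the consequent): nothing follows.
invalid = SyllogisticInstance(
    premises=[
        "All members of group A are members of group B.",
        "X is a member of group B.",
    ],
    hypothesis="X is a member of group A.",
    label="non-entailment",
)
```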
Recent research, such as Liu et al. [2023], has introduced benchmarks like GLoRE to test logical reasoning in LLMs, utilizing techniques such as Chain-of-Thought (CoT) prompting Wei et al. [2023] and in-context learning (ICL) Huang and Chang [2023]. These methods have been shown to improve reasoning performance, but issues remain, as models often rely on superficial patterns instead of deep logical comprehension. Eisape et al. [2024] further demonstrated that even advanced models exhibit biases when handling syllogisms, particularly invalid ones, often failing to generate "nothing follows" conclusions.
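As an illustration of the two prompting styles mentioned above, the following sketch builds a zero-shot prompt and a Chain-of-Thought-style prompt for a syllogistic instance. The instruction wording is a plausible assumption rather than the exact prompts used in this work (those are listed in Appendices H and I).

```python
def zero_shot_prompt(premises: list[str], hypothesis: str) -> str:
    """Plain instruction: ask for a direct True/False judgement."""
    premise_text = " ".join(premises)
    return (
        f"Premises: {premise_text}\n"
        f"Hypothesis: {hypothesis}\n"
        "Does the hypothesis follow logically from the premises? "
        "Answer 'True' or 'False'."
    )

def cot_prompt(premises: list[str], hypothesis: str) -> str:
    """Chain-of-Thought variant: elicit step-by-step reasoning first."""
    return (
        zero_shot_prompt(premises, hypothesis)
        + "\nLet's think step by step before giving the final answer."
    )

# Example usage with a generic (non-biomedical) syllogism.
example = cot_prompt(
    ["All members of group A are members of group B.",
     "X is a member of group A."],
    "X is a member of group B.",
)
```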
While the general-domain performance of LLMs in syllogistic reasoning has been explored Eisape et al. [2024], Dasgupta et al. [2024], research on syllogistic reasoning in the biomedical domain remains scarce. This paper addresses this gap by focusing specifically on biomedical applications, where logical reasoning over specialized knowledge is critical. We introduce a large-scale biomedical syllogism dataset, SylloBio-NLI, and provide a comprehensive evaluation of the syllogistic reasoning capabilities of state-of-the-art LLMs within this domain.
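The resource itself is built by instantiating argument schemes with gene-pathway membership facts (Appendices A, C, and D). Below is a minimal sketch of that idea under stated assumptions: the membership dictionary, pathway names, template wording, and function names are all hypothetical placeholders for the actual generation pipeline.

```python
# Toy gene -> pathway membership dictionary (illustrative entries only;
# the real resource is derived from a curated pathway database).
PATHWAY_GENES = {
    "Pathway P1": ["GENE_A", "GENE_B"],
    "Pathway P2": ["GENE_C"],
}

def make_instance(pathway: str, gene: str) -> dict:
    """Instantiate a generalized modus ponens scheme with membership facts.

    Genes not belonging to the pathway yield non-entailed ("nothing
    follows") instances; member genes yield entailed ones.
    """
    return {
        "premises": [
            f"All genes that belong to {pathway} are involved in its biological process.",
            f"{gene} belongs to {pathway}.",
        ],
        "hypothesis": f"{gene} is involved in the biological process of {pathway}.",
        "label": (
            "entailment"
            if gene in PATHWAY_GENES.get(pathway, [])
            else "non-entailment"
        ),
    }

# Generate entailed instances for every member gene of every pathway.
instances = [
    make_instance(pathway, gene)
    for pathway, genes in PATHWAY_GENES.items()
    for gene in genes
]
```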
Authors:
(1) Magdalena Wysocka, National Biomarker Centre, CRUK-MI, Univ. of Manchester, United Kingdom;
(2) Danilo S. Carvalho, National Biomarker Centre, CRUK-MI, Univ. of Manchester, United Kingdom and Department of Computer Science, Univ. of Manchester, United Kingdom;
(3) Oskar Wysocki, National Biomarker Centre, CRUK-MI, Univ. of Manchester, United Kingdom;
(4) Marco Valentino, Idiap Research Institute, Switzerland;
(5) André Freitas, National Biomarker Centre, CRUK-MI, Univ. of Manchester, United Kingdom, Department of Computer Science, Univ. of Manchester, United Kingdom and Idiap Research Institute, Switzerland.
This paper is available on arxiv under CC BY-NC-SA 4.0 license.