Authors:
(1) Magdalena Wysocka, National Biomarker Centre, CRUK-MI, Univ. of Manchester, United Kingdom;
(2) Danilo S. Carvalho, National Biomarker Centre, CRUK-MI, Univ. of Manchester, United Kingdom and Department of Computer Science, Univ. of Manchester, United Kingdom;
(3) Oskar Wysocki, National Biomarker Centre, CRUK-MI, Univ. of Manchester, United Kingdom and Idiap Research Institute, Switzerland;
(4) Marco Valentino, Idiap Research Institute, Switzerland;
(5) André Freitas, National Biomarker Centre, CRUK-MI, Univ. of Manchester, United Kingdom, Department of Computer Science, Univ. of Manchester, United Kingdom and Idiap Research Institute, Switzerland.
A. Formalization of the SylloBio-NLI Resource Generation Process
B. Formalization of Tasks 1 and 2
C. Dictionary of gene and pathway membership
D. Domain-specific pipeline for creating NL instances and E. Accessing LLMs
H. Prompting LLMs - Zero-shot prompts
I. Prompting LLMs - Few-shot prompts
J. Results: Misaligned Instruction-Response
K. Results: Ambiguous Impact of Distractors on Reasoning
L. Results: Models Prioritize Contextual Knowledge Over Background Knowledge
M. Supplementary Figures and N. Supplementary Tables
Syllogistic reasoning is crucial for Natural Language Inference (NLI). This capability is particularly significant in specialized domains such as biomedicine, where it can support automatic evidence interpretation and scientific discovery. This paper presents SylloBio-NLI[1], a novel framework that leverages external ontologies to systematically instantiate diverse syllogistic arguments for biomedical NLI. We employ SylloBio-NLI to evaluate Large Language Models (LLMs) on identifying valid conclusions and extracting supporting evidence across 28 syllogistic schemes instantiated with human genome pathways. Extensive experiments reveal that biomedical syllogistic reasoning is particularly challenging for zero-shot LLMs, whose average accuracy ranges from 70% on generalized modus ponens down to 23% on disjunctive syllogism. At the same time, we found that few-shot prompting can boost the performance of different LLMs, including Gemma (+14%) and Llama-3 (+43%). However, a deeper analysis shows that both techniques exhibit high sensitivity to superficial lexical variations, highlighting a dependency between reliability, model architecture, and pre-training regime. Overall, our results indicate that, while in-context examples have the potential to elicit syllogistic reasoning in LLMs, existing models are still far from achieving the robustness and consistency required for safe biomedical NLI applications.
Syllogistic reasoning – i.e., the process of deriving valid conclusions from premises through the systematic application of abstract reasoning schemes – is a fundamental type of inference for developing and evaluating Natural Language Inference (NLI) models that can reason over textual evidence at scale MacCartney and Manning [2009], Wu et al. [2023]. Within specialised domains such as biomedicine, the ability to reason over natural language can have significant practical implications for supporting complex discourse interpretation, scientific discovery, and the development of downstream biomedical and clinical applications Jullien et al. [2023a,b], since syllogistic schemes provide a set of formally defined patterns for controlled inference.
Moreover, determining the reliability and robustness of NLI models with regard to syllogistic reasoning is a crucial type of assessment, particularly in critical domains requiring safety guarantees Eisape et al. [2024], Jullien et al. [2024]. This type of evaluation is further motivated by two key factors. First, syllogistic reasoning is prevalent in discourse at large and, therefore, it is expected that approaches based on Large Language Models (LLMs) are exposed to a variety of reasoning schemes during pre-training Wu et al. [2023]. Second, a model that learns to perform syllogistic reasoning should intrinsically possess the ability to generalize to different domains, regardless of the specific world knowledge acquired during pre-training Kim et al. [2024]. This is because the ability to perform syllogistic reasoning should be content-independent, that is, the ability to derive logically valid conclusions is only a function of the formal logical schemes and should be independent of its concrete instantiation. However, despite the importance of syllogistic reasoning for NLI and the abundance of benchmarks involving commonsense knowledge Yu et al. [2023], resources for assessing how systematic reasoning capabilities transfer to specialised domains are still scarce and require substantial expert-level annotation effort to guarantee quality and correctness Zhao et al. [2024], Eisape et al. [2024], Porada et al. [2022], Wysocka et al. [2024].
This paper focuses on advancing the availability of resources for biomedical syllogistic reasoning along with our understanding of the capabilities of state-of-the-art NLI models. In particular, we propose SylloBio-NLI, a novel framework for the automatic generation of NLI resources for evaluating syllogistic reasoning within biomedical domains. Specifically, we demonstrate how external domain-specific resources such as ontologies and thesauri can be leveraged to instantiate a wide range of syllogistic schemes for biomedical NLI tasks, including textual entailment and evidence extraction. The methodological framework behind SylloBio-NLI is designed to address the annotation scarcity problem for the granular evaluation of complex reasoning in specialized domains. The framework minimises the human annotation effort while guaranteeing the correctness of the generated data by jointly leveraging explicit domain knowledge in the ontologies and the systematicity of known syllogistic schemes (see Fig. 1).
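The idea of instantiating a syllogistic scheme from ontology facts can be illustrated with a minimal sketch. It is not the SylloBio-NLI pipeline itself: the membership dictionary, the gene and pathway names, the sentence templates, and the `NLIInstance` and `generalized_modus_ponens` names are all hypothetical placeholders chosen for illustration. The point it shows is that the gold label is derived from the membership facts rather than from manual annotation.

```python
from dataclasses import dataclass

# Hypothetical toy dictionary (pathway -> member genes); placeholder names,
# not actual Reactome content.
PATHWAY_MEMBERS = {
    "Pathway A": {"GENE1", "GENE2"},
    "Pathway B": {"GENE3"},
}


@dataclass(frozen=True)
class NLIInstance:
    premises: tuple   # natural language premises
    hypothesis: str   # candidate conclusion
    label: bool       # True if the hypothesis follows from the premises


def generalized_modus_ponens(gene, pathway, members):
    """Fill the scheme 'every member of P is involved in P; g is a member
    of P; therefore g is involved in P' with facts from `members`."""
    is_member = gene in members[pathway]
    membership = (
        f"{gene} is a member of {pathway}."
        if is_member
        else f"{gene} is not a member of {pathway}."
    )
    premises = (
        f"Every gene that is a member of {pathway} is involved in {pathway}.",
        membership,
    )
    hypothesis = f"{gene} is involved in {pathway}."
    # The conclusion follows only when the membership premise is affirmative;
    # otherwise the premises do not entail it.
    return NLIInstance(premises, hypothesis, is_member)


if __name__ == "__main__":
    for gene in ("GENE1", "GENE3"):
        inst = generalized_modus_ponens(gene, "Pathway A", PATHWAY_MEMBERS)
        print(inst.premises, "=>", inst.hypothesis, "| entailed:", inst.label)
```

Because each label is computed from the dictionary, swapping in a different scheme template or a different ontology would change the generated text without requiring new human labels, under the assumptions stated above.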
By instantiating SylloBio-NLI on human biological pathways using Reactome Croft et al. [2014], we evaluate the domain-specific reasoning capabilities of 8 open-source LLMs on 28 syllogistic schemes, comparing their performance in zero-shot (ZS) and few-shot (FS) settings. An extensive empirical evaluation led to the following findings and conclusions:
We determine that LLMs exhibit surprisingly low ZS performance on biomedical syllogistic arguments, with an average accuracy ranging from 70% on generalised modus ponens down to 23% on disjunctive syllogism (where random performance is equal to 50%). This low performance is generally shared across model families (i.e., Llama-3, Mistral, Gemma, BioMistral) and pre-training regimes (language modelling and instruction-tuning).
We found that the FS setting can improve the performance of different LLMs. In particular, we observe a significant boost for both Gemma and Llama-3, with an average increase in F1-score of 14% and 43% respectively. At the same time, the experiments reveal that such improvement is inconsistent across model families and pre-training regimes.
We perform a robustness analysis adopting logically equivalent variations of the same syllogistic schemes by rephrasing the arguments via negations, complex predicates, and De Morgan’s laws. Such analysis reveals that both ZS and FS techniques are highly sensitive to surface-form and lexical variations, demonstrating a shared inability to systematically abstract the underlying reasoning rules required to derive valid conclusions. These results indicate that, while FS has the potential to elicit syllogistic reasoning in LLMs, existing models are still far from achieving the robustness and consistency required for safe biomedical NLI.
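The notion of logically equivalent variations used in the robustness analysis can be made concrete with a small truth-table check. This is a minimal sketch for illustration only, not part of the paper's evaluation code; the helper names `equivalent` and `is_valid` are assumptions. It verifies one De Morgan rewrite and the validity of disjunctive syllogism over two propositional variables.

```python
from itertools import product


def equivalent(f, g, n_vars=2):
    """True if f and g agree on every truth assignment."""
    return all(
        f(*vals) == g(*vals)
        for vals in product([False, True], repeat=n_vars)
    )


def is_valid(premises, conclusion, n_vars=2):
    """True if the conclusion holds in every assignment satisfying all premises."""
    return all(
        conclusion(*vals)
        for vals in product([False, True], repeat=n_vars)
        if all(p(*vals) for p in premises)
    )


# p: "the gene is a member of pathway P"; q: "the gene is a member of pathway Q"
def negated_disjunction(p, q):
    return not (p or q)          # "it is not the case that p or q"


def conjunction_of_negations(p, q):
    return (not p) and (not q)   # De Morgan rewrite: "not p and not q"


if __name__ == "__main__":
    # The two surface forms express the same proposition.
    print(equivalent(negated_disjunction, conjunction_of_negations))
    # Disjunctive syllogism: from (p or q) and (not p), conclude q.
    print(is_valid([lambda p, q: p or q, lambda p, q: not p], lambda p, q: q))
```

A model that had abstracted the underlying rule would be expected to treat both surface forms identically, since the check above confirms they cannot differ in truth value.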
Overall, the above findings suggest that, upon granular inference scrutiny, the reasoning mechanisms induced in LLMs still confound formal and material inference patterns. Moreover, while there are FS intervention mechanisms which can improve models’ performance, delivering controlled specialised syllogistic reasoning remains a challenge for LLMs at large.
To the best of our knowledge, this is the first work focusing on designing a methodology for evaluating syllogistic reasoning within specialised domains, thoroughly assessing the performance of LLMs, and releasing a domain-specific resource for supporting future work in the field. The code for the dataset generation and the evaluation pipeline is fully available online [2].
This paper is available on arXiv under the CC BY-NC-SA 4.0 license.
[1] code and dataset available at: anonymous_url.com
[2] https://anonymous-url.com