A. Formalization of the SylloBio-NLI Resource Generation Process
B. Formalization of Tasks 1 and 2
C. Dictionary of gene and pathway membership
D. Domain-specific pipeline for creating NL instances
E. Accessing LLMs
H. Prompting LLMs - Zero-shot prompts
I. Prompting LLMs - Few-shot prompts
J. Results: Misaligned Instruction-Response
K. Results: Ambiguous Impact of Distractors on Reasoning
L. Results: Models Prioritize Contextual Knowledge Over Background Knowledge
M. Supplementary Figures
N. Supplementary Tables
We used the proposed methodology and resources to assess the syllogistic NLI inference properties of eight open-source LLMs, covering a range of architectures and pre-training regimes: mistralai/Mistral-7B-v0.1 and mistralai/Mistral-7B-Instruct-v0.2 Jiang et al. [2023], mistralai/Mixtral-8x7B-Instruct-v0.1 Mistral [2023], google/gemma-7b and google/gemma-7b-it Gemma and Google [2024], meta-llama/Meta-Llama-3-8B and meta-llama/Meta-Llama-3-8B-Instruct AI@Meta [2024], and BioMistral/BioMistral-7B Labrak et al. [2024]. Details of the models, access, parameters, and prompts are available in Appendices E, F, H, and I.
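For reference, the minimal sketch below shows how one of these checkpoints can be queried through the Hugging Face transformers API. The prompt wording and decoding parameters are illustrative placeholders, not the exact templates and settings documented in Appendices F and H.

```python
# Minimal sketch: querying one of the evaluated checkpoints with a
# zero-shot syllogistic NLI prompt via the Hugging Face transformers API.
# Prompt wording and decoding settings are illustrative, not the paper's.
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "meta-llama/Meta-Llama-3-8B-Instruct"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, device_map="auto")

prompt = (
    "Premise 1: If a gene is part of the TP53 regulation pathway, "
    "it is part of the gene expression pathway.\n"
    "Premise 2: Gene MDM2 is part of the TP53 regulation pathway.\n"
    "Conclusion: Gene MDM2 is part of the gene expression pathway.\n"
    "Is the conclusion true or false? Answer with True or False."
)

inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=5, do_sample=False)
# Decode only the newly generated tokens, skipping the echoed prompt.
answer = tokenizer.decode(
    outputs[0][inputs["input_ids"].shape[1]:], skip_special_tokens=True
)
print(answer.strip())  # a valid generalized modus ponens, so "True" is expected
```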
The main results on Tasks 1 and 2 are reported in Table 1. They support the following main conclusions:
ZS LLMs struggle with controlled, domain-specific syllogistic inference. Overall, the results demonstrate that LLMs face significant challenges when performing biomedical syllogistic reasoning in a ZS setting (Table 1). In fact, we found that the majority of the models fail to exceed the 50% random baseline on Task 1, with Gemma-7B-it being the only exception at 64% accuracy. Similarly, in Task 2, Gemma-7B-it was again the best-performing model, with a reasoning accuracy of 55%. When comparing performance across the two tasks, we observed that the models scored lower on premise selection than on textual entailment, indicating a shared inability of ZS models to identify the sufficient and necessary set of premises required for the entailment to hold. In addition to low performance, we found a discrepancy between faithfulness and accuracy on both tasks, which highlights an inability to capture the underlying reasoning phenomena along with the presence of biases in the inference process. For example, Gemma-7B achieves ZS accuracy and F1-score of ≈ 0.5-0.6, while its faithfulness plummets below 0.1 (Fig. 3). This is due to the model's tendency to frequently output "True" regardless of the truth value of the conclusion (see Figs. 8-11 in Appendix M and Appendix J). In contrast, this phenomenon is less prominent in Gemma-7B-it, whose faithfulness is comparable to its accuracy on both Task 1 (0.66) and Task 2 (0.65). Overall, the results reveal that instruction-tuned models can achieve higher performance in the ZS setting, while models pre-trained purely on language modelling struggle to follow the provided natural language instructions. Moreover, we observed that this trend is independent of the specific knowledge the LLM is exposed to during pre-training: BioMistral-7B, the only biomedical domain-specific model, entirely failed to follow the instructions and to generate outputs relevant to the target tasks.
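To make the accuracy-faithfulness gap concrete, the toy computation below (not the paper's exact faithfulness metric) shows how a degenerate always-"True" responder can reach ≈ 0.5 accuracy and a non-trivial F1 on a balanced benchmark while exhibiting total response bias.

```python
# Rough illustration (not the paper's faithfulness metric): a model that
# answers "True" indiscriminately scores ~0.5 accuracy on a balanced set
# while showing an extreme response bias.
from sklearn.metrics import accuracy_score, f1_score

gold = ["True", "False", "True", "False", "True", "False"]
pred = ["True", "True", "True", "True", "True", "True"]  # always-"True" responder

acc = accuracy_score(gold, pred)
f1 = f1_score([g == "True" for g in gold], [p == "True" for p in pred])
true_rate = pred.count("True") / len(pred)

print(f"accuracy={acc:.2f}, F1={f1:.2f}, P(output='True')={true_rate:.2f}")
# accuracy=0.50, F1=0.67, P(output='True')=1.00 -> decent scores, no real reasoning
```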
FS can improve performance, but its impact is inconsistent across model families and pre-training regimes. We found that the FS setting can improve the performance of different LLMs. In particular, we observed a significant boost for both Gemma-7b and Meta-Llama-3-8b, with an average increase in F1-score of 14% and 43%, respectively. However, this improvement is inconsistent across model families and pre-training regimes, indicating variability in how different models handle prompt types in biomedical syllogistic reasoning tasks (Fig. 4). In particular, the results show that FS improves performance on all schemes only for Gemma-7b and Llama-3. For Mixtral-8x7B Instruct, Mistral-7B-Instruct, and Gemma-7b-it, overall accuracy increases only for a subset of schemes, while for others it drops significantly. This inconsistency is exemplified by Mixtral-8x7B Instruct, which achieves worse accuracy in FS than in ZS on both tasks. Overall, the FS results highlight distinct strengths and weaknesses that depend heavily on the prompting strategy, the pre-training setup, and the model family. At the same time, the improvements observed in FS for some models indicate a potential for enhanced performance when contextual information is available, although this gain is not universal across models.
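As an illustration of the FS setting, the sketch below assembles a prompt by prepending worked exemplars to the test instance; the exemplar wording and template are hypothetical stand-ins for the actual prompts listed in Appendix I.

```python
# Hedged sketch of few-shot prompt assembly: k worked exemplars are
# prepended to the test instance. Templates are illustrative only.
def build_fs_prompt(exemplars, test_instance):
    parts = []
    for premises, conclusion, label in exemplars:
        parts.append(f"{premises}\nConclusion: {conclusion}\nAnswer: {label}")
    parts.append(
        f"{test_instance['premises']}\n"
        f"Conclusion: {test_instance['conclusion']}\nAnswer:"
    )
    return "\n\n".join(parts)

exemplars = [
    ("Premise 1: If a gene is part of pathway A, it is part of pathway B.\n"
     "Premise 2: Gene X is part of pathway A.",
     "Gene X is part of pathway B.", "True"),
]
test = {
    "premises": "Premise 1: If a gene is part of pathway C, it is part of pathway D.\n"
                "Premise 2: Gene Y is not part of pathway D.",
    "conclusion": "Gene Y is part of pathway C.",
}
print(build_fs_prompt(exemplars, test))
```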
Significant variability in model accuracy is observed across different syllogistic schemes. In general, we found that LLMs exhibit variable performance depending on the specific syllogistic scheme used for evaluation. The results indicate that LLMs perform better on generalized modus ponens (three out of four models), with accuracy levels ranging from 0.71 to 0.98 in the FS setting on Task 1 (Fig. 5, Table 5 in Appendix N) and reasoning accuracy scores ranging from 0.56 to 0.74 on Task 2 (Table 7 in Appendix N). In the ZS setting, by contrast, the Gemma-7B-it model achieved its highest performance on generalized modus tollens (accuracy = 0.78 on Task 1; RA = 0.74 on Task 2), while in the FS setting its highest performances were registered on generalized dilemma on Task 1 (accuracy = 0.85) and generalized modus ponens on Task 2 (RA = 0.74).
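For reference, the standard first-order forms of the schemes named in this section and below are given here; the paper's generalized variants instantiate these over gene/pathway membership, and the exact formalization is given in the appendices.

```latex
% Standard first-order forms of the discussed schemes; the paper's
% "generalized" variants quantify over genes and pathway membership.
\begin{align*}
\text{Gen. modus ponens:}     &\quad \forall x\,(P(x) \to Q(x)),\; P(a) \;\vdash\; Q(a) \\
\text{Gen. modus tollens:}    &\quad \forall x\,(P(x) \to Q(x)),\; \lnot Q(a) \;\vdash\; \lnot P(a) \\
\text{Disjunctive syllogism:} &\quad P(a) \lor Q(a),\; \lnot P(a) \;\vdash\; Q(a) \\
\text{Gen. dilemma:}          &\quad \forall x\,(P(x) \to R(x)),\; \forall x\,(Q(x) \to R(x)),\; P(a) \lor Q(a) \;\vdash\; R(a)
\end{align*}
```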
We performed a robustness analysis by applying surface-form variations to the syllogistic schemes. Specifically, we rephrased the natural language syllogistic arguments via negation, complex predicates, and De Morgan's laws, keeping the underlying logical relation between premises and conclusion unaltered (see Fig. 6 in Appendix D). The results of this analysis are reported in Figure 5.
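The sketch below illustrates one such variation: a De Morgan rewrite that changes the surface form of a negated disjunction while preserving its truth conditions. The wording is hypothetical; the actual rephrasing templates are shown in Fig. 6 in Appendix D.

```python
# Illustrative De Morgan surface-form variant: not(A or B) == (not A) and (not B),
# so the gold label of any argument containing the premise is unchanged.
def de_morgan_variant(gene, p1, p2):
    """Rewrite a negated disjunction as an equivalent conjunction of negations."""
    original = f"It is not the case that {gene} is part of {p1} or of {p2}."
    variant = f"{gene} is not part of {p1} and {gene} is not part of {p2}."
    return original, variant

orig, var = de_morgan_variant("Gene EGFR", "Signal Transduction", "Apoptosis")
print(orig)
print(var)  # logically equivalent, different surface form
```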
This intervention reveals that LLMs are highly sensitive to surface forms and lexical variations, with significant variability across different schemes (Fig. 5). At the same time, we found that in the FS setting, model responses demonstrated overall greater consistency, with smaller fluctuations across variants.
In Task 1, disjunctive syllogism was the most challenging scheme in the ZS setting for all four models, with none exceeding an accuracy of 0.35, significantly worse than random guessing. While FS slightly improved accuracy, disjunctive syllogism remained the most challenging scheme. Overall, only Gemma-7B in the FS setting handled all syllogistic schemes and their variants in Task 1 effectively, demonstrating a higher level of consistency. The other models exhibited significant variability depending on the specific surface form, highlighting a shared inability to systematically abstract the underlying reasoning rules required to derive valid conclusions.
Finally, we performed an analysis in which we introduced an increasing number of distractors (i.e., irrelevant premises) into the syllogistic arguments and replaced existing Reactome genes with synthetic gene names, generating arguments that are independent of medical content in order to assess the impact of factuality on reasoning (see Appendices K, L).
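A minimal sketch of these two interventions is shown below, assuming a simple list-of-premises representation; the helper names, distractor pool, and synthetic identifier format are illustrative, not the paper's implementation.

```python
import random

# Hedged sketch of the two interventions: (1) inserting irrelevant premises
# (distractors) and (2) replacing real gene symbols with synthetic names.
def add_distractors(premises, distractor_pool, k, seed=0):
    rng = random.Random(seed)
    noisy = premises + rng.sample(distractor_pool, k)
    rng.shuffle(noisy)  # interleave distractors with the relevant premises
    return noisy

def synthesize_genes(text, real_to_synthetic):
    for real, synthetic in real_to_synthetic.items():
        text = text.replace(real, synthetic)
    return text

premises = [
    "If a gene is part of Apoptosis, it is part of Programmed Cell Death.",
    "Gene CASP3 is part of Apoptosis.",
]
distractors = [
    "Gene BRCA1 is part of DNA Repair.",
    "Gene INS is part of Metabolism.",
]
noisy = add_distractors(premises, distractors, k=2)
print(synthesize_genes(" ".join(noisy), {"CASP3": "GENE-0731"}))
```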
We found that LLMs show varied sensitivity to distractors: models like Gemma-7b exhibit a significant decline in reasoning accuracy as the number of distractors increases, while Mistral-7B Instruct improves in some cases. The impact is scheme-dependent, revealing that more complex syllogisms are particularly affected by increasing amounts of distracting information (Figs. 12-15 in Appendix M).
Furthermore, contrary to previous work showing that LLMs perform better on syllogisms that align with commonsense knowledge Eisape et al. [2024], Kim et al. [2024], we found stable performance when intervening on the factual correctness of the biomedical arguments. These results indicate that (1) logical structure and contextual information have a greater impact on output generation than biomedical knowledge in both ZS and FS settings, and (2) existing models may have had limited exposure to human genome pathway information during pre-training, as they are unaffected by the substitution of real gene names with synthetic ones (Fig. 7 in Appendix L).
Authors:
(1) Magdalena Wysocka, National Biomarker Centre, CRUK-MI, Univ. of Manchester, United Kingdom;
(2) Danilo S. Carvalho, National Biomarker Centre, CRUK-MI, Univ. of Manchester, United Kingdom and Department of Computer Science, Univ. of Manchester, United Kingdom;
(3) Oskar Wysocki, National Biomarker Centre, CRUK-MI, Univ. of Manchester, United Kingdom;
(4) Marco Valentino, Idiap Research Institute, Switzerland;
(5) André Freitas, National Biomarker Centre, CRUK-MI, Univ. of Manchester, United Kingdom, Department of Computer Science, Univ. of Manchester, United Kingdom and Idiap Research Institute, Switzerland.
This paper is available on arxiv under CC BY-NC-SA 4.0 license.