A. Formalization of the SylloBio-NLI Resource Generation Process
B. Formalization of Tasks 1 and 2
C. Dictionary of gene and pathway membership
D. Domain-specific pipeline for creating NL instances
E. Accessing LLMs
H. Prompting LLMs - Zero-shot prompts
I. Prompting LLMs - Few-shot prompts
J. Results: Misaligned Instruction-Response
K. Results: Ambiguous Impact of Distractors on Reasoning
L. Results: Models Prioritize Contextual Knowledge Over Background Knowledge
M. Supplementary Figures
N. Supplementary Tables
The entire experimental setup was implemented as a Python package consisting of five modules:
• pathways: defines the ontological relations and operations for biological pathways, as well as the logic to retrieve and transform data from the domain ontology.
• logic2nl: translates logic formulas into NL statements (premises, conclusions), using parameterised scheme templates.
• llm: provides access to LLMs, and facilitates logging.
• experiments: defines all test logic, parameterisation and metrics for the experiments.
• main: orchestrates all the tests and aggregates results.
The LLMs were loaded and run using the HuggingFace transformers library (v4.43.3). Prompts were passed directly to the models through the generate method, with sampling disabled.
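As a rough illustration of this decoding setup, the sketch below runs batched generation with sampling disabled. It is not the authors' code: the helper names `batched` and `run_greedy`, the `max_new_tokens` value, and the left-padding choice are assumptions made for the example. Only the use of the transformers library and of `generate` without sampling comes from the text.

```python
from typing import Iterator, List, Sequence


def batched(items: Sequence[str], batch_size: int) -> Iterator[List[str]]:
    """Yield consecutive slices of `items` with at most `batch_size` elements."""
    for start in range(0, len(items), batch_size):
        yield list(items[start:start + batch_size])


def run_greedy(model_name: str, prompts: Sequence[str],
               batch_size: int = 20, max_new_tokens: int = 64) -> List[str]:
    """Hypothetical helper: generate one response per prompt without sampling."""
    # Imported lazily so the batching helper works without the library installed.
    from transformers import AutoModelForCausalLM, AutoTokenizer

    tokenizer = AutoTokenizer.from_pretrained(model_name)
    # Assumption: left padding, the usual choice for batched decoder-only models.
    tokenizer.padding_side = "left"
    if tokenizer.pad_token is None:
        tokenizer.pad_token = tokenizer.eos_token
    model = AutoModelForCausalLM.from_pretrained(model_name)
    model.eval()

    responses: List[str] = []
    for batch in batched(prompts, batch_size):
        inputs = tokenizer(batch, return_tensors="pt", padding=True)
        # do_sample=False disables sampling, as stated in the text.
        outputs = model.generate(**inputs, do_sample=False,
                                 max_new_tokens=max_new_tokens)
        # Keep only the newly generated tokens, dropping the echoed prompt.
        new_tokens = outputs[:, inputs["input_ids"].shape[1]:]
        responses.extend(
            tokenizer.batch_decode(new_tokens, skip_special_tokens=True)
        )
    return responses
```

Disabling sampling makes the output for a given prompt reproducible across runs, which matters when the same instance is scored under several prompt settings.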
Parameters were set as follows:
• Number of premises = 2: the number of valid premises per instance (factual argumentative text).
• Max distractors = 5: maximum number of distractors to be added per instance.
• Subset size = 200: maximum number of positive and negative instances for each scheme.
• Batch size = 20: number of instances evaluated simultaneously.
• ICL: whether the in-context learning prompt is used.
• Model: the LLM to be evaluated.
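The parameters above could be gathered in a single configuration object, as in the following minimal sketch. The class name `ExperimentConfig`, its field names, and the `num_batches` helper are hypothetical and do not come from the described package; the default values are the ones listed above.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ExperimentConfig:
    """Hypothetical container for the experiment parameters listed above."""
    model: str                 # identifier of the LLM to be evaluated
    icl: bool = False          # True: in-context learning prompt, False: zero-shot
    num_premises: int = 2      # valid premises per instance
    max_distractors: int = 5   # maximum number of distractors per instance
    subset_size: int = 200     # max positive and negative instances per scheme
    batch_size: int = 20       # instances evaluated simultaneously

    def __post_init__(self) -> None:
        # Reject values that would make the experiment loop ill-defined.
        if min(self.num_premises, self.subset_size, self.batch_size) < 1:
            raise ValueError("premises, subset size and batch size must be >= 1")
        if self.max_distractors < 0:
            raise ValueError("max_distractors must be >= 0")

    def num_batches(self, num_instances: int) -> int:
        """Number of batches needed to cover `num_instances` (ceiling division)."""
        return -(-num_instances // self.batch_size)


# "placeholder-model" is an illustrative name, not one of the evaluated models.
cfg = ExperimentConfig(model="placeholder-model", icl=True)
print(cfg.num_batches(400))  # 20 batches for 200 positive + 200 negative instances
```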
For both Task 1 and Task 2, each model was analyzed across all 28 syllogistic schemes, with responses evaluated from a total of 11,200 instances (28 schemes × (200 positive + 200 negative)). The exceptions were: Generalized Modus Ponens - complex predicates (18 instances), Hypothetical Syllogism 1 - base (202 instances) and Generalized Contraposition - negation (202 instances). In these cases, the ground-truth data limits the number of factually distinct instances that can be generated. For both tasks, this amounted to a total of 211,840 instances, with performance comparisons made between ZS and ICL settings, leading to a total of 423,680 prompts.
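The two products stated in this paragraph can be reproduced with a few lines of arithmetic. The sketch below uses only figures quoted in the text; the variable names are illustrative, and the reported total of 211,840 is taken as given rather than re-derived.

```python
# Figures quoted in the paragraph above; variable names are illustrative.
NUM_SCHEMES = 28
POSITIVE_PER_SCHEME = 200
NEGATIVE_PER_SCHEME = 200

# Nominal number of instances per model and task, before the listed exceptions.
nominal_instances = NUM_SCHEMES * (POSITIVE_PER_SCHEME + NEGATIVE_PER_SCHEME)

# Reported total, taken as given from the text.
REPORTED_TOTAL_INSTANCES = 211_840
NUM_PROMPT_SETTINGS = 2  # zero-shot (ZS) and in-context learning (ICL)
total_prompts = REPORTED_TOTAL_INSTANCES * NUM_PROMPT_SETTINGS

print(nominal_instances)  # 11200
print(total_prompts)      # 423680
```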
When examining the impact of distractors, responses were analyzed across five variants of each argument scheme, reflecting different numbers of irrelevant premises (from 1 to 5 distractors). For each scheme and model, this resulted in 1,000 responses (200 prompts per scheme × 5 variants), with the exception of the aforementioned cases. Across all schemes, this totaled 26,480 responses for one model, and 211,840 responses were analyzed for all models for each number of distractors, totaling 1,059,200 responses. Again, comparisons were made between ZS and ICL settings, doubling the previous number.
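The response counts for the distractor analysis follow from the figures quoted above, as the short check below shows. The per-model figure of 26,480 is taken as reported, the number of models (8) comes from the entity-name comparison, and the variable names are illustrative.

```python
# Figures quoted in the text; variable names are illustrative.
PROMPTS_PER_SCHEME = 200
DISTRACTOR_LEVELS = range(1, 6)      # 1 to 5 distractors
REPORTED_PER_MODEL = 26_480          # responses for one model, as reported
NUM_MODELS = 8                       # stated in the entity-name comparison

responses_per_scheme = PROMPTS_PER_SCHEME * len(DISTRACTOR_LEVELS)
per_level_all_models = REPORTED_PER_MODEL * NUM_MODELS
all_levels = per_level_all_models * len(DISTRACTOR_LEVELS)
with_zs_and_icl = all_levels * 2     # ZS and ICL settings

print(responses_per_scheme)   # 1000
print(per_level_all_models)   # 211840
print(all_levels)             # 1059200
print(with_zs_and_icl)        # 2118400
```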
For both Task 1 and Task 2, performance was compared using actual gene entity names versus synthetic names across two selected schemes. Each scheme was tested with 400 instances (200 positive, 200 negative), resulting in 12,800 instances per task (400 instances × 2 schemes × 2 entity types × 8 models).
The experiments were run on a computer with an AMD EPYC 7413 24-Core CPU, 128GB of available RAM and 2 × NVIDIA A100-SXM4-80GB GPUs.
Authors:
(1) Magdalena Wysocka, National Biomarker Centre, CRUK-MI, Univ. of Manchester, United Kingdom;
(2) Danilo S. Carvalho, National Biomarker Centre, CRUK-MI, Univ. of Manchester, United Kingdom and Department of Computer Science, Univ. of Manchester, United Kingdom;
(3) Oskar Wysocki, National Biomarker Centre, CRUK-MI, Univ. of Manchester, United Kingdom;
(4) Marco Valentino, Idiap Research Institute, Switzerland;
(5) André Freitas, National Biomarker Centre, CRUK-MI, Univ. of Manchester, United Kingdom, Department of Computer Science, Univ. of Manchester, United Kingdom and Idiap Research Institute, Switzerland.
This paper is available on arXiv under the CC BY-NC-SA 4.0 license.