How AI Models Handle Complex Biomedical Reasoning

Table of Links

Abstract and Introduction
SylloBio-NLI
Empirical Evaluation
Related Work
Conclusions
Limitations and References

A. Formalization of the SylloBio-NLI Resource Generation Process
B. Formalization of Tasks 1 and 2
C. Dictionary of gene and pathway membership
D. Domain-specific pipeline for creating NL instances and E Accessing LLMs
F. Experimental Details
G. Evaluation Metrics
H. Prompting LLMs - Zero-shot prompts
I. Prompting LLMs - Few-shot prompts
J. Results: Misaligned Instruction-Response
K. Results: Ambiguous Impact of Distractors on Reasoning
L. Results: Models Prioritize Contextual Knowledge Over Background Knowledge
M Supplementary Figures and N Supplementary Tables

2 SylloBio-NLI

We introduce a general framework for developing resources to systematically evaluate the syllogistic reasoning capabilities of NLI models within biomedical domains (Appendix A). In particular, SylloBio-NLI leverages the systematicity of known syllogistic schemes Betz et al. [2021] along with external thesauri/ontologies to scale up the generation of deductively valid arguments and to characterise inference performance across a granular set of syllogistic schemes.

To this end, we focus on seven general syllogistic schemes, namely generalized modus ponens, generalized contraposition, hypothetical syllogism 1, hypothetical syllogism 3, generalized modus tollens, disjunctive syllogism and generalized dilemma. These base schemes are then used to generate 28 different variations of syllogistic argument templates by applying negation, complex predicates, and De Morgan’s laws, which can then be instantiated with concrete domain knowledge (see Fig. 6 in Appendix D).

The overall methodology is outlined by the stages in Fig. 2. First, for each syllogistic scheme, a corresponding formal argument scheme (consisting of abstract premises and conclusion expressed in first-order logic) is created (Fig. 2A).
Next, each symbolic formula in the formal scheme is individually replaced by a natural language domain-specific sentence schema (Fig. 2B). For example, the formulae for the base schema of generalized modus ponens (i.e., Premise 1: ∀x (Fx ⇒ Gx), Premise 2: Fa, Conclusion: Ga) can be translated into the natural language template: Premise 1: “Every member of F is a member of G”. Premise 2: “a is a member of F”. Conclusion: “a is a member of G”. Subsequently, the entity and property placeholders in the natural language template are replaced argument-wise with domain-specific entities and predicates extracted from an external ontology (Fig. 2C). In this process, the syllogistic schemes provide the logical validity component of the formal inference, while the ontology subdomain and its associated mapping to the schemes provide the soundness (content-based) of the premises and conclusions. Hence, we obtain instances of syllogistic arguments in natural language that are stored in a knowledge base of premises and conclusions. Finally, the natural language syllogistic arguments can be structured to create prompts for evaluating LLMs. This involves introducing the syllogistic argument, clearly framing the premises, and specifying the instructions for the inference, as illustrated in Appendix H.

2.1 Biomedical NLI Tasks

SylloBio-NLI takes advantage of the corpus’s design, where every premise and conclusion in each argument is explicitly stated and fully transparent (Fig. 2). Once the full set of natural language syllogistic arguments is generated, the premises and conclusions in each instance can be used to define two different NLI tasks:

Task 1. A textual entailment task where the model determines whether the conclusion logically follows from the premises, with the output being ‘True’ or ‘False’. The task is designed using natural language syllogisms generated from formal schemes, with valid and invalid argument instances.
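
As a rough illustration of how a Task 1 instance could be assembled from the generalized modus ponens template quoted above, the following minimal Python sketch fills the placeholders and wraps the result in a True/False prompt. Several parts are assumptions and not taken from the paper: the names `GENE_A`, `Pathway Y` and `Pathway Z` are hypothetical placeholders rather than real Reactome entries, the invalid variant (affirming the consequent) is only one conceivable way to obtain an invalid instance, and the prompt wording is invented for the example (the actual prompts are given in Appendix H).

```python
from dataclasses import dataclass
from typing import Dict, List

# Natural language template for generalized modus ponens, as quoted in the text.
GMP_TEMPLATE = {
    "premises": [
        "Every member of {F} is a member of {G}.",
        "{a} is a member of {F}.",
    ],
    "conclusion": "{a} is a member of {G}.",
}

# ASSUMPTION: an invalid counterpart built by affirming the consequent.
# This section does not say how invalid instances are constructed.
INVALID_TEMPLATE = {
    "premises": [
        "Every member of {F} is a member of {G}.",
        "{a} is a member of {G}.",
    ],
    "conclusion": "{a} is a member of {F}.",
}


@dataclass
class Task1Instance:
    premises: List[str]
    conclusion: str
    label: bool  # True if the conclusion follows from the premises


def instantiate(template: Dict, label: bool, **slots: str) -> Task1Instance:
    """Replace the entity (a) and predicate (F, G) placeholders with names."""
    return Task1Instance(
        premises=[p.format(**slots) for p in template["premises"]],
        conclusion=template["conclusion"].format(**slots),
        label=label,
    )


def to_prompt(inst: Task1Instance) -> str:
    """Frame the argument as a True/False question (hypothetical wording)."""
    lines = [f"Premise {i}: {p}" for i, p in enumerate(inst.premises, start=1)]
    lines.append(f"Conclusion: {inst.conclusion}")
    lines.append(
        "Does the conclusion logically follow from the premises? "
        "Answer True or False."
    )
    return "\n".join(lines)


# Hypothetical placeholder names, not real gene or pathway identifiers.
slots = {"a": "GENE_A", "F": "Pathway Y", "G": "Pathway Z"}
valid = instantiate(GMP_TEMPLATE, True, **slots)
invalid = instantiate(INVALID_TEMPLATE, False, **slots)

print(to_prompt(valid))
```

Keeping the template separate from the slot values mirrors the split described in the text: the template carries the logical form, while the ontology supplies the content.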
The model evaluates each argument’s logical validity, requiring it to reason deductively based on the given premises. The focus is on testing the ability to discern valid reasoning structures in a biomedical context, independent of factual correctness.

Task 2. A premise selection task where the model has to identify which premises are necessary and sufficient to justify the conclusion from Task 1, noting that some premises may be irrelevant. The task involves presenting the model with a set of premises, including both relevant and distractor premises. The model must correctly select the subset of premises that logically support the conclusion, excluding those that do not contribute to the entailment. This tests the ability to filter essential information and assesses the understanding of the logical relationships between premises and conclusions.

A detailed formalization of the tasks can be found in Appendix B.

2.2 Human Genome Pathways for Syllogistic Reasoning Evaluation

To instantiate the SylloBio-NLI methodology, we developed a specialized dataset using Reactome[3], a comprehensive knowledge base containing detailed information on human biological pathways, including gene functions and interactions. Reactome’s hierarchical structure, with its well-defined gene-pathway membership relations, enables a systematic instantiation of the syllogistic arguments. This allows for the efficient generation of domain-specific NLI tasks that assess LLMs’ ability to reason about biological pathway membership, controlling for different levels within the hierarchy. This use case is widely relevant in the context of interpreting pathway-level interactions, disease and treatment response mechanisms at a genomics level Fang et al. [2019].

By focusing on the Disease super-pathway, the largest and most intricate group within Reactome, we generated a diverse set of syllogistic schemes (see Fig. 2) that mirror the complex molecular interactions and regulatory mechanisms underlying disease processes. This approach ensures our evaluation reflects the depth and specificity needed to rigorously test the syllogistic reasoning properties of NLI models in this highly specialised and clinically significant domain. A comprehensive and systematic assessment of these properties is essential if LLMs are to be applied to this domain.

The resulting dataset includes 12,098 entity-gene names, which are used as substitutes for entity placeholders. In contrast, pathway names, derived from different levels of biological hierarchies, are treated as predicate placeholders because they describe relationships or actions involving the entities, rather than being entities themselves. The corpus includes 3,767 complex predicates in total. The premises of the natural language argument are randomly re-ordered to mitigate potential biases during the evaluation (Fig. 2D). Additional details can be found in Appendix C.

2.3 Reasoning Challenges

Domain-specific Reasoning. Biomedical datasets differ from general datasets in that they involve complex semantic structures, such as gene-pathway relationships and hierarchies, where sentences encode intricate biological interactions and dependencies. Reasoning about statements such as "COL1A1 is involved in the TGF-beta signalling pathway" requires not only an understanding of specific gene functions but also the ability to navigate hierarchical biological pathways and their associated relations. This demands a higher level of domain-specific reasoning and the capability to controllably interpret relationships within a highly specialised context.

Material Inference Component. Each sentence in the formal scheme (premises and conclusion) is linked to the domain knowledge, i.e. the membership of genes in a given level of pathway, which belongs to a multi-level hierarchy.
For example, if gene X is a member of pathway Y, and that pathway is a child of a top-level pathway Z, the model must be able to infer that the gene also belongs to the top-level pathway Z.

Formal Inference Component. Syllogisms are examples of content-independent formal inference in which, given the truth value of the premises, the conclusion can be derived through the application of specific logical inference rules. This form of reasoning can pose challenges to models that are incapable of abstracting inference patterns from text and that are affected by content biases in their representation Prange et al. [2023], Kim et al. [2024]. Moreover, the syllogistic schemes in this paper are designed to assess the interpretation of fine-grained logical operators including quantifiers, implications and negation, which have been shown to be particularly challenging for NLI models Pitler et al. [2023].

Authors:

(1) Magdalena Wysocka, National Biomarker Centre, CRUK-MI, Univ. of Manchester, United Kingdom;
(2) Danilo S. Carvalho, National Biomarker Centre, CRUK-MI, Univ. of Manchester, United Kingdom and Department of Computer Science, Univ. of Manchester, United Kingdom;
(3) Oskar Wysocki, National Biomarker Centre, CRUK-MI, Univ. of Manchester, United Kingdom;
(4) Marco Valentino, Idiap Research Institute, Switzerland;
(5) André Freitas, National Biomarker Centre, CRUK-MI, Univ. of Manchester, United Kingdom, Department of Computer Science, Univ. of Manchester, United Kingdom and Idiap Research Institute, Switzerland.

This paper is available on arXiv under a CC BY-NC-SA 4.0 license.

[3] https://reactome.org