
Experimental Setup for Evaluating LLM Performance on Biomedical Syllogistic Tasks

by Large Models, December 13th, 2024

Too Long; Didn't Read

The experimental setup includes five modules for processing syllogistic tasks and evaluating LLM performance. The tests used 11,200 instances across 28 syllogistic schemes, with parameters like distractors and entity types considered for comprehensive analysis. The experiments were conducted on a system with advanced computational resources, including AMD EPYC CPU and NVIDIA A100 GPUs.
  1. Abstract and Introduction
  2. SylloBio-NLI
  3. Empirical Evaluation
  4. Related Work
  5. Conclusions
  6. Limitations and References


A. Formalization of the SylloBio-NLI Resource Generation Process

B. Formalization of Tasks 1 and 2

C. Dictionary of gene and pathway membership

D. Domain-specific pipeline for creating NL instances and E Accessing LLMs

F. Experimental Details

G. Evaluation Metrics

H. Prompting LLMs - Zero-shot prompts

I. Prompting LLMs - Few-shot prompts

J. Results: Misaligned Instruction-Response

K. Results: Ambiguous Impact of Distractors on Reasoning

L. Results: Models Prioritize Contextual Knowledge Over Background Knowledge

M Supplementary Figures and N Supplementary Tables

F Experimental Details

The entire experimental setup was implemented as a Python code package consisting of five modules (a minimal orchestration sketch follows the list):


• pathways: defines the ontological relations and operations for biological pathways, as well as the logic to retrieve and transform data from the domain ontology.


• logic2nl: translates logic formulas into NL statements (premises, conclusions), using parameterised scheme templates.


• llm: provides access to LLMs and facilitates logging.


• experiments: defines all test logic, parameterisation and metrics for the experiments.


• main: orchestrates all the tests and aggregates results.
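As a rough, non-authoritative illustration of how these modules fit together, the stub below sketches the orchestration flow described above; all function bodies, entity names (e.g. GENE_A, Pathway_X) and return values are placeholders rather than the actual package code.

```python
# Illustrative sketch only: stub functions standing in for the five modules above.
# Names and data are placeholders; the real package logic is not reproduced here.

def load_pathway_ontology():                  # pathways: relations + data retrieval
    return {"Pathway_X": ["GENE_A", "GENE_B"]}

def instantiate_schemes(ontology):            # logic2nl: logic formulas -> NL premises/conclusions
    return [{
        "premises": ["GENE_A is a member of Pathway_X.",
                     "Every member of Pathway_X is a member of Pathway_Y."],
        "conclusion": "GENE_A is a member of Pathway_Y.",
        "label": True,
    }]

def query_llm(prompt):                        # llm: model access and logging (stubbed)
    return "True"

def run_experiments(instances):               # experiments: test logic, parameterisation, metrics
    predictions = [query_llm(" ".join(i["premises"] + [i["conclusion"]])) for i in instances]
    return sum(p == str(i["label"]) for p, i in zip(predictions, instances)) / len(instances)

if __name__ == "__main__":                    # main: orchestrates the tests and aggregates results
    print(run_experiments(instantiate_schemes(load_pathway_ontology())))
```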


The LLMs were loaded and run using the HuggingFace transformers library (v4.43.3). Prompts were passed directly to the models via generate, with sampling disabled.
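A minimal sketch of this setup, assuming a generic instruction-tuned causal LM from the HuggingFace Hub (the model id and prompt below are placeholders, not necessarily those used in the paper):

```python
# Sketch: loading a model with transformers and decoding without sampling.
# Model id and prompt are placeholders; requires transformers >= 4.43 and accelerate.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "meta-llama/Meta-Llama-3-8B-Instruct"  # placeholder model id
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id, torch_dtype=torch.bfloat16, device_map="auto"
)

prompt = (
    "Premise 1: GENE_A is a member of Pathway_X.\n"
    "Premise 2: Every member of Pathway_X is a member of Pathway_Y.\n"
    "Is the following conclusion true or false? GENE_A is a member of Pathway_Y.\n"
)
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)

# do_sample=False disables sampling; with default settings this is greedy decoding.
output = model.generate(**inputs, max_new_tokens=16, do_sample=False)
print(tokenizer.decode(output[0][inputs["input_ids"].shape[-1]:], skip_special_tokens=True))
```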


Parameters were set as follows (a configuration sketch is given after the list):


• Number of premises = 2: the number of valid premises per instance (factual argumentative text).


• Max distractors = 5: maximum number of distractors to be added per instance.


• Subset size = 200: maximum number of positive and negative instances for each scheme.


• Batch size = 20: number of instances evaluated simultaneously.


• ICL: whether the in-context learning prompt is used.


• Model: the LLM to be evaluated.
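The parameter list above maps naturally onto a small configuration object. The sketch below is illustrative only; the field names mirror the list, but the class itself and the example model id are assumptions.

```python
# Hypothetical configuration mirroring the parameters listed above (illustrative only).
from dataclasses import dataclass

@dataclass
class ExperimentConfig:
    num_premises: int = 2      # valid premises per instance
    max_distractors: int = 5   # maximum irrelevant premises added per instance
    subset_size: int = 200     # max positive / negative instances per scheme
    batch_size: int = 20       # instances evaluated simultaneously
    icl: bool = False          # use the in-context learning prompt?
    model: str = ""            # identifier of the LLM under evaluation

# Example: an ICL run for one model (placeholder id)
config = ExperimentConfig(icl=True, model="meta-llama/Meta-Llama-3-8B-Instruct")
print(config)
```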


For both Task 1 and Task 2, each model was analyzed across all 28 syllogistic schemes, with responses evaluated over a total of 11,200 instances (28 schemes × (200 positive + 200 negative)). The exceptions were Generalized Modus Ponens - complex predicates (18 instances), Hypothetical Syllogism 1 - base (202 instances), and Generalized Contraposition - negation (202 instances), where the number of factually distinct instances was limited by the ground-truth data. For both tasks, this amounted to a total of 211,840 instances; performance was compared between zero-shot (ZS) and in-context learning (ICL) settings, yielding a total of 423,680 prompts.


When examining the impact of distractors, responses were analyzed across five variants of each argument scheme, reflecting different numbers of irrelevant premises (from 1 to 5 distractors). For each scheme and model, this resulted in 1,000 responses (200 prompts per scheme × 5 variants), with the exception of the aforementioned cases. Across all schemes, this totaled 26,480 responses for a single model, and 211,840 responses were analyzed across all models for each number of distractors, totaling 1,059,200 responses. Again, comparisons were made between the ZS and ICL settings, doubling this number.
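As a hedged sketch of this manipulation (the premise texts below are invented, and the assumption that distractors are shuffled in with the valid premises is ours), one variant per distractor count could be built as follows:

```python
# Sketch: building five variants of one instance with 1..5 irrelevant premises.
# Premise texts are invented; shuffling distractors among valid premises is an assumption.
import random

valid_premises = [
    "GENE_A is a member of Pathway_X.",
    "Every member of Pathway_X is a member of Pathway_Y.",
]
distractor_pool = [f"GENE_D{i} is a member of Pathway_Z{i}." for i in range(1, 6)]

variants = {}
for k in range(1, 6):                      # 1 to 5 distractors, as in the experiments
    premises = valid_premises + random.sample(distractor_pool, k)
    random.shuffle(premises)
    variants[k] = premises

for k, premises in variants.items():
    print(f"{k} distractor(s):", premises)
```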


For both Task 1 and Task 2, performance was compared using actual gene entity names versus synthetic names across two selected schemes. Each scheme was tested with 400 instances (200 positive, 200 negative), resulting in 12,800 instances per task (400 instances × 2 schemes × 2 entity types × 8 models) and a direct comparison of model performance on factual versus synthetic gene names.
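A minimal sketch of the real-versus-synthetic comparison, assuming the synthetic condition simply substitutes gene symbols in otherwise identical instance text (the gene, pathway, and mapping below are made up for illustration):

```python
# Sketch: generating a synthetic-name counterpart of an instance by substituting gene symbols.
# The gene name, pathway name, and mapping are illustrative only.
real_instance = "Premise: BRCA1 is a member of the DNA repair pathway."

synthetic_map = {"BRCA1": "GENE_017"}      # hypothetical synthetic identifier

synthetic_instance = real_instance
for real_name, synthetic_name in synthetic_map.items():
    synthetic_instance = synthetic_instance.replace(real_name, synthetic_name)

print(real_instance)
print(synthetic_instance)
```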


The experiments were run on a machine with an AMD EPYC 7413 24-core CPU, 128 GB of RAM, and 2 × NVIDIA A100-SXM4-80GB GPUs.


Authors:

(1) Magdalena Wysocka, National Biomarker Centre, CRUK-MI, Univ. of Manchester, United Kingdom;

(2) Danilo S. Carvalho, National Biomarker Centre, CRUK-MI, Univ. of Manchester, United Kingdom and Department of Computer Science, Univ. of Manchester, United Kingdom;

(3) Oskar Wysocki, National Biomarker Centre, CRUK-MI, Univ. of Manchester, United Kingdom;

(4) Marco Valentino, Idiap Research Institute, Switzerland;

(5) André Freitas, National Biomarker Centre, CRUK-MI, Univ. of Manchester, United Kingdom, Department of Computer Science, Univ. of Manchester, United Kingdom and Idiap Research Institute, Switzerland.


This paper is available on arXiv under the CC BY-NC-SA 4.0 license.