A. Formalization of the SylloBio-NLI Resource Generation Process
B. Formalization of Tasks 1 and 2
C. Dictionary of gene and pathway membership
D. Domain-specific pipeline for creating NL instances
E. Accessing LLMs
H. Prompting LLMs - Zero-shot prompts
I. Prompting LLMs - Few-shot prompts
J. Results: Misaligned Instruction-Response
K. Results: Ambiguous Impact of Distractors on Reasoning
L. Results: Models Prioritize Contextual Knowledge Over Background Knowledge
M. Supplementary Figures
N. Supplementary Tables
Reactome[4] (version 88, March 2024) has entries for 11 226 protein-coding genes involved in 15 212 human reactions annotated from 38 549 literature references. These reactions are grouped into 2 698 pathways collected under 29 superpathways (e.g. Immune System) that describe normal cellular functions. Each superpathway is represented as a roughly circular ‘burst’: the central node corresponds to the top level of the Reactome event hierarchy, and concentric rings represent increasingly specific levels of that hierarchy (sub-pathways) (e.g. Disease → Diseases of signal transduction by growth factor receptors and second messengers → Signalling by EGFR in Cancer → Signalling by Ligand-Responsive EGFR Variants in Cancer → Constitutive Signalling by Ligand-Responsive EGFR Cancer Variants). The relationships between these pathways are captured through parent-child arcs, which reflect ontological "is-a" relationships.
Each of the 29 Reactome superpathways is organized as a roughly circular ‘burst’. We built the corpus from only one of them, Disease, which is the largest group of pathways. The central node of the Disease burst corresponds to the uppermost level of the Reactome event hierarchy (Table 2). Concentric rings of nodes around the central node represent successively more specific levels of the event hierarchy (e.g. Disease → Diseases of signal transduction by growth factor receptors and second messengers → Signalling by EGFR in Cancer → Signalling by Ligand-Responsive EGFR Variants in Cancer → Constitutive Signalling by Ligand-Responsive EGFR Cancer Variants). The arcs connecting nodes in successive rings within a burst represent parent-child (is-a) relationships in the event hierarchy. When a specific pathway is shared by more than one burst, arcs connect its nodes across those bursts. A node’s size is proportional to the number of physical entities (proteins, complexes, chemicals) it contains.
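The ring structure described above can be illustrated with a minimal Python sketch. It stores the example chain from the text as a child-to-parent map and recovers, for any pathway, its lineage back to the top-level node and the ring it would occupy. The names `PARENT`, `lineage`, and `ring` are hypothetical and introduced only for illustration. The sketch also assumes a single parent per pathway, which is a simplification: as noted above, a pathway in Reactome may be shared by more than one burst.

```python
# Hypothetical child -> parent map holding only the example chain quoted
# in the text. Assumption: one parent per pathway (Reactome itself allows
# a pathway to be shared between bursts).
PARENT = {
    "Diseases of signal transduction by growth factor receptors and second messengers":
        "Disease",
    "Signalling by EGFR in Cancer":
        "Diseases of signal transduction by growth factor receptors and second messengers",
    "Signalling by Ligand-Responsive EGFR Variants in Cancer":
        "Signalling by EGFR in Cancer",
    "Constitutive Signalling by Ligand-Responsive EGFR Cancer Variants":
        "Signalling by Ligand-Responsive EGFR Variants in Cancer",
}


def lineage(pathway, parent=PARENT):
    """Return the chain of pathways from the top-level node down to `pathway`."""
    chain = [pathway]
    seen = {pathway}
    # Follow is-a arcs upward until a node with no recorded parent is reached.
    while chain[-1] in parent:
        nxt = parent[chain[-1]]
        if nxt in seen:
            raise ValueError(f"cycle detected at {nxt!r}")
        chain.append(nxt)
        seen.add(nxt)
    chain.reverse()
    return chain


def ring(pathway, parent=PARENT):
    """Return the ring index of `pathway`; the central node is ring 0."""
    return len(lineage(pathway, parent)) - 1


if __name__ == "__main__":
    leaf = "Constitutive Signalling by Ligand-Responsive EGFR Cancer Variants"
    print(" → ".join(lineage(leaf)))
    print("ring:", ring(leaf))
```

Storing the hierarchy as child-to-parent links keeps the upward walk simple. Supporting pathways shared across bursts would require mapping each child to a set of parents instead.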
The dictionary of entity names and predicates based on the Reactome knowledgebase is available on GitHub.
Authors:
(1) Magdalena Wysocka, National Biomarker Centre, CRUK-MI, Univ. of Manchester, United Kingdom;
(2) Danilo S. Carvalho, National Biomarker Centre, CRUK-MI, Univ. of Manchester, United Kingdom and Department of Computer Science, Univ. of Manchester, United Kingdom;
(3) Oskar Wysocki, National Biomarker Centre, CRUK-MI, Univ. of Manchester, United Kingdom;
(4) Marco Valentino, Idiap Research Institute, Switzerland;
(5) André Freitas, National Biomarker Centre, CRUK-MI, Univ. of Manchester, United Kingdom, Department of Computer Science, Univ. of Manchester, United Kingdom and Idiap Research Institute, Switzerland.
This paper is available on arXiv under a CC BY-NC-SA 4.0 license.
[4] https://reactome.org