
LLMs Rely on Contextual Knowledge Over Background Knowledge


Table of Links

  1. Abstract and Introduction
  2. SylloBio-NLI
  3. Empirical Evaluation
  4. Related Work
  5. Conclusions
  6. Limitations and References


A. Formalization of the SylloBio-NLI Resource Generation Process

B. Formalization of Tasks 1 and 2

C. Dictionary of gene and pathway membership

D. Domain-specific pipeline for creating NL instances and E Accessing LLMs

F. Experimental Details

G. Evaluation Metrics

H. Prompting LLMs - Zero-shot prompts

I. Prompting LLMs - Few-shot prompts

J. Results: Misaligned Instruction-Response

K. Results: Ambiguous Impact of Distractors on Reasoning

L. Results: Models Prioritize Contextual Knowledge Over Background Knowledge

M. Supplementary Figures and N. Supplementary Tables

L. Results: Models Prioritize Contextual Knowledge Over Background Knowledge

The lack of statistically significant differences in accuracy (Fig. 7) between the biologically factual and artificial datasets, across both generalized modus ponens and generalized modus tollens schemes, suggests that the models’ reasoning relies more on the stated contextual knowledge and logical structure than on pre-existing background knowledge. This holds for both accuracy and reasoning accuracy, and in both zero-shot (ZS) and few-shot (FS) settings: models that perform well on a given scheme maintain their performance when factual gene names are replaced by synthetic ones, and the same consistency is observed for models with weaker performance. The ability to maintain accuracy with synthetic gene names in the artificial set indicates that models can abstract and apply logical reasoning independently of their internal domain-specific knowledge.
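The claim above rests on testing whether per-model accuracies differ between the factual and synthetic-name datasets. One way to run such a paired comparison is an exact sign-flip permutation test on the mean accuracy difference; the sketch below is illustrative (the function name and any example accuracy values are assumptions, not taken from the paper, which does not specify its test in this section).

```python
import itertools
import statistics

def paired_permutation_test(acc_factual, acc_artificial):
    """Exact two-sided paired sign-flip permutation test.

    Each element of the inputs is one model's accuracy on the
    biologically factual / artificial (synthetic-name) dataset.
    Under H0 (no difference), each paired difference is equally
    likely to have either sign, so we enumerate all 2^n sign
    assignments and count how often the permuted |mean difference|
    is at least as large as the observed one.
    """
    diffs = [a - b for a, b in zip(acc_factual, acc_artificial)]
    observed = abs(statistics.mean(diffs))
    n = len(diffs)
    count = 0
    for signs in itertools.product([1, -1], repeat=n):
        permuted = abs(statistics.mean(s * d for s, d in zip(signs, diffs)))
        if permuted >= observed:
            count += 1
    return count / 2 ** n  # p-value: large => no significant difference
```

A large p-value from such a test, as in Fig. 7, is consistent with the models' accuracy being unaffected by the swap from factual to synthetic gene names. The exact enumeration is only practical for small model counts (2^n sign assignments); with many models one would sample random sign flips instead.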


Figure 7: Accuracy comparison between two datasets: Biologically Factual vs. Artificial with Synthetic Names for Task 1 (top) and Task 2 (bottom); ZS (left) and FS (right). Lines connect the accuracy for each model, with green indicating an increase and red indicating a decrease. Gray boxplots display the median, Q1, Q3, and the range (minimum to maximum) of the data.


Figure 8: Percentage distribution of model response types under zero-shot settings for prompts with no distractors for the set of biologically factual argumentative texts.


Authors:

(1) Magdalena Wysocka, National Biomarker Centre, CRUK-MI, Univ. of Manchester, United Kingdom;

(2) Danilo S. Carvalho, National Biomarker Centre, CRUK-MI, Univ. of Manchester, United Kingdom and Department of Computer Science, Univ. of Manchester, United Kingdom;

(3) Oskar Wysocki, National Biomarker Centre, CRUK-MI, Univ. of Manchester, United Kingdom;

(4) Marco Valentino, Idiap Research Institute, Switzerland;

(5) André Freitas, National Biomarker Centre, CRUK-MI, Univ. of Manchester, United Kingdom, Department of Computer Science, Univ. of Manchester, United Kingdom and Idiap Research Institute, Switzerland.


This paper is available on arxiv under CC BY-NC-SA 4.0 license.