
Experimental Setups and Datasets


Authors:

(1) Soyeong Jeong, School of Computing;

(2) Jinheon Baek, Graduate School of AI;

(3) Sukmin Cho, School of Computing;

(4) Sung Ju Hwang, Korea Advanced Institute of Science and Technology;

(5) Jong C. Park, School of Computing.

Table of Links

Abstract and 1. Introduction

2 Related Work

3 Method and 3.1 Preliminaries

3.2 Adaptive-RAG: Adaptive Retrieval-Augmented Generation

4 Experimental Setups and 4.1 Datasets

4.2 Models and 4.3 Evaluation Metrics

4.4 Implementation Details

5 Experimental Results and Analyses

6 Conclusion, Limitations, Ethics Statement, Acknowledgements, and References


A Additional Experimental Setups

B Additional Experimental Results

4 Experimental Setups

In this section, we describe the datasets, models, evaluation metrics, and implementation details of our experiments. We provide additional details in Appendix A.

4.1 Datasets

To simulate a realistic scenario in which different queries have varying complexities, we use both single-hop and multi-hop QA datasets simultaneously, in a unified experimental setting.


Single-hop QA For simpler queries, we use three benchmark single-hop QA datasets, which consist of queries and their associated documents containing answers, namely 1) SQuAD v1.1 (Rajpurkar et al., 2016), 2) Natural Questions (Kwiatkowski et al., 2019), and 3) TriviaQA (Joshi et al., 2017).


Table 1: Averaged results on a collection of benchmark open-domain question answering datasets, including single-hop and multi-hop queries, with different LLMs. Self-RAG∗ is trained with a different base LLM, namely LLaMA2 (Touvron et al., 2023); therefore, we compare the results of FLAN-T5-XL (3B) with Self-RAG based on LLaMA2 (7B), and the results of the other models with Self-RAG based on LLaMA2 (13B). We emphasize our results in bold for easy comparison.


Multi-hop QA To consider more complex query scenarios, we use three benchmark multi-hop QA datasets, which require sequential reasoning over multiple documents, namely 1) MuSiQue (Trivedi et al., 2022a), 2) HotpotQA (Yang et al., 2018), and 3) 2WikiMultiHopQA (Ho et al., 2020).
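To make the unified setting concrete, the sketch below shows one way to pool the three single-hop and three multi-hop datasets into a single query collection, tagging each example with its source and hop type. This is a minimal illustration, not the authors' pipeline: it assumes the datasets have already been downloaded as local JSONL files with "question" and "answer" fields, and all file paths and field names are hypothetical.

```python
import json
import random

# Hypothetical local paths; the paper does not specify a file layout,
# so these names are illustrative only.
SINGLE_HOP_FILES = {
    "squad": "data/squad_v1.1.jsonl",
    "natural_questions": "data/natural_questions.jsonl",
    "triviaqa": "data/triviaqa.jsonl",
}
MULTI_HOP_FILES = {
    "musique": "data/musique.jsonl",
    "hotpotqa": "data/hotpotqa.jsonl",
    "2wikimultihopqa": "data/2wikimultihopqa.jsonl",
}

def load_jsonl(path):
    """Read one example per line; each line is assumed to contain
    at least a 'question' and an 'answer' field."""
    with open(path, encoding="utf-8") as f:
        return [json.loads(line) for line in f]

def build_unified_pool(seed=42):
    """Merge single-hop and multi-hop examples into one query pool,
    labeling each example with its source dataset and hop type."""
    pool = []
    for hop_type, files in (("single-hop", SINGLE_HOP_FILES),
                            ("multi-hop", MULTI_HOP_FILES)):
        for name, path in files.items():
            for ex in load_jsonl(path):
                pool.append({
                    "question": ex["question"],
                    "answer": ex["answer"],
                    "source": name,
                    "hop_type": hop_type,
                })
    random.Random(seed).shuffle(pool)
    return pool

if __name__ == "__main__":
    pool = build_unified_pool()
    print(f"{len(pool)} queries in the unified pool")
```

In such a setup, the hop_type label serves only as a reference for analysis; at inference time the system would receive the mixed queries without knowing in advance whether each one is single-hop or multi-hop.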


This paper is available on arXiv under the CC0 1.0 DEED license.

