Authors:
(1) Soyeong Jeong, School of Computing;
(2) Jinheon Baek, Graduate School of AI;
(3) Sukmin Cho, School of Computing;
(4) Sung Ju Hwang, Korea Advanced Institute of Science and Technology;
(5) Jong C. Park, School of Computing.
3 Method and 3.1 Preliminaries
3.2 Adaptive-RAG: Adaptive Retrieval-Augmented Generation
4 Experimental Setups and 4.1 Datasets
4.2 Models and 4.3 Evaluation Metrics
5 Experimental Results and Analyses
6 Conclusion, Limitations, Ethics Statement, Acknowledgements, and References
A Additional Experimental Setups
B Additional Experimental Results
For a fair comparison, and following Mallen et al. (2023) and Trivedi et al. (2023), we use the same retriever across all models: the term-based sparse retrieval model BM25 (Robertson et al., 1994). For the external document corpus, we use different sources depending on the dataset type: the Wikipedia corpus preprocessed by Karpukhin et al. (2020) for the single-hop datasets, and the corpus preprocessed by Trivedi et al. (2023) for the multi-hop datasets.

Regarding the LLMs used to generate answers, we use the FLAN-T5 series models (Chung et al., 2022), namely XL with 3B parameters and XXL with 11B parameters, as well as the GPT-3.5 model (gpt-3.5-turbo-instruct). For the retrieval-augmented LLM design, we follow the implementation details from Trivedi et al. (2023), including the input prompts, instructions, and the number of test samples used for evaluation (e.g., 500 samples per dataset).

In our Adaptive-RAG, we use and train the T5-Large model (Raffel et al., 2020) as the query-complexity classifier. Specifically, the classifier is trained for up to 100 training iterations, and we use the epoch that shows the best performance on the validation set, with a learning rate of 3e-5 and AdamW (Loshchilov and Hutter, 2019) as the optimizer. Regarding its training data, we sample and annotate 400 queries from the six datasets based on their inductive biases (single-hop datasets for the one-step approach and multi-hop datasets for the multi-step approach). In addition, we use the predicted outcomes of the three different strategies over 400 queries sampled from each dataset. Note that the queries used for classifier training do not overlap with the test queries for QA.
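To make the shared retrieval step concrete, here is a minimal sketch of BM25 retrieval using the rank_bm25 package. The paper specifies BM25 (Robertson et al., 1994) over the Wikipedia-based corpora above, but the choice of package, tokenization, and toy corpus here are illustrative assumptions, not the authors' released setup.

```python
# Minimal BM25 retrieval sketch (assumption: rank_bm25 package; the paper only
# specifies BM25 over Wikipedia-based corpora, not a particular implementation).
from rank_bm25 import BM25Okapi

corpus = [
    "Paris is the capital and most populous city of France.",
    "Blade Runner is a 1982 science fiction film directed by Ridley Scott.",
    "George Orwell was an English novelist, essayist, and critic.",
]
tokenized_corpus = [doc.lower().split() for doc in corpus]  # naive whitespace tokenization
bm25 = BM25Okapi(tokenized_corpus)

query = "Who directed Blade Runner?"
top_docs = bm25.get_top_n(query.lower().split(), corpus, n=2)
print(top_docs)  # documents that would be passed to the LLM for retrieval-augmented generation
```

In practice, the same retriever is shared across all compared models so that differences in QA performance come from the generation strategy rather than retrieval quality.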
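The following sketch illustrates how the T5-Large query-complexity classifier could be fine-tuned with the stated hyperparameters (learning rate 3e-5, AdamW). The label names ("A", "B", "C" for the three strategies), the data format, and the training loop structure are assumptions for illustration, not the authors' released code; validation-based epoch selection is omitted.

```python
# Hypothetical fine-tuning sketch for the query-complexity classifier.
# Assumptions: seq2seq-style label generation with labels "A"/"B"/"C" and a toy dataset.
import torch
from torch.optim import AdamW
from transformers import T5ForConditionalGeneration, T5Tokenizer

tokenizer = T5Tokenizer.from_pretrained("t5-large")
model = T5ForConditionalGeneration.from_pretrained("t5-large")
optimizer = AdamW(model.parameters(), lr=3e-5)  # learning rate and optimizer from the paper

# Toy (query, complexity label) pairs; real labels come from dataset inductive bias
# and from the predicted outcomes of the three strategies.
train_pairs = [
    ("What is the capital of France?", "A"),                                  # no retrieval
    ("Who wrote the novel that inspired Blade Runner?", "B"),                 # single-step retrieval
    ("Which city hosted the Olympics the year the author of 1984 died?", "C"),  # multi-step retrieval
]

model.train()
for step in range(100):  # up to 100 training iterations; best epoch chosen on validation in the paper
    for query, label in train_pairs:
        inputs = tokenizer(query, return_tensors="pt", truncation=True)
        targets = tokenizer(label, return_tensors="pt").input_ids
        loss = model(**inputs, labels=targets).loss
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
```

At inference time, the predicted label would route each incoming query to the no-retrieval, single-step, or multi-step strategy accordingly.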
This paper is available on arXiv under the CC0 1.0 DEED license.