Smart Queries, Smarter Answers: Adaptive-RAG in Action

Authors: (1) Soyeong Jeong, School of Computing; (2) Jinheon Baek, Graduate School of AI; (3) Sukmin Cho, School of Computing; (4) Sung Ju Hwang, Korea Advanced Institute of Science and Technology; (5) Jong C. Park, School of Computing.

Table of Links

Abstract and 1. Introduction
2 Related Work
3 Method and 3.1 Preliminaries
3.2 Adaptive-RAG: Adaptive Retrieval-Augmented Generation
4 Experimental Setups and 4.1 Datasets
4.2 Models and 4.3 Evaluation Metrics
4.4 Implementation Details
5 Experimental Results and Analyses
6 Conclusion, Limitations, Ethics Statement, Acknowledgements, and References
A Additional Experimental Setups
B Additional Experimental Results

6 Conclusion

In this work, we proposed the Adaptive Retrieval-Augmented Generation framework, referred to as Adaptive-RAG, to handle queries of various complexities. Specifically, Adaptive-RAG dynamically adjusts its query-handling strategy within a unified retrieval-augmented LLM based on the complexity of each query it encounters. The strategies span a spectrum: a non-retrieval approach for the most straightforward queries, a single-step approach for queries of moderate complexity, and a multi-step approach for complex queries.
The core step of Adaptive-RAG is determining the complexity of the given query, which is instrumental in selecting the most suitable strategy for answering it. To operationalize this process, we trained a smaller Language Model on query-complexity pairs, which are automatically annotated from the predicted outcomes and the inductive biases in datasets. We validated Adaptive-RAG on a collection of open-domain QA datasets covering multiple query complexities, including both single- and multi-hop questions. The results demonstrate that Adaptive-RAG enhances the overall accuracy and efficiency of QA systems, allocating more resources to complex queries while handling simpler queries efficiently, compared to existing one-size-fits-all approaches that tend to be either minimalist or maximalist across varying query complexities.

Limitations

While Adaptive-RAG shows clear advantages in effectiveness and efficiency by determining the query complexity and then applying the most suitable approach for tackling it, there remain avenues for improving the classifier in terms of its training datasets and architecture. Specifically, as no datasets are available for training the query-complexity classifier, we automatically create new data based on the model prediction outcomes and the inductive dataset biases. However, our labeling process is only one specific way of labeling query complexity, and despite its effectiveness it may label some queries incorrectly. Therefore, future work may create new datasets that are annotated with a diverse range of query complexities, in addition to the labels of question-answer pairs. Also, as the performance gap between the ideal classifier in Table 1 and the current classifier in Figure 3 indicates, there is still room to improve the effectiveness of the classifier.
In other words, our classifier design based on the smaller LM is an initial, simplest instantiation for classifying query complexity; building on it, future work may improve the classifier architecture and its performance, which will in turn improve the overall QA performance.

Ethics Statement

The experimental results on Adaptive-RAG validate its applicability in realistic scenarios, where a wide range of diverse user queries exist. Nonetheless, given the potential diversity of real-world user inputs, it is crucial to also consider scenarios where these inputs might be offensive or harmful. Such inputs could lead to the retrieval of offensive documents and the generation of inappropriate responses by retrieval-augmented LLMs. To address this challenge, it is essential to develop methods that detect and manage offensive or inappropriate content in both user inputs and retrieved documents within the retrieval-augmented framework. We believe that this is a critical area for future work.

Acknowledgements

This work was supported by Institute for Information and communications Technology Promotion (IITP) grant funded by the Korea government (No. 2018-0-00582, Prediction and augmentation of the credibility distribution via linguistic analysis and automated evidence document collection), Basic Science Research Program through the National Research Foundation of Korea (NRF) funded by the Ministry of Education (RS-2023-00275747), and the Artificial intelligence industrial convergence cluster development project funded by the Ministry of Science and ICT (MSIT, Korea) & Gwangju Metropolitan City.

References

Rohan Anil, Andrew M. Dai, Orhan Firat, Melvin Johnson, Dmitry Lepikhin, Alexandre Passos, Siamak Shakeri, Emanuel Taropa, Paige Bailey, Zhifeng Chen, Eric Chu, Jonathan H.
Clark, Laurent El Shafey, Yanping Huang, Kathy Meier-Hellstern, Gaurav Mishra, Erica Moreira, Mark Omernick, Kevin Robinson, Sebastian Ruder, Yi Tay, Kefan Xiao, Yuanzhong Xu, Yujing Zhang, Gustavo Hernández Ábrego, Junwhan Ahn, Jacob Austin, Paul Barham, Jan A. Botha, James Bradbury, Siddhartha Brahma, Kevin Brooks, Michele Catasta, Yong Cheng, Colin Cherry, Christopher A. Choquette-Choo, Aakanksha Chowdhery, Clément Crepy, Shachi Dave, Mostafa Dehghani, Sunipa Dev, Jacob Devlin, Mark Díaz, Nan Du, Ethan Dyer, Vladimir Feinberg, Fangxiaoyu Feng, Vlad Fienber, Markus Freitag, Xavier Garcia, Sebastian Gehrmann, Lucas Gonzalez, et al. 2023. PaLM 2 technical report. arXiv preprint arXiv:2305.10403.

Akari Asai, Zeqiu Wu, Yizhong Wang, Avirup Sil, and Hannaneh Hajishirzi. 2024. Self-RAG: Learning to retrieve, generate, and critique through self-reflection. In The Twelfth International Conference on Learning Representations.

Jinheon Baek, Soyeong Jeong, Minki Kang, Jong Park, and Sung Ju Hwang. 2023. Knowledge-augmented language model verification. In Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing, EMNLP 2023, Singapore, December 6-10, 2023, pages 1720–1736. Association for Computational Linguistics.

Sebastian Borgeaud, Arthur Mensch, Jordan Hoffmann, Trevor Cai, Eliza Rutherford, Katie Millican, George van den Driessche, Jean-Baptiste Lespiau, Bogdan Damoc, Aidan Clark, Diego de Las Casas, Aurelia Guy, Jacob Menick, Roman Ring, Tom Hennigan, Saffron Huang, Loren Maggiore, Chris Jones, Albin Cassirer, Andy Brock, Michela Paganini, Geoffrey Irving, Oriol Vinyals, Simon Osindero, Karen Simonyan, Jack W. Rae, Erich Elsen, and Laurent Sifre. 2022. Improving language models by retrieving from trillions of tokens. In International Conference on Machine Learning, ICML 2022, 17-23 July 2022, Baltimore, Maryland, USA, volume 162 of Proceedings of Machine Learning Research, pages 2206–2240. PMLR.

Tom B.
Brown, Benjamin Mann, Nick Ryder, Melanie Subbiah, Jared Kaplan, Prafulla Dhariwal, Arvind Neelakantan, Pranav Shyam, Girish Sastry, Amanda Askell, Sandhini Agarwal, Ariel Herbert-Voss, Gretchen Krueger, Tom Henighan, Rewon Child, Aditya Ramesh, Daniel M. Ziegler, Jeffrey Wu, Clemens Winter, Christopher Hesse, Mark Chen, Eric Sigler, Mateusz Litwin, Scott Gray, Benjamin Chess, Jack Clark, Christopher Berner, Sam McCandlish, Alec Radford, Ilya Sutskever, and Dario Amodei. 2020. Language models are few-shot learners. In Advances in Neural Information Processing Systems 33: Annual Conference on Neural Information Processing Systems 2020, NeurIPS 2020, December 6-12, 2020, virtual.

Danqi Chen, Adam Fisch, Jason Weston, and Antoine Bordes. 2017. Reading Wikipedia to answer open-domain questions. In Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics, ACL 2017, Vancouver, Canada, July 30 - August 4, Volume 1: Long Papers, pages 1870–1879. Association for Computational Linguistics.

Sukmin Cho, Jeongyeon Seo, Soyeong Jeong, and Jong C. Park. 2023. Improving zero-shot reader by reducing distractions from irrelevant documents in open-domain question answering. In Findings of the Association for Computational Linguistics: EMNLP 2023, Singapore, December 6-10, 2023, pages 3145–3157. Association for Computational Linguistics.

Hyung Won Chung, Le Hou, Shayne Longpre, Barret Zoph, Yi Tay, William Fedus, Eric Li, Xuezhi Wang, Mostafa Dehghani, Siddhartha Brahma, Albert Webson, Shixiang Shane Gu, Zhuyun Dai, Mirac Suzgun, Xinyun Chen, Aakanksha Chowdhery, Sharan Narang, Gaurav Mishra, Adams Yu, Vincent Y. Zhao, Yanping Huang, Andrew M. Dai, Hongkun Yu, Slav Petrov, Ed H. Chi, Jeff Dean, Jacob Devlin, Adam Roberts, Denny Zhou, Quoc V. Le, and Jason Wei. 2022. Scaling instruction-finetuned language models. arXiv preprint arXiv:2210.11416.

Xanh Ho, Anh-Khoa Duong Nguyen, Saku Sugawara, and Akiko Aizawa. 2020.
Constructing a multi-hop QA dataset for comprehensive evaluation of reasoning steps. In Proceedings of the 28th International Conference on Computational Linguistics, COLING 2020, Barcelona, Spain (Online), December 8-13, 2020, pages 6609–6625. International Committee on Computational Linguistics.

Gautier Izacard and Edouard Grave. 2021. Leveraging passage retrieval with generative models for open domain question answering. In Proceedings of the 16th Conference of the European Chapter of the Association for Computational Linguistics: Main Volume, EACL 2021, Online, April 19 - 23, 2021, pages 874–880. Association for Computational Linguistics.

Gautier Izacard, Patrick S. H. Lewis, Maria Lomeli, Lucas Hosseini, Fabio Petroni, Timo Schick, Jane Dwivedi-Yu, Armand Joulin, Sebastian Riedel, and Edouard Grave. 2023. Atlas: Few-shot learning with retrieval augmented language models. J. Mach. Learn. Res., 24:251:1–251:43.

Soyeong Jeong, Jinheon Baek, Sukmin Cho, Sung Ju Hwang, and Jong Park. 2023. Test-time self-adaptive small language models for question answering. In Findings of the Association for Computational Linguistics: EMNLP 2023, Singapore, December 6-10, 2023, pages 15459–15469. Association for Computational Linguistics.

Zhengbao Jiang, Frank F. Xu, Luyu Gao, Zhiqing Sun, Qian Liu, Jane Dwivedi-Yu, Yiming Yang, Jamie Callan, and Graham Neubig. 2023. Active retrieval augmented generation. In EMNLP 2023.

Mandar Joshi, Eunsol Choi, Daniel S. Weld, and Luke Zettlemoyer. 2017. TriviaQA: A large scale distantly supervised challenge dataset for reading comprehension. In Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics, ACL 2017, Vancouver, Canada, July 30 - August 4, Volume 1: Long Papers, pages 1601–1611. Association for Computational Linguistics.

Vladimir Karpukhin, Barlas Oguz, Sewon Min, Patrick S. H. Lewis, Ledell Wu, Sergey Edunov, Danqi Chen, and Wen-tau Yih. 2020. Dense passage retrieval for open-domain question answering.
In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing, EMNLP 2020, November 16-20, 2020. Association for Computational Linguistics.

Jungo Kasai, Keisuke Sakaguchi, Yoichi Takahashi, Ronan Le Bras, Akari Asai, Xinyan Yu, Dragomir R. Radev, Noah A. Smith, Yejin Choi, and Kentaro Inui. 2022. Realtime QA: what’s the answer right now? arXiv preprint arXiv:2207.13332.

Omar Khattab, Keshav Santhanam, Xiang Lisa Li, David Hall, Percy Liang, Christopher Potts, and Matei Zaharia. 2022. Demonstrate-search-predict: Composing retrieval and language models for knowledge-intensive NLP. arXiv preprint arXiv:2212.14024.

Tushar Khot, Harsh Trivedi, Matthew Finlayson, Yao Fu, Kyle Richardson, Peter Clark, and Ashish Sabharwal. 2023. Decomposed prompting: A modular approach for solving complex tasks. In The Eleventh International Conference on Learning Representations, ICLR 2023, Kigali, Rwanda, May 1-5, 2023. OpenReview.net.

Tom Kwiatkowski, Jennimaria Palomaki, Olivia Redfield, Michael Collins, Ankur Parikh, Chris Alberti, Danielle Epstein, Illia Polosukhin, Jacob Devlin, Kenton Lee, Kristina Toutanova, Llion Jones, Matthew Kelcey, Ming-Wei Chang, Andrew M. Dai, Jakob Uszkoreit, Quoc Le, and Slav Petrov. 2019. Natural questions: A benchmark for question answering research. Transactions of the Association for Computational Linguistics, 7:452–466.

Angeliki Lazaridou, Elena Gribovskaya, Wojciech Stokowiec, and Nikolai Grigorev. 2022. Internet-augmented language models through few-shot prompting for open-domain question answering. arXiv preprint arXiv:2203.05115.

Belinda Z. Li, Sewon Min, Srinivasan Iyer, Yashar Mehdad, and Wen-tau Yih. 2020. Efficient one-pass end-to-end entity linking for questions. In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing, EMNLP 2020, Online, November 16-20, 2020, pages 6433–6441. Association for Computational Linguistics.

Ilya Loshchilov and Frank Hutter. 2019.
Decoupled weight decay regularization. In 7th International Conference on Learning Representations, ICLR 2019, New Orleans, LA, USA, May 6-9, 2019. OpenReview.net.

Alex Mallen, Akari Asai, Victor Zhong, Rajarshi Das, Daniel Khashabi, and Hannaneh Hajishirzi. 2023. When not to trust language models: Investigating effectiveness of parametric and non-parametric memories. In Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), ACL 2023, Toronto, Canada, July 9-14, 2023, pages 9802–9822. Association for Computational Linguistics.

OpenAI. 2023. GPT-4 technical report. arXiv preprint arXiv:2303.08774.

Adam Paszke, Sam Gross, Francisco Massa, Adam Lerer, James Bradbury, Gregory Chanan, Trevor Killeen, Zeming Lin, Natalia Gimelshein, Luca Antiga, Alban Desmaison, Andreas Köpf, Edward Z. Yang, Zachary DeVito, Martin Raison, Alykhan Tejani, Sasank Chilamkurthy, Benoit Steiner, Lu Fang, Junjie Bai, and Soumith Chintala. 2019. PyTorch: An imperative style, high-performance deep learning library. In Advances in Neural Information Processing Systems 32: Annual Conference on Neural Information Processing Systems 2019, pages 8024–8035.

Jayr Alencar Pereira, Robson do Nascimento Fidalgo, Roberto de Alencar Lotufo, and Rodrigo Frassetto Nogueira. 2023. Visconde: Multi-document QA with GPT-3 and neural reranking. In Advances in Information Retrieval - 45th European Conference on Information Retrieval, ECIR 2023, Dublin, Ireland, April 2-6, 2023, Proceedings, Part II, volume 13981 of Lecture Notes in Computer Science, pages 534–543. Springer.

Ofir Press, Muru Zhang, Sewon Min, Ludwig Schmidt, Noah A. Smith, and Mike Lewis. 2023. Measuring and narrowing the compositionality gap in language models. In Findings of the Association for Computational Linguistics: EMNLP 2023.

Peng Qi, Haejun Lee, Tg Sido, and Christopher D. Manning. 2021. Answering open-domain questions of varying reasoning steps from text.
In Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing, EMNLP 2021, Virtual Event / Punta Cana, Dominican Republic, 7-11 November, 2021, pages 3599–3614. Association for Computational Linguistics.

Colin Raffel, Noam Shazeer, Adam Roberts, Katherine Lee, Sharan Narang, Michael Matena, Yanqi Zhou, Wei Li, and Peter J. Liu. 2020. Exploring the limits of transfer learning with a unified text-to-text transformer. J. Mach. Learn. Res., 21:140:1–140:67.

Pranav Rajpurkar, Jian Zhang, Konstantin Lopyrev, and Percy Liang. 2016. SQuAD: 100,000+ questions for machine comprehension of text. In Proceedings of the 2016 Conference on Empirical Methods in Natural Language Processing, EMNLP 2016, Austin, Texas, USA, November 1-4, 2016, pages 2383–2392. The Association for Computational Linguistics.

Ori Ram, Yoav Levine, Itay Dalmedigos, Dor Muhlgay, Amnon Shashua, Kevin Leyton-Brown, and Yoav Shoham. 2023. In-context retrieval-augmented language models. Transactions of the Association for Computational Linguistics.

Stephen E. Robertson, Steve Walker, Susan Jones, Micheline Hancock-Beaulieu, and Mike Gatford. 1994. Okapi at TREC-3. In Proceedings of The Third Text REtrieval Conference, TREC 1994, Gaithersburg, Maryland, USA, November 2-4, 1994, volume 500-225 of NIST Special Publication, pages 109–126. National Institute of Standards and Technology (NIST).

Weijia Shi, Sewon Min, Michihiro Yasunaga, Minjoon Seo, Rich James, Mike Lewis, Luke Zettlemoyer, and Wen-tau Yih. 2023. REPLUG: retrieval-augmented black-box language models. arXiv preprint arXiv:2301.12652.
Hugo Touvron, Louis Martin, Kevin Stone, Peter Albert, Amjad Almahairi, Yasmine Babaei, Nikolay Bashlykov, Soumya Batra, Prajjwal Bhargava, Shruti Bhosale, Dan Bikel, Lukas Blecher, Cristian Canton-Ferrer, Moya Chen, Guillem Cucurull, David Esiobu, Jude Fernandes, Jeremy Fu, Wenyin Fu, Brian Fuller, Cynthia Gao, Vedanuj Goswami, Naman Goyal, Anthony Hartshorn, Saghar Hosseini, Rui Hou, Hakan Inan, Marcin Kardas, Viktor Kerkez, Madian Khabsa, Isabel Kloumann, Artem Korenev, Punit Singh Koura, Marie-Anne Lachaux, Thibaut Lavril, Jenya Lee, Diana Liskovich, Yinghai Lu, Yuning Mao, Xavier Martinet, Todor Mihaylov, Pushkar Mishra, Igor Molybog, Yixin Nie, Andrew Poulton, Jeremy Reizenstein, Rashi Rungta, Kalyan Saladi, Alan Schelten, Ruan Silva, Eric Michael Smith, Ranjan Subramanian, Xiaoqing Ellen Tan, Binh Tang, Ross Taylor, Adina Williams, Jian Xiang Kuan, Puxin Xu, Zheng Yan, Iliyan Zarov, Yuchen Zhang, Angela Fan, Melanie Kambadur, Sharan Narang, Aurélien Rodriguez, Robert Stojnic, Sergey Edunov, and Thomas Scialom. 2023. Llama 2: Open foundation and finetuned chat models. arXiv preprint arXiv:2307.09288.

Harsh Trivedi, Niranjan Balasubramanian, Tushar Khot, and Ashish Sabharwal. 2022a. MuSiQue: Multihop questions via single-hop question composition. Trans. Assoc. Comput. Linguistics, 10:539–554.

Harsh Trivedi, Niranjan Balasubramanian, Tushar Khot, and Ashish Sabharwal. 2022b. ♪ MuSiQue: Multihop questions via single-hop question composition. Transactions of the Association for Computational Linguistics, 10:539–554.

Harsh Trivedi, Niranjan Balasubramanian, Tushar Khot, and Ashish Sabharwal. 2023. Interleaving retrieval with chain-of-thought reasoning for knowledge-intensive multi-step questions. In Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), ACL 2023, Toronto, Canada, July 9-14, 2023, pages 10014–10037. Association for Computational Linguistics.
Jason Wei, Yi Tay, Rishi Bommasani, Colin Raffel, Barret Zoph, Sebastian Borgeaud, Dani Yogatama, Maarten Bosma, Denny Zhou, Donald Metzler, Ed H. Chi, Tatsunori Hashimoto, Oriol Vinyals, Percy Liang, Jeff Dean, and William Fedus. 2022a. Emergent abilities of large language models. Trans. Mach. Learn. Res., 2022.

Jason Wei, Xuezhi Wang, Dale Schuurmans, Maarten Bosma, Brian Ichter, Fei Xia, Ed H. Chi, Quoc V. Le, and Denny Zhou. 2022b. Chain-of-thought prompting elicits reasoning in large language models. In NeurIPS.

Thomas Wolf, Lysandre Debut, Victor Sanh, Julien Chaumond, Clement Delangue, Anthony Moi, Pierric Cistac, Tim Rault, Rémi Louf, Morgan Funtowicz, Joe Davison, Sam Shleifer, Patrick von Platen, Clara Ma, Yacine Jernite, Julien Plu, Canwen Xu, Teven Le Scao, Sylvain Gugger, Mariama Drame, Quentin Lhoest, and Alexander M. Rush. 2020. Transformers: State-of-the-art natural language processing. In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing: System Demonstrations, EMNLP 2020 - Demos, pages 38–45. Association for Computational Linguistics.

Lee Xiong, Chenyan Xiong, Ye Li, Kwok-Fung Tang, Jialin Liu, Paul N. Bennett, Junaid Ahmed, and Arnold Overwijk. 2021. Approximate nearest neighbor negative contrastive learning for dense text retrieval. In 9th International Conference on Learning Representations, ICLR 2021, Virtual Event, Austria, May 3-7, 2021. OpenReview.net.

Wei Yang, Yuqing Xie, Aileen Lin, Xingyu Li, Luchen Tan, Kun Xiong, Ming Li, and Jimmy Lin. 2019. End-to-end open-domain question answering with BERTserini. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, NAACL-HLT 2019, Minneapolis, MN, USA, June 2-7, 2019, Demonstrations, pages 72–77. Association for Computational Linguistics.

Zhilin Yang, Peng Qi, Saizheng Zhang, Yoshua Bengio, William Cohen, Ruslan Salakhutdinov, and Christopher D. Manning. 2018.
HotpotQA: A dataset for diverse, explainable multi-hop question answering. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, pages 2369–2380, Brussels, Belgium. Association for Computational Linguistics.

Shunyu Yao, Jeffrey Zhao, Dian Yu, Nan Du, Izhak Shafran, Karthik R. Narasimhan, and Yuan Cao. 2023. ReAct: Synergizing reasoning and acting in language models. In The Eleventh International Conference on Learning Representations, ICLR 2023, Kigali, Rwanda, May 1-5, 2023. OpenReview.net.

Fengbin Zhu, Wenqiang Lei, Chao Wang, Jianming Zheng, Soujanya Poria, and Tat-Seng Chua. 2021. Retrieving and reading: A comprehensive survey on open-domain question answering. arXiv preprint arXiv:2101.00774.

A Additional Experimental Setups

A.1 Datasets

We use publicly available datasets for both single-hop and multi-hop QA, following Karpukhin et al. (2020) and Trivedi et al. (2023), respectively. We describe the characteristics of each dataset:

SQuAD v1.1 (Rajpurkar et al., 2016) is created through a process where annotators write questions based on the documents they read.
Natural Questions (Kwiatkowski et al., 2019) is constructed from real user queries on Google Search.
TriviaQA (Joshi et al., 2017) comprises trivia questions sourced from various quiz websites.
MuSiQue (Trivedi et al., 2022a) is collected by composing multiple single-hop queries to form queries spanning 2-4 hops.
HotpotQA (Yang et al., 2018) is constructed by having annotators create questions that link multiple Wikipedia articles.
2WikiMultiHopQA (Ho et al., 2020) is derived from Wikipedia and its associated knowledge graph paths, and requires two hops.
A.2 Models

We describe the details of the models as follows:

No Retrieval. This approach uses only the LLM itself to generate the answer to the given query.
Single-step Approach. This approach first retrieves the relevant knowledge for the given query from the external knowledge source and then augments the LLM with this retrieved knowledge to generate the answer; retrieval is performed only once.
Adaptive Retrieval. This baseline (Mallen et al., 2023) augments the LLM with the retrieval module only when the entities appearing in the query are less popular. To extract entities from questions, we use an available entity-linking method (Li et al., 2020), namely BLINK.
Self-RAG. This baseline (Asai et al., 2024) trains the LLM to adaptively perform retrieval and generation: retrieval is conducted when its prediction for the special retrieval token exceeds a certain threshold, and answer generation follows.
Adaptive-RAG. This is our model, which adaptively selects the retrieval-augmented generation strategy, switching seamlessly among the no-retrieval, single-step, and multi-step approaches[4] without architectural changes, based on the query complexity assessed by the classifier.
Multi-step Approach. This approach (Trivedi et al., 2023) is the multi-step retrieval-augmented LLM, which iteratively accesses both the retriever and the LLM with interleaved Chain-of-Thought reasoning (Wei et al., 2022b) until it derives the solution or reaches the maximum number of steps.
Adaptive-RAG w/ Oracle. This is an ideal scenario in which Adaptive-RAG is equipped with an oracle classifier that perfectly categorizes the query complexity.
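The strategy selection described for Adaptive-RAG can be pictured as a small dispatcher that routes each query to one of three answering functions. The following is a minimal sketch under stated assumptions, not the authors' implementation: the label names, `make_router`, the word-count `toy_classifier`, and the string-returning strategies are all hypothetical stand-ins chosen so the example runs without an LLM or retriever. In the paper the classifier is a trained smaller Language Model.

```python
from typing import Callable, Dict

# Hypothetical complexity labels; the names are illustrative only.
NO_RETRIEVAL = "no_retrieval"
SINGLE_STEP = "single_step"
MULTI_STEP = "multi_step"

Answerer = Callable[[str], str]


def make_router(classify: Callable[[str], str],
                strategies: Dict[str, Answerer]) -> Answerer:
    """Build a function that answers a query with the strategy picked by `classify`."""
    def answer(query: str) -> str:
        label = classify(query)
        if label not in strategies:
            raise ValueError(f"unknown complexity label: {label!r}")
        return strategies[label](query)
    return answer


def toy_classifier(query: str) -> str:
    """Toy stand-in for the trained classifier: longer queries count as more complex."""
    n_words = len(query.split())
    if n_words <= 4:
        return NO_RETRIEVAL
    if n_words <= 10:
        return SINGLE_STEP
    return MULTI_STEP


# Placeholder strategies that only tag the query with the path taken.
toy_strategies: Dict[str, Answerer] = {
    NO_RETRIEVAL: lambda q: f"[LLM only] {q}",
    SINGLE_STEP: lambda q: f"[retrieve once, then LLM] {q}",
    MULTI_STEP: lambda q: f"[iterative retrieval + reasoning] {q}",
}

adaptive_rag = make_router(toy_classifier, toy_strategies)
print(adaptive_rag("Capital of France?"))
```

Because the classifier is passed in as an argument, the "w/ Oracle" variant corresponds to calling `make_router` with a function that returns gold complexity labels, leaving the strategies untouched.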
A.3 Implementation Details

For computing resources, we use A100 GPUs with 80GB memory. In addition, due to the significant costs associated with evaluating retrieval-augmented generation models, we perform experiments with a single run. Finally, we implemented the models using PyTorch (Paszke et al., 2019) and the Transformers library (Wolf et al., 2020).

B Additional Experimental Results

Performance vs Time. We further provide a comparison of different retrieval-augmented generation approaches with the FLAN-T5-XL and FLAN-T5-XXL models in Figure 4 and Figure 5, respectively, in the context of performance and efficiency trade-offs. Similar to the observation made with the GPT-3.5 model in Figure 1, our proposed Adaptive-RAG is significantly more effective as well as efficient.

Performance per Dataset. In addition to detailing the performance of each dataset with the FLAN-T5-XL model, as shown in Table 2, we also present the results for each dataset with the FLAN-T5-XXL and GPT-3.5 models in Table 2 and Table 8, respectively. The experimental results show that Adaptive-RAG consistently balances efficiency and accuracy. It is worth noting that while the GPT-3.5 model performs effectively in addressing straightforward queries even without document retrieval, it benefits significantly from Adaptive-RAG in terms of effectiveness when solving complex multi-hop queries.

This paper is available on arxiv under CC0 1.0 DEED license.
[4] For the multi-step approach, we use the state-of-the-art question answering strategy from IRCoT (Trivedi et al., 2023).
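The multi-step approach alternates retrieval and reasoning until an answer is derived or a step limit is reached. The loop below is a rough illustration of that control flow only. It is not IRCoT itself: the `Retriever` and `Reasoner` interfaces, the function names, and the tiny hand-written corpus are hypothetical, and the "reasoner" is a rule-based stub standing in for an LLM producing Chain-of-Thought steps.

```python
from typing import Callable, List, Optional, Tuple

Retriever = Callable[[str], List[str]]
# A reasoner receives the question and the evidence gathered so far and returns
# (next_thought, final_answer); final_answer stays None until it can answer.
Reasoner = Callable[[str, List[str]], Tuple[str, Optional[str]]]


def multi_step_answer(question: str, retrieve: Retriever, reason: Reasoner,
                      max_steps: int = 3) -> Optional[str]:
    """Alternate retrieval and reasoning until an answer appears or steps run out."""
    evidence: List[str] = []
    search_query = question
    for _ in range(max_steps):
        for doc in retrieve(search_query):
            if doc not in evidence:
                evidence.append(doc)
        thought, answer = reason(question, evidence)
        if answer is not None:
            return answer
        search_query = thought  # the latest reasoning step drives the next retrieval
    return None


# Toy data and stubs (assumptions for illustration only).
toy_corpus = {
    "Who directed Inception?": ["Inception was directed by Christopher Nolan."],
    "Where was Christopher Nolan born?": ["Christopher Nolan was born in London."],
}


def toy_retrieve(query: str) -> List[str]:
    return toy_corpus.get(query, [])


def toy_reason(question: str, evidence: List[str]) -> Tuple[str, Optional[str]]:
    text = " ".join(evidence)
    if "born in London" in text:
        return ("done", "London")
    if "Christopher Nolan" in text:
        return ("Where was Christopher Nolan born?", None)
    return ("Who directed Inception?", None)


toy_question = "Where was the director of Inception born?"
print(multi_step_answer(toy_question, toy_retrieve, toy_reason, max_steps=3))
```

With this toy setup the first pass retrieves nothing, the second finds the director, and the third finds the birthplace, so a limit of two steps returns `None`. This mirrors why a step budget trades answer coverage against cost.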