Authors:
(1) Soyeong Jeong, School of Computing;
(2) Jinheon Baek, Graduate School of AI;
(3) Sukmin Cho, School of Computing;
(4) Sung Ju Hwang, Korea Advanced Institute of Science and Technology;
(5) Jong C. Park, School of Computing.
3 Method and 3.1 Preliminaries
3.2 Adaptive-RAG: Adaptive Retrieval-Augmented Generation
4 Experimental Setups and 4.1 Datasets
4.2 Models and 4.3 Evaluation Metrics
5 Experimental Results and Analyses
6 Conclusion, Limitations, Ethics Statement, Acknowledgements, and References
A Additional Experimental Setups
B Additional Experimental Results
We compare our Adaptive-RAG against relevant models, including the three retrieval-augmented LLM strategies (described in Section 3.1) and the adaptive retrieval approaches (Mallen et al., 2023; Asai et al., 2024), which can be grouped into one of three categories: Simple, Adaptive, and Complex. Specifically, the Simple category includes the 1) No Retrieval and 2) Single-step Approach-based methods. The Adaptive category includes 3) Adaptive Retrieval (Mallen et al., 2023), 4) Self-RAG (Asai et al., 2024), and 5) our Adaptive-RAG, all of which can adaptively perform retrieval based on question complexity. For the Complex category, we use the 6) Multi-step Approach with the most sophisticated state-of-the-art method (Trivedi et al., 2023), which iteratively accesses both the retriever and the LLM with Chain-of-Thought reasoning (Wei et al., 2022b) for every query. Note that models across different categories are not directly comparable; yet, in an ideal setting, Adaptive approaches should be more effective than those in the Simple category while being more efficient than those in the Complex category. Therefore, we also report the performance in an ideal scenario, 7) Adaptive-RAG w/ Oracle, which pairs our Adaptive-RAG with an oracle complexity classifier.
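To make the category boundaries concrete, below is a minimal sketch of how an adaptive approach routes each query to one of the three strategies according to a predicted complexity label. This is an illustration under our reading of the setup, not the authors' released code: the strategy stubs, the 'A'/'B'/'C' labels, and the names adaptive_answer and classify_complexity are placeholders.

```python
from typing import Callable, Dict

# Illustrative stand-ins for the three strategies compared in the experiments.
# Each takes a question string and returns an answer string.
def no_retrieval_answer(question: str) -> str:
    # Simple: the LLM answers directly from its parametric knowledge.
    return f"<LLM-only answer to: {question}>"

def single_step_answer(question: str) -> str:
    # Simple: retrieve passages once, then generate an answer conditioned on them.
    return f"<single-step RAG answer to: {question}>"

def multi_step_answer(question: str) -> str:
    # Complex: interleave retrieval and Chain-of-Thought generation over multiple steps.
    return f"<multi-step RAG answer to: {question}>"

# Map predicted complexity labels to strategies: 'A' (answerable without retrieval),
# 'B' (needs a single retrieval step), 'C' (needs multi-step retrieval and reasoning).
STRATEGIES: Dict[str, Callable[[str], str]] = {
    "A": no_retrieval_answer,
    "B": single_step_answer,
    "C": multi_step_answer,
}

def adaptive_answer(question: str, classify_complexity: Callable[[str], str]) -> str:
    """Adaptive routing: pick a strategy based on the predicted complexity label.

    The 'Adaptive-RAG w/ Oracle' setting corresponds to swapping in a classifier
    that returns the ground-truth complexity label for each query.
    """
    label = classify_complexity(question)
    # Fall back to the single-step strategy for an unrecognized label.
    return STRATEGIES.get(label, single_step_answer)(question)
```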
When evaluating adaptive models, it is essential to consider both task performance and efficiency, along with their trade-offs. Thus, we report results with five metrics: three measure effectiveness and two measure efficiency. In particular, for effectiveness, we use F1, EM, and Accuracy (Acc), following the standard evaluation protocol (Mallen et al., 2023; Baek et al., 2023; Asai et al., 2024), where F1 measures the word-level overlap between the predicted answer and the ground truth, EM measures whether the two are identical, and Acc measures whether the predicted answer contains the ground-truth answer. For efficiency, we measure the number of retrieval-and-generate steps and the average time for answering each query, relative to the Single-step Approach.
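For reference, the sketch below shows the three effectiveness metrics as they are conventionally computed in open-domain QA evaluation. The normalization (lowercasing, stripping punctuation and articles) follows the common SQuAD-style convention and is an assumption here; it may differ in small details from the authors' evaluation scripts.

```python
import re
import string
from collections import Counter

def normalize(text: str) -> str:
    """Lowercase, strip punctuation and articles, and collapse whitespace (SQuAD-style)."""
    text = text.lower()
    text = "".join(ch for ch in text if ch not in set(string.punctuation))
    text = re.sub(r"\b(a|an|the)\b", " ", text)
    return " ".join(text.split())

def exact_match(prediction: str, ground_truth: str) -> float:
    """EM: 1 if the normalized prediction equals the normalized ground truth."""
    return float(normalize(prediction) == normalize(ground_truth))

def f1_score(prediction: str, ground_truth: str) -> float:
    """F1: harmonic mean of precision and recall over overlapping words."""
    pred_tokens = normalize(prediction).split()
    gold_tokens = normalize(ground_truth).split()
    common = Counter(pred_tokens) & Counter(gold_tokens)
    num_same = sum(common.values())
    if num_same == 0:
        return 0.0
    precision = num_same / len(pred_tokens)
    recall = num_same / len(gold_tokens)
    return 2 * precision * recall / (precision + recall)

def accuracy(prediction: str, ground_truth: str) -> float:
    """Acc: 1 if the normalized prediction contains the normalized ground truth."""
    return float(normalize(ground_truth) in normalize(prediction))
```

As a quick usage example, for the prediction "in Paris, France" against the ground truth "Paris", exact_match is 0, f1_score is 0.5, and accuracy is 1, which illustrates how the three metrics differ in strictness.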
This paper is available on arxiv under CC0 1.0 DEED license.