
RAGged Edge: LLMs on a Retrieval Rollercoaster

by Data Augmentation, March 25th, 2025

Too Long; Didn't Read

This paper presents an adaptive retrieval-augmented LLM approach, evaluated against a range of baselines using comprehensive effectiveness and efficiency metrics, along with detailed implementation notes.

Authors:

(1) Soyeong Jeong, School of Computing;

(2) Jinheon Baek, Graduate School of AI;

(3) Sukmin Cho, School of Computing;

(4) Sung Ju Hwang, Korea Advanced Institute of Science and Technology;

(5) Jong C. Park, School of Computing.

Abstract and 1. Introduction

2 Related Work

3 Method and 3.1 Preliminaries

3.2 Adaptive-RAG: Adaptive Retrieval-Augmented Generation

4 Experimental Setups and 4.1 Datasets

4.2 Models and 4.3 Evaluation Metrics

4.4 Implementation Details

5 Experimental Results and Analyses

6 Conclusion, Limitations, Ethics Statement, Acknowledgements, and References


A Additional Experimental Setups

B Additional Experimental Results

4.2 Models

We compare our Adaptive-RAG against relevant models, including the three retrieval-augmented LLM strategies (in Section 3.1) and the adaptive retrieval approaches (Mallen et al., 2023; Asai et al., 2024), which can be grouped into three categories: Simple, Adaptive, and Complex. Specifically, Simple approaches include the 1) No Retrieval and 2) Single-step Approach-based methods. Adaptive approaches include 3) Adaptive Retrieval (Mallen et al., 2023), 4) Self-RAG (Asai et al., 2024), and our 5) Adaptive-RAG, all of which can adaptively perform retrieval based on the question complexity. For the Complex category, the 6) Multi-step Approach, we use the most sophisticated state-of-the-art method (Trivedi et al., 2023), which iteratively accesses both the retriever and the LLM with Chain-of-Thought reasoning (Wei et al., 2022b) for every query. Note that models across different categories are not directly comparable. In the ideal setting, however, Adaptive approaches should be more effective than those in the Simple category while being more efficient than those in the Complex one. Therefore, we also report the performance in an ideal scenario, 7) Adaptive-RAG w/ Oracle, which pairs our Adaptive-RAG with an oracle classifier.
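To make the distinction between the three categories concrete, the sketch below illustrates how a query could be routed to one of the three strategies based on a predicted complexity label. The strategy functions, label names, and the stub classifier are hypothetical placeholders for illustration, not the paper's implementation.

```python
# A minimal routing sketch, assuming hypothetical strategy functions and a
# classifier that maps each query to a complexity label.
from typing import Callable, Dict

def no_retrieval(query: str) -> str:
    return f"LLM answers '{query}' directly, without retrieval."          # placeholder

def single_step(query: str) -> str:
    return f"Retrieve once, then answer '{query}'."                        # placeholder

def multi_step(query: str) -> str:
    return f"Iteratively retrieve and reason (CoT) to answer '{query}'."   # placeholder

STRATEGIES: Dict[str, Callable[[str], str]] = {
    "A": no_retrieval,   # simplest queries
    "B": single_step,    # moderate queries
    "C": multi_step,     # complex, multi-hop queries
}

def adaptive_answer(query: str, predict_complexity: Callable[[str], str]) -> str:
    """Route each query to the strategy matching its predicted complexity."""
    label = predict_complexity(query)
    return STRATEGIES.get(label, single_step)(query)

# Example: a stub classifier that labels every query as complex.
print(adaptive_answer("Who directed the film that won Best Picture in 1998?", lambda q: "C"))
```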

4.3 Evaluation Metrics

When evaluating adaptive models, it is essential to consider both task performance and efficiency, along with their trade-offs. Thus, we report results with five metrics: three measure effectiveness and two measure efficiency. For effectiveness, we use F1, EM, and Accuracy (Acc), following the standard evaluation protocol (Mallen et al., 2023; Baek et al., 2023; Asai et al., 2024): F1 measures the word-level overlap between the predicted answer and the ground truth, EM measures whether the two are identical, and Acc measures whether the predicted answer contains the ground-truth answer. For efficiency, we measure the number of retrieval-and-generate steps and the average time for answering each query relative to the one-step approach.
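As a reference point, the following is a minimal sketch of the three effectiveness metrics, assuming SQuAD-style answer normalization (lowercasing, punctuation and article removal, whitespace tokenization); the paper's exact normalization may differ.

```python
# Sketch of the three effectiveness metrics: F1, EM, and Acc.
import re
import string
from collections import Counter

def normalize(text: str) -> str:
    """Lowercase, strip punctuation and articles, and collapse whitespace."""
    text = text.lower()
    text = "".join(ch for ch in text if ch not in set(string.punctuation))
    text = re.sub(r"\b(a|an|the)\b", " ", text)
    return " ".join(text.split())

def f1_score(prediction: str, ground_truth: str) -> float:
    """Word-level F1: harmonic mean of precision and recall over overlapping tokens."""
    pred_tokens = normalize(prediction).split()
    gold_tokens = normalize(ground_truth).split()
    common = Counter(pred_tokens) & Counter(gold_tokens)
    num_same = sum(common.values())
    if num_same == 0:
        return 0.0
    precision = num_same / len(pred_tokens)
    recall = num_same / len(gold_tokens)
    return 2 * precision * recall / (precision + recall)

def exact_match(prediction: str, ground_truth: str) -> float:
    """EM: 1 if the normalized prediction equals the normalized ground truth."""
    return float(normalize(prediction) == normalize(ground_truth))

def accuracy(prediction: str, ground_truth: str) -> float:
    """Acc: 1 if the normalized prediction contains the normalized ground-truth answer."""
    return float(normalize(ground_truth) in normalize(prediction))
```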


4.4 Implementation Details

For a fair comparison, and following Mallen et al. (2023) and Trivedi et al. (2023), we use the same retriever across all models: a term-based sparse retrieval model, BM25 (Robertson et al., 1994). For the external document corpus, we use different sources depending on the dataset type: the Wikipedia corpus preprocessed by Karpukhin et al. (2020) for single-hop datasets, and the corpus preprocessed by Trivedi et al. (2023) for multi-hop datasets. Regarding the LLMs used to generate answers, we use the FLAN-T5 series (Chung et al., 2022), namely XL with 3B parameters and XXL with 11B parameters, as well as the GPT-3.5 model (gpt-3.5-turbo-instruct). For the retrieval-augmented LLM design, we follow the implementation details of Trivedi et al. (2023), including input prompts, instructions, and the number of test samples for evaluation (e.g., 500 samples per dataset). In our Adaptive-RAG, we train the T5-Large model (Raffel et al., 2020) as the query-complexity classifier. Specifically, we train it with a learning rate of 3e-5 and AdamW (Loshchilov and Hutter, 2019) as the optimizer, and select the checkpoint that performs best on the validation set within 100 training iterations. Regarding its training data, we sample and annotate 400 queries from the 6 datasets according to their inductive biases (single-hop datasets for the one-step approach and multi-hop datasets for the multi-step approach). In addition, we use the predicted outcomes of the three different strategies over 400 queries sampled from each dataset. Note that the queries used for classifier training do not overlap with the test queries for QA.
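Below is a minimal sketch of how the query-complexity classifier could be trained, assuming it is framed as sequence-to-sequence generation of a single complexity label token ("A", "B", "C"); the dataset format, label names, and batch size are illustrative, while the backbone (T5-Large), optimizer (AdamW), and learning rate (3e-5) follow the paper.

```python
# Sketch of training a T5-Large query-complexity classifier as seq2seq label generation.
import torch
from torch.utils.data import DataLoader
from transformers import T5ForConditionalGeneration, T5Tokenizer

tokenizer = T5Tokenizer.from_pretrained("t5-large")
model = T5ForConditionalGeneration.from_pretrained("t5-large")
optimizer = torch.optim.AdamW(model.parameters(), lr=3e-5)

def collate(batch):
    # batch: list of (query, label) pairs, e.g. ("Who wrote Hamlet?", "B")
    queries, labels = zip(*batch)
    inputs = tokenizer(list(queries), padding=True, truncation=True, return_tensors="pt")
    targets = tokenizer(list(labels), padding=True, return_tensors="pt")
    inputs["labels"] = targets["input_ids"]
    return inputs

# Illustrative training pairs; in practice these would be the annotated queries.
train_pairs = [("What is the capital of France?", "A"),
               ("Who directed the film that won Best Picture in 1998?", "C")]
loader = DataLoader(train_pairs, batch_size=8, shuffle=True, collate_fn=collate)

model.train()
for batch in loader:
    loss = model(**batch).loss
    loss.backward()
    optimizer.step()
    optimizer.zero_grad()
    # In practice, evaluate on the validation set periodically and keep the
    # best-performing checkpoint within the training budget.
```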



Table 2: Results on each dataset with FLAN-T5-XL (3B) as the LLM. Our results are emphasized in bold.




Figure 3: Performance on QA and query-complexity assessment of different adaptive approaches for retrieval-augmented LLMs with FLAN-T5 XL (Left) and XXL (Center). For labeling the complexity of queries, we use the silver data annotated from the prediction outcomes of models (described in Section 3.2). We also provide the confusion matrix across three labels (Right).


This paper is available on arXiv under the CC0 1.0 DEED license.