
Models and Evaluation Metrics


Authors:

(1) Soyeong Jeong, School of Computing;

(2) Jinheon Baek, Graduate School of AI;

(3) Sukmin Cho, School of Computing;

(4) Sung Ju Hwang, Korea Advanced Institute of Science and Technology;

(5) Jong C. Park, School of Computing.

Table of Links

Abstract and 1. Introduction

2 Related Work

3 Method and 3.1 Preliminaries

3.2 Adaptive-RAG: Adaptive Retrieval-Augmented Generation

4 Experimental Setups and 4.1 Datasets

4.2 Models and 4.3 Evaluation Metrics

4.4 Implementation Details

5 Experimental Results and Analyses

6 Conclusion, Limitations, Ethics Statement, Acknowledgements, and References


A Additional Experimental Setups

B Additional Experimental Results

4.2 Models

We compare our Adaptive-RAG against relevant models, including the three retrieval-augmented LLM strategies (described in Section 3.1) and adaptive retrieval approaches (Mallen et al., 2023; Asai et al., 2024), which can be grouped into three categories: Simple, Adaptive, and Complex. Specifically, Simple approaches include the 1) No Retrieval and 2) Single-step Approach-based methods. Adaptive approaches include 3) Adaptive Retrieval (Mallen et al., 2023), 4) Self-RAG (Asai et al., 2024), and our 5) Adaptive-RAG, all of which adaptively decide whether and how to retrieve based on question complexity. The Complex category consists of the 6) Multi-step Approach, for which we use the most sophisticated state-of-the-art method (Trivedi et al., 2023), which iteratively accesses both the retriever and the LLM with Chain-of-Thought reasoning (Wei et al., 2022b) for every query. Note that models across different categories are not directly comparable; yet, in the ideal setting, Adaptive approaches should be more effective than those in the Simple category while being more efficient than those in the Complex one. Therefore, we also report the performance in an ideal scenario, 7) Adaptive-RAG w/ Oracle, which pairs our Adaptive-RAG with an oracle complexity classifier.
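To make the grouping concrete, below is a minimal sketch of how the compared strategies could be dispatched at inference time. The three-way complexity labels ('A' for simple, 'B' for moderate, 'C' for complex) follow the paper's scheme from Section 3.2, but every object and method name here (e.g., classifier.predict, retriever.search, llm.generate) is a hypothetical placeholder, not the authors' actual implementation.

```python
# Sketch of the compared strategies; interfaces are assumed placeholders.

def no_retrieval(question, llm):
    # 1) No Retrieval: answer directly from the LLM's parametric knowledge.
    return llm.generate(question)

def single_step(question, llm, retriever):
    # 2) Single-step Approach: retrieve once, then answer with the retrieved context.
    docs = retriever.search(question)
    return llm.generate(question, context=docs)

def multi_step(question, llm, retriever, max_steps=5):
    # 6) Multi-step Approach: interleave retrieval and Chain-of-Thought reasoning
    # (in the spirit of Trivedi et al., 2023) until an answer is produced.
    reasoning = []
    for _ in range(max_steps):
        docs = retriever.search(question, history=reasoning)
        thought = llm.generate(question, context=docs, history=reasoning)
        reasoning.append(thought)
        if "so the answer is" in thought.lower():  # assumed stopping phrase
            break
    return reasoning[-1]

def adaptive_rag(question, classifier, llm, retriever):
    # 5) Adaptive-RAG: a complexity classifier decides which strategy to run.
    complexity = classifier.predict(question)  # one of 'A', 'B', 'C'
    if complexity == "A":
        return no_retrieval(question, llm)
    elif complexity == "B":
        return single_step(question, llm, retriever)
    return multi_step(question, llm, retriever)
```

The Adaptive-RAG w/ Oracle variant corresponds to replacing the trained classifier above with one that always returns the ground-truth complexity label, which bounds how well the adaptive routing could perform.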

4.3 Evaluation Metrics

When it comes to evaluating adaptive models, it is essential to consider both task performance and efficiency, along with their trade-offs. Thus, we report results with five metrics: three measure effectiveness and two measure efficiency. In particular, for effectiveness, we use F1, EM, and Accuracy (Acc), following the standard evaluation protocol (Mallen et al., 2023; Baek et al., 2023; Asai et al., 2024), where F1 measures the word-level overlap (the harmonic mean of precision and recall over tokens) between the predicted answer and the ground truth, EM measures whether the two are exactly the same, and Acc measures whether the predicted answer contains the ground-truth answer. For efficiency, we measure the number of retrieval-and-generate steps and the average time taken to answer each query, reported relative to the single-step approach.
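As a concrete reference, the snippet below shows how the three effectiveness metrics are commonly computed in open-domain QA evaluation. It is an illustration rather than the authors' evaluation code, and the exact answer normalization they apply may differ.

```python
import re
import string
from collections import Counter

def normalize(text):
    # Typical answer normalization: lowercase, drop articles and punctuation, collapse spaces.
    text = text.lower()
    text = re.sub(r"\b(a|an|the)\b", " ", text)
    text = "".join(ch for ch in text if ch not in string.punctuation)
    return " ".join(text.split())

def exact_match(prediction, ground_truth):
    # EM: 1 if the normalized prediction equals the normalized ground truth, else 0.
    return float(normalize(prediction) == normalize(ground_truth))

def f1_score(prediction, ground_truth):
    # F1: harmonic mean of precision and recall over overlapping tokens.
    pred_tokens = normalize(prediction).split()
    gt_tokens = normalize(ground_truth).split()
    common = Counter(pred_tokens) & Counter(gt_tokens)
    overlap = sum(common.values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_tokens)
    recall = overlap / len(gt_tokens)
    return 2 * precision * recall / (precision + recall)

def accuracy(prediction, ground_truth):
    # Acc: 1 if the normalized prediction contains the normalized ground-truth answer.
    return float(normalize(ground_truth) in normalize(prediction))
```

The two efficiency metrics, by contrast, are simply tallied during inference: the number of retrieval-and-generate steps is counted per query, and per-query wall-clock time is averaged and then divided by the single-step baseline's average time.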


This paper is available on arxiv under CC0 1.0 DEED license.

