
Fine-Tuning LLaMA for Multi-Stage Text Retrieval

Too Long; Didn't Read

This study explores enhancing text retrieval using state-of-the-art LLaMA models. Fine-tuned as RepLLaMA and RankLLaMA, these models achieve superior effectiveness for both passage and document retrieval, leveraging their ability to handle longer contexts and exhibiting strong zero-shot performance.

Authors:

(1) Xueguang Ma, David R. Cheriton School of Computer Science, University of Waterloo;

(2) Liang Wang, Microsoft Research;

(3) Nan Yang, Microsoft Research;

(4) Furu Wei, Microsoft Research;

(5) Jimmy Lin, David R. Cheriton School of Computer Science, University of Waterloo.

Table of Links

Abstract and Introduction

Method

Experiments

Ablation Study and Analysis

Related Work

Conclusion, Acknowledgements and References

Abstract

The effectiveness of multi-stage text retrieval has been solidly demonstrated since before the era of pre-trained language models. However, most existing studies utilize models that predate recent advances in large language models (LLMs). This study seeks to explore potential improvements that state-of-the-art LLMs can bring. We conduct a comprehensive study, fine-tuning the latest LLaMA model both as a dense retriever (RepLLaMA) and as a pointwise reranker (RankLLaMA) for both passage retrieval and document retrieval using the MS MARCO datasets. Our findings demonstrate that the effectiveness of large language models indeed surpasses that of smaller models. Additionally, since LLMs can inherently handle longer contexts, they can represent entire documents holistically, obviating the need for traditional segmenting and pooling strategies. Furthermore, evaluations on BEIR demonstrate that our RepLLaMA–RankLLaMA pipeline exhibits strong zero-shot effectiveness. Model checkpoints from this study are available on HuggingFace.[1]

1 Introduction

Text retrieval, which entails identifying and ranking the most relevant documents or text snippets in response to a query, is crucial in various open-domain language comprehension tasks (Petroni et al., 2021), including web search (Bajaj et al., 2016), open-domain question answering (Chen et al., 2017), and fact verification (Thorne et al., 2018). Retrieval also plays an important role in enhancing the effectiveness of large language models (LLMs) in a retrieval-augmented generation (RAG) pipeline (Lewis et al., 2020b; Shi et al., 2023). This approach not only mitigates hallucinations but also enables LLMs to access knowledge that is not captured within their parameters (Yang et al., 2023; Jiang et al., 2023).


A typical multi-stage text retrieval pipeline consists of a retriever, designed to efficiently locate the top-k relevant texts from a corpus, and a reranker, which further refines the order of the retrieved candidates to improve output quality (Nogueira and Cho, 2019). Both retrievers and rerankers have significantly benefited from the advent of pre-trained language models based on Transformers (Vaswani et al., 2017) such as BERT (Devlin et al., 2019) and T5 (Raffel et al., 2020). These models are trained to encode queries and documents into vector representations for retrieval (Karpukhin et al., 2020; Lin, 2021) or to directly score the relevance between a query and a document for reranking (Nogueira et al., 2019; Zhuang et al., 2023).
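To make the division of labor concrete, here is a minimal sketch of such a retrieve-then-rerank pipeline built with Hugging Face transformers. The specific checkpoints and the mean-pooling step are illustrative assumptions, not the exact configurations studied in this paper.

```python
# Minimal retrieve-then-rerank sketch with Hugging Face transformers.
# Checkpoints and the mean-pooling step are illustrative assumptions,
# not the exact configurations studied in this paper.
import torch
import torch.nn.functional as F
from transformers import (AutoModel, AutoModelForSequenceClassification,
                          AutoTokenizer)

query = "how does a multi-stage retrieval pipeline work?"
corpus = [
    "A retriever efficiently selects top-k candidates from a large corpus.",
    "A reranker rescores those candidates to refine the final ordering.",
    "Unrelated text about cooking pasta.",
]

# Stage 1: bi-encoder retriever. Queries and passages are encoded
# independently, so passage vectors can be precomputed and indexed offline.
bi_name = "sentence-transformers/msmarco-MiniLM-L-6-v3"  # example checkpoint
bi_tok = AutoTokenizer.from_pretrained(bi_name)
bi_enc = AutoModel.from_pretrained(bi_name)

def embed(texts: list[str]) -> torch.Tensor:
    batch = bi_tok(texts, padding=True, truncation=True, return_tensors="pt")
    with torch.no_grad():
        hidden = bi_enc(**batch).last_hidden_state
    mask = batch["attention_mask"].unsqueeze(-1)
    return F.normalize((hidden * mask).sum(1) / mask.sum(1), dim=-1)  # mean pool

sims = (embed([query]) @ embed(corpus).T).squeeze(0)  # dot-product relevance
top_k = sims.topk(2).indices.tolist()                 # retriever's candidates

# Stage 2: pointwise cross-encoder reranker. Each (query, passage) pair is
# scored jointly -- more accurate, but too slow to run over the whole corpus.
ce_name = "cross-encoder/ms-marco-MiniLM-L-6-v2"  # example checkpoint
ce_tok = AutoTokenizer.from_pretrained(ce_name)
ce = AutoModelForSequenceClassification.from_pretrained(ce_name)
pairs = ce_tok([query] * len(top_k), [corpus[i] for i in top_k],
               padding=True, truncation=True, return_tensors="pt")
with torch.no_grad():
    rerank = ce(**pairs).logits.squeeze(-1)
print([top_k[i] for i in rerank.argsort(descending=True).tolist()])
```

The contrast matters operationally: the bi-encoder makes first-stage search tractable over millions of passages, while the jointly scored cross-encoder is reserved for the short candidate list it returns.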


Recent large language models with billions of parameters, fine-tuned to follow instructions, such as InstructGPT (Ouyang et al., 2022), GPT-4 (OpenAI, 2023), and LLaMA (Touvron et al., 2023a,b), have exhibited extraordinary capabilities in many NLP tasks, surpassing previous smaller pre-trained language models (Zhao et al., 2023). For retrieval, recent methods such as LRL (Ma et al., 2023), RankGPT (Sun et al., 2023), and PRP (Qin et al., 2023) have explored prompting LLMs to perform zero-shot reranking using pairwise or listwise approaches. These methods leverage LLMs by viewing reranking as text generation.
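As a rough sketch of the listwise flavor of this idea, the snippet below shows how reranking can be phrased as ordinary text generation; the prompt wording is a hypothetical stand-in rather than the exact templates published for LRL, RankGPT, or PRP.

```python
# Schematic listwise reranking prompt; the wording is illustrative,
# not the published template of LRL, RankGPT, or PRP.
def build_listwise_prompt(query: str, passages: list[str]) -> str:
    numbered = "\n".join(f"[{i + 1}] {p}" for i, p in enumerate(passages))
    return (
        "Rank the following passages by relevance to the query.\n"
        f"Query: {query}\n\n{numbered}\n\n"
        "Answer with identifiers from most to least relevant, "
        "e.g. [2] > [1] > [3]."
    )

# The LLM's generated string (e.g. "[2] > [1] > [3]") is parsed back into a
# permutation of the candidates: reranking happens purely through generation,
# with no task-specific fine-tuning or labeled data.
```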


However, we see a number of potential issues. First, these methods do not address the entire multi-stage pipeline, as it is challenging to cast retrieval from a large corpus as a text generation task. Second, they do not leverage labeled data when available. Finally, these rerankers are not efficient because they do not support parallel scoring and are slowed by their multi-pass decoding design.


Therefore, we argue that fine-tuning state-of-the-art large language models to function as retrievers and rerankers can yield better effectiveness than previous smaller models. This approach can also optimally utilize LLMs within multi-stage pipelines. Thus, we are motivated to investigate the following research question: How do state-of-the-art large language models perform when specifically fine-tuned for multi-stage text retrieval?


Our study aims to answer this question by conducting a comprehensive investigation into fine-tuning the latest LLaMA-2 model (Touvron et al., 2023b), a state-of-the-art, open-source large language model, as both a retriever and a reranker, which we refer to as RepLLaMA and RankLLaMA, respectively. Specifically, we utilize the MS MARCO (Bajaj et al., 2016) and BEIR (Thakur et al., 2021) datasets for our experiments. Our findings suggest that large language models surpass previous smaller models, achieving state-of-the-art effectiveness for both retrieval and reranking through a straightforward training regime and exhibiting strong zero-shot effectiveness. Furthermore, we observe that LLMs, which are inherently pre-trained on longer contexts, demonstrate potential in representing entire documents, thereby eliminating the need for traditional segmenting and pooling strategies for document retrieval.
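The sketch below illustrates the shared mechanism behind the two roles, assuming the end-of-sequence-token representation the paper describes: the final-layer hidden state of an appended </s> token serves either as a dense embedding (RepLLaMA) or as the input to a scalar scoring head (RankLLaMA). The prompt templates, the freshly initialized head, and the checkpoint choice are simplifications for illustration; the gated meta-llama/Llama-2-7b-hf checkpoint could be swapped for any open decoder-only model.

```python
# Sketch of both fine-tuned roles, assuming the end-of-sequence-token
# representation described in the paper; templates and head are simplified.
import torch
from transformers import AutoModel, AutoTokenizer

name = "meta-llama/Llama-2-7b-hf"  # gated; any open decoder works for the sketch
tok = AutoTokenizer.from_pretrained(name)
llm = AutoModel.from_pretrained(name)
score_head = torch.nn.Linear(llm.config.hidden_size, 1)  # trained jointly in practice

def eos_embedding(text: str) -> torch.Tensor:
    """Final-layer hidden state of an appended </s> token as the text vector."""
    batch = tok(text + tok.eos_token, return_tensors="pt")
    with torch.no_grad():
        hidden = llm(**batch).last_hidden_state
    return hidden[0, -1]  # last position holds the </s> representation

# RepLLaMA-style retrieval: encode query and passage independently,
# score by dot product so passage vectors can be indexed offline.
q = eos_embedding("query: what is multi-stage retrieval?")
p = eos_embedding("passage: a retriever finds candidates; a reranker refines them.")
retrieval_score = q @ p

# RankLLaMA-style pointwise reranking: score the concatenated pair directly.
pair = eos_embedding("query: what is multi-stage retrieval? "
                     "document: a retriever finds candidates; a reranker refines them.")
rerank_score = score_head(pair)
```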


This paper is available on arXiv under a CC BY 4.0 license.



[1] https://huggingface.co/castorini