
Improving Text Embeddings with Large Language Models: Related Work

by Auto Encoder: How to Ignore the Signal Noise
October 9th, 2024

Too Long; Didn't Read

This paper introduces a novel method for generating high-quality text embeddings using synthetic data, achieving state-of-the-art results with minimal training.
STORY’S CREDIBILITY

Academic Research Paper

Part of HackerNoon's growing list of open-source research papers, promoting free access to academic material.

Authors:

(1) Liang Wang, Microsoft Corporation, and Correspondence to (wangliang@microsoft.com);

(2) Nan Yang, Microsoft Corporation, and Correspondence to (nanya@microsoft.com);

(3) Xiaolong Huang, Microsoft Corporation;

(4) Linjun Yang, Microsoft Corporation;

(5) Rangan Majumder, Microsoft Corporation;

(6) Furu Wei, Microsoft Corporation and Correspondence to (fuwei@microsoft.com).

Abstract and 1 Introduction

2 Related Work

3 Method

3.1 Synthetic Data Generation

3.2 Training

4 Experiments

4.1 Statistics of the Synthetic Data

4.2 Model Fine-tuning and Evaluation

4.3 Main Results

4.4 Multilingual Retrieval

5 Analysis

5.1 Is Contrastive Pre-training Necessary?

5.2 Extending to Long Text Embeddings and 5.3 Analysis of Training Hyperparameters

6 Conclusion and References

A Implementation Details

B Test Set Contamination Analysis

C Prompts for Synthetic Data Generation

D Instructions for Training and Evaluation

Text Embeddings

Text embeddings are continuous low-dimensional representations of text and have been extensively applied to various downstream tasks such as information retrieval, question answering, and retrieval-augmented generation (RAG). Early work on text embeddings includes latent semantic indexing [10] and weighted averages of word embeddings [25]. More recent methods exploit supervision from natural language inference [3] and labeled query-document pairs, such as the MS-MARCO passage ranking dataset [5], to train text embeddings [37, 6, 13]. However, labeled data are often limited in terms of task diversity and language coverage. To address this challenge, methods like Contriever [18], OpenAI Embeddings [30], E5 [46], and BGE [48] adopt a multi-stage training paradigm: they first pre-train on large-scale weakly-supervised text pairs with a contrastive loss and then fine-tune on small-scale but high-quality datasets. In this paper, we demonstrate that it is possible to obtain state-of-the-art text embeddings with single-stage training.
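The contrastive pre-training and fine-tuning stages mentioned above typically optimize an InfoNCE-style objective with in-batch negatives. The following is a minimal PyTorch sketch of that loss, not the exact setup of any cited system; the tensor shapes and temperature value are illustrative assumptions.

```python
import torch
import torch.nn.functional as F

def info_nce_loss(query_emb: torch.Tensor,
                  doc_emb: torch.Tensor,
                  temperature: float = 0.05) -> torch.Tensor:
    """InfoNCE with in-batch negatives: row i of doc_emb is the positive
    for row i of query_emb; all other rows in the batch act as negatives."""
    q = F.normalize(query_emb, dim=-1)
    d = F.normalize(doc_emb, dim=-1)
    logits = q @ d.T / temperature                      # (batch, batch) cosine similarities
    labels = torch.arange(q.size(0), device=q.device)   # positives lie on the diagonal
    return F.cross_entropy(logits, labels)

# usage: loss = info_nce_loss(encoder(query_batch), encoder(document_batch))
```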


Synthetic Data

Synthetic data generation is a widely studied topic in information retrieval research, with various methods proposed to enhance retrieval systems with artificially created data. For instance, Doc2query [33], InPars [2], and Promptagator [8] generate synthetic queries for unlabeled documents, which are then leveraged for document expansion or model training. GPL [45] employs a cross-encoder to produce pseudo-labels for query-document pairs. Similarly, Query2doc [47] generates pseudo-documents for query expansion by few-shot prompting LLMs. Unlike these methods, our approach does not rely on any unlabeled documents or queries and can therefore generate more diverse synthetic data.
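For contrast with our setup, the Doc2query/InPars-style pipelines above start from an existing unlabeled document and ask an LLM to invent a query for it. The sketch below illustrates that pattern only; the `generate` callable and the prompt wording are hypothetical, not the prompts used in the cited papers.

```python
def make_synthetic_query(document: str, generate) -> str:
    """Doc2query/InPars-style generation: given an existing unlabeled
    document, ask an LLM to write a query that the document answers.
    `generate` is any text-completion callable (hypothetical placeholder)."""
    prompt = (
        "Write a search query that the following passage answers.\n\n"
        f"Passage: {document}\n\nQuery:"
    )
    return generate(prompt).strip()

# Our method differs: it does not start from seed documents or queries;
# queries, positives, and hard negatives are all generated from scratch.
```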


Another related line of work focuses on knowledge distillation from black-box LLMs by training on synthetic data generated by them. DINO [39] generates synthetic text pairs for semantic textual similarity. Unnatural Instructions [16] is a synthetic instruction-following dataset built by prompting existing LLMs. Orca [29] and Phi [15] propose to train better small language models on high-quality synthetic data from GPT-3.5/4 [34].


Large Language Models

With the popularization of ChatGPT, large language models (LLMs) have demonstrated remarkable capabilities in instruction following and few-shot in-context learning [4]. However, the most advanced LLMs, such as GPT-4 [34], are proprietary and disclose few technical details. To bridge the gap between proprietary and open-source LLMs, several notable efforts have been made, such as the LLaMA-2 [44] and Mistral [19] models. A major limitation of LLMs is their lack of awareness of recent events and private knowledge. This issue can be partly mitigated by augmenting LLMs with information retrieved from external sources, a technique known as retrieval-augmented generation (RAG). On the other hand, LLMs can also serve as foundation models to enhance text embeddings. RepLLaMA [24] proposes fine-tuning LLaMA-2 with a bi-encoder architecture for ad-hoc retrieval. SGPT [27], GTR [32], and Udever [51] demonstrate the scaling law of text embeddings empirically, but their performance still falls behind that of small bidirectional encoders such as E5 [46] and BGE [48]. In this paper, we present a novel approach to train state-of-the-art text embeddings by exploiting the latest advances in LLMs and synthetic data.
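As a concrete illustration of using a decoder-only LLM as a bi-encoder (the RepLLaMA-style setup referenced above), one common recipe is to encode each text and pool the hidden state of its last token. The snippet below is a hedged sketch using Hugging Face transformers; the backbone name and pooling choice are illustrative assumptions, not the configuration used in this paper.

```python
import torch
import torch.nn.functional as F
from transformers import AutoModel, AutoTokenizer

model_name = "mistralai/Mistral-7B-v0.1"       # illustrative backbone choice
tokenizer = AutoTokenizer.from_pretrained(model_name)
tokenizer.padding_side = "right"               # so the last real token is easy to locate
if tokenizer.pad_token is None:
    tokenizer.pad_token = tokenizer.eos_token  # decoder-only LLMs often lack a pad token
model = AutoModel.from_pretrained(model_name)

@torch.no_grad()
def embed(texts: list[str]) -> torch.Tensor:
    """Last-token pooling: take the final hidden state of each sequence's
    last non-padding token as its embedding, then L2-normalize."""
    batch = tokenizer(texts, padding=True, truncation=True, return_tensors="pt")
    hidden = model(**batch).last_hidden_state      # (batch, seq_len, dim)
    last = batch["attention_mask"].sum(dim=1) - 1  # index of each last real token
    emb = hidden[torch.arange(hidden.size(0)), last]
    return F.normalize(emb, dim=-1)

# usage: scores = embed(["query text"]) @ embed(["candidate document"]).T
```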


Figure 1: An example two-step prompt template for generating synthetic data with GPT-4. We first prompt GPT-4 to brainstorm a list of potential retrieval tasks, and then generate (query, positive, hard negative) triplets for each task. “{...}” denotes a placeholder that will be replaced by sampling from a predefined set of values. Full prompts are available in Appendix C.

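The placeholder-filling in this two-step template can be pictured as sampling attribute values into a prompt string before calling GPT-4. The sketch below is purely schematic; the attribute names, value sets, and wording are hypothetical stand-ins, and the actual prompts are given in Appendix C.

```python
import random

# Hypothetical placeholder value sets; the real ones appear in Appendix C.
QUERY_LENGTH = ["less than 5 words", "5 to 15 words", "at least 10 words"]
DIFFICULTY = ["high school", "college", "PhD"]

TRIPLET_TEMPLATE = (
    "You have been assigned a retrieval task: {task}\n"
    "Generate a JSON object containing a user query ({length}, {difficulty} level), "
    "a relevant positive document, and a hard negative document."
)

def build_triplet_prompt(task: str) -> str:
    """Step 2 of the two-step pipeline: fill sampled placeholder values into
    the triplet-generation template for one brainstormed retrieval task."""
    return TRIPLET_TEMPLATE.format(
        task=task,
        length=random.choice(QUERY_LENGTH),
        difficulty=random.choice(DIFFICULTY),
    )
```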


This paper is available on arXiv under the CC0 1.0 DEED license.

