
Improving Text Embeddings with Large Language Models: Related Work

by Auto Encoder: How to Ignore the Signal Noise
October 9th, 2024

Too Long; Didn't Read

This paper introduces a novel method for generating high-quality text embeddings using synthetic data, achieving state-of-the-art results with minimal training.
STORY’S CREDIBILITY

Academic Research Paper

Part of HackerNoon's growing list of open-source research papers, promoting free access to academic material.

Authors:

(1) Liang Wang, Microsoft Corporation, and Correspondence to (wangliang@microsoft.com);

(2) Nan Yang, Microsoft Corporation, and Correspondence to (nanya@microsoft.com);

(3) Xiaolong Huang, Microsoft Corporation;

(4) Linjun Yang, Microsoft Corporation;

(5) Rangan Majumder, Microsoft Corporation;

(6) Furu Wei, Microsoft Corporation and Correspondence to (fuwei@microsoft.com).

Abstract and 1 Introduction

2 Related Work

3 Method

3.1 Synthetic Data Generation

3.2 Training

4 Experiments

4.1 Statistics of the Synthetic Data

4.2 Model Fine-tuning and Evaluation

4.3 Main Results

4.4 Multilingual Retrieval

5 Analysis

5.1 Is Contrastive Pre-training Necessary?

5.2 Extending to Long Text Embeddings and 5.3 Analysis of Training Hyperparameters

6 Conclusion and References

A Implementation Details

B Test Set Contamination Analysis

C Prompts for Synthetic Data Generation

D Instructions for Training and Evaluation

Text Embeddings

Text embeddings are continuous low-dimensional representations of text and have been extensively applied to various downstream tasks such as information retrieval, question answering, and retrieval-augmented generation (RAG). Early work on text embeddings includes latent semantic indexing [10] and weighted averages of word embeddings [25]. More recent methods exploit supervision from natural language inference [3] and labeled query-document pairs, such as the MS-MARCO passage ranking dataset [5], to train text embeddings [37, 6, 13]. However, labeled data are often limited in terms of task diversity and language coverage. To address this challenge, methods like Contriever [18], OpenAI Embeddings [30], E5 [46], and BGE [48] adopt a multi-stage training paradigm: they first pre-train on large-scale weakly-supervised text pairs with a contrastive loss and then fine-tune on small-scale but high-quality datasets. In this paper, we demonstrate that it is possible to obtain state-of-the-art text embeddings with single-stage training.
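The contrastive pre-training and fine-tuning stages mentioned above typically optimize an InfoNCE-style objective with in-batch negatives. The following is a minimal PyTorch sketch of that loss, not the exact setup of any cited system; the tensor shapes and temperature value are illustrative assumptions.

```python
import torch
import torch.nn.functional as F

def info_nce_loss(query_emb: torch.Tensor,
                  doc_emb: torch.Tensor,
                  temperature: float = 0.05) -> torch.Tensor:
    """InfoNCE with in-batch negatives: row i of doc_emb is the positive
    for row i of query_emb; all other rows in the batch act as negatives."""
    q = F.normalize(query_emb, dim=-1)
    d = F.normalize(doc_emb, dim=-1)
    logits = q @ d.T / temperature                      # (batch, batch) cosine similarities
    labels = torch.arange(q.size(0), device=q.device)   # positives lie on the diagonal
    return F.cross_entropy(logits, labels)

# usage: loss = info_nce_loss(encoder(query_batch), encoder(document_batch))
```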


Synthetic Data

Synthetic data generation is a widely studied topic in information retrieval research, with various methods proposed to enhance retrieval systems with artificially created data. For instance, Doc2query [33], InPars [2], and Promptagator [8] generate synthetic queries for unlabeled documents, which are then leveraged for document expansion or model training. GPL [45] employs a cross-encoder to produce pseudo-labels for query-document pairs. Similarly, Query2doc [47] generates pseudo-documents for query expansion by few-shot prompting LLMs. Unlike these methods, our approach does not rely on any unlabeled documents or queries and can therefore generate more diverse synthetic data.
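For contrast with our setup, the Doc2query/InPars-style pipelines above start from an existing unlabeled document and ask an LLM to invent a query for it. The sketch below illustrates that pattern only; the `generate` callable and the prompt wording are hypothetical, not the prompts used in the cited papers.

```python
def make_synthetic_query(document: str, generate) -> str:
    """Doc2query/InPars-style generation: given an existing unlabeled
    document, ask an LLM to write a query that the document answers.
    `generate` is any text-completion callable (hypothetical placeholder)."""
    prompt = (
        "Write a search query that the following passage answers.\n\n"
        f"Passage: {document}\n\nQuery:"
    )
    return generate(prompt).strip()

# Our method differs: it does not start from seed documents or queries;
# queries, positives, and hard negatives are all generated from scratch.
```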


Another related line of work focuses on knowledge distillation from black-box LLMs by training on synthetic data generated by them. DINO [39] generates synthetic text pairs for semantic textual similarity. Unnatural Instructions [16] is a synthetic instruction-following dataset built by prompting existing LLMs. Orca [29] and Phi [15] propose to train better small language models on high-quality synthetic data from GPT-3.5/4 [34].


Large Language Models

With the popularization of ChatGPT, large language models (LLMs) have demonstrated remarkable capabilities in instruction following and few-shot in-context learning [4]. However, the most advanced LLMs, such as GPT-4 [34], are proprietary and disclose few technical details. To bridge the gap between proprietary and open-source LLMs, several notable efforts have been made, such as the LLaMA-2 [44] and Mistral [19] models. A major limitation of LLMs is their lack of awareness of recent events and private knowledge. This issue can be partly mitigated by augmenting LLMs with information retrieved from external sources, a technique known as retrieval-augmented generation (RAG). On the other hand, LLMs can also serve as foundation models to enhance text embeddings. RepLLaMA [24] proposes fine-tuning LLaMA-2 with a bi-encoder architecture for ad-hoc retrieval. SGPT [27], GTR [32], and Udever [51] demonstrate the scaling law of text embeddings empirically, but their performance still falls behind that of small bidirectional encoders such as E5 [46] and BGE [48]. In this paper, we present a novel approach to train state-of-the-art text embeddings by exploiting the latest advances in LLMs and synthetic data.
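As a concrete illustration of using a decoder-only LLM as a bi-encoder (the RepLLaMA-style setup referenced above), one common recipe is to encode each text and pool the hidden state of its last token. The snippet below is a hedged sketch using Hugging Face transformers; the backbone name and pooling choice are illustrative assumptions, not the configuration used in this paper.

```python
import torch
import torch.nn.functional as F
from transformers import AutoModel, AutoTokenizer

model_name = "mistralai/Mistral-7B-v0.1"       # illustrative backbone choice
tokenizer = AutoTokenizer.from_pretrained(model_name)
tokenizer.padding_side = "right"               # so the last real token is easy to locate
if tokenizer.pad_token is None:
    tokenizer.pad_token = tokenizer.eos_token  # decoder-only LLMs often lack a pad token
model = AutoModel.from_pretrained(model_name)

@torch.no_grad()
def embed(texts: list[str]) -> torch.Tensor:
    """Last-token pooling: take the final hidden state of each sequence's
    last non-padding token as its embedding, then L2-normalize."""
    batch = tokenizer(texts, padding=True, truncation=True, return_tensors="pt")
    hidden = model(**batch).last_hidden_state      # (batch, seq_len, dim)
    last = batch["attention_mask"].sum(dim=1) - 1  # index of each last real token
    emb = hidden[torch.arange(hidden.size(0)), last]
    return F.normalize(emb, dim=-1)

# usage: scores = embed(["query text"]) @ embed(["candidate document"]).T
```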


Figure 1: An example two-step prompt template for generating synthetic data with GPT-4. We first prompt GPT-4 to brainstorm a list of potential retrieval tasks, and then generate (query, positive, hard negative) triplets for each task. “{...}” denotes a placeholder that will be replaced by sampling from a predefined set of values. Full prompts are available in Appendix C.

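The placeholder-filling in this two-step template can be pictured as sampling attribute values into a prompt string before calling GPT-4. The sketch below is purely schematic; the attribute names, value sets, and wording are hypothetical stand-ins, and the actual prompts are given in Appendix C.

```python
import random

# Hypothetical placeholder value sets; the real ones appear in Appendix C.
QUERY_LENGTH = ["less than 5 words", "5 to 15 words", "at least 10 words"]
DIFFICULTY = ["high school", "college", "PhD"]

TRIPLET_TEMPLATE = (
    "You have been assigned a retrieval task: {task}\n"
    "Generate a JSON object containing a user query ({length}, {difficulty} level), "
    "a relevant positive document, and a hard negative document."
)

def build_triplet_prompt(task: str) -> str:
    """Step 2 of the two-step pipeline: fill sampled placeholder values into
    the triplet-generation template for one brainstormed retrieval task."""
    return TRIPLET_TEMPLATE.format(
        task=task,
        length=random.choice(QUERY_LENGTH),
        difficulty=random.choice(DIFFICULTY),
    )
```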


This paper is available on arXiv under the CC0 1.0 DEED license.

