Improving Text Embeddings with Large Language Models: Prompts for Synthetic Data Generation

Too Long; Didn't Read

This paper introduces a novel method for generating high-quality text embeddings using synthetic data, achieving state-of-the-art results with minimal training.

Authors:

(1) Liang Wang, Microsoft Corporation (correspondence to wangliang@microsoft.com);

(2) Nan Yang, Microsoft Corporation (correspondence to nanya@microsoft.com);

(3) Xiaolong Huang, Microsoft Corporation;

(4) Linjun Yang, Microsoft Corporation;

(5) Rangan Majumder, Microsoft Corporation;

(6) Furu Wei, Microsoft Corporation (correspondence to fuwei@microsoft.com).

Table of Links

Abstract and 1 Introduction

2 Related Work

3 Method

3.1 Synthetic Data Generation

3.2 Training

4 Experiments

4.1 Statistics of the Synthetic Data

4.2 Model Fine-tuning and Evaluation

4.3 Main Results

4.4 Multilingual Retrieval

5 Analysis

5.1 Is Contrastive Pre-training Necessary?

5.2 Extending to Long Text Embeddings and 5.3 Analysis of Training Hyperparameters

6 Conclusion and References

A Implementation Details

B Test Set Contamination Analysis

C Prompts for Synthetic Data Generation

D Instructions for Training and Evaluation

C Prompts for Synthetic Data Generation

For asymmetric tasks, we list the four prompt templates in Tables 7, 8, 9, and 10. For symmetric tasks, the prompt templates are available in Tables 11 and 12. To generate multilingual data, we sample the value of “{language}” from the language list of XLM-R [7], with higher probability for high-resource languages. When prompting GPT-4/3.5, we set the temperature to 1.0 and the top-p to 1.0, which is higher than the default setting, to encourage more diversity.
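The language sampling and decoding settings described above can be sketched roughly as follows. This is only an illustration under stated assumptions: the language subset, the weights, the template string, and the helper names `sample_language` and `build_request` are all hypothetical and are not taken from the paper; only the “{language}” placeholder and the temperature and top-p values of 1.0 come from the text. No API call is made; the sketch just assembles the request parameters.

```python
import random

# Hypothetical subset of languages with made-up weights (assumption, not
# from the paper): larger weights stand in for high-resource languages.
LANGUAGE_WEIGHTS = {
    "English": 10.0,
    "Chinese": 6.0,
    "Spanish": 5.0,
    "Swahili": 1.0,
    "Icelandic": 1.0,
}

# Decoding settings stated in the text.
SAMPLING_PARAMS = {"temperature": 1.0, "top_p": 1.0}


def sample_language(rng: random.Random) -> str:
    """Draw one language, favoring entries with larger weights."""
    languages = list(LANGUAGE_WEIGHTS)
    weights = [LANGUAGE_WEIGHTS[name] for name in languages]
    return rng.choices(languages, weights=weights, k=1)[0]


def build_request(template: str, rng: random.Random) -> dict:
    """Fill the {language} placeholder and attach the decoding settings.

    str.replace is used instead of str.format so that any other braces in
    a template (for example a JSON output spec) are left untouched.
    """
    language = sample_language(rng)
    prompt = template.replace("{language}", language)
    return {"prompt": prompt, "language": language, **SAMPLING_PARAMS}


if __name__ == "__main__":
    rng = random.Random(0)
    # Hypothetical template text, used only to show the substitution.
    request = build_request("Write a short search query in {language}.", rng)
    print(request)
```

Passing an explicit `random.Random` instance keeps the sampling reproducible, which can help when regenerating or auditing a synthetic dataset.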

This paper is available on arXiv under the CC0 1.0 DEED license.
