Authors:
(1) Liang Wang, Microsoft Corporation, and correspondence to (wangliang@microsoft.com);
(2) Nan Yang, Microsoft Corporation, and correspondence to (nanya@microsoft.com);
(3) Xiaolong Huang, Microsoft Corporation;
(4) Linjun Yang, Microsoft Corporation;
(5) Rangan Majumder, Microsoft Corporation;
(6) Furu Wei, Microsoft Corporation, and correspondence to (fuwei@microsoft.com).

Table of Links

Abstract and 1 Introduction
2 Related Work
3 Method
3.1 Synthetic Data Generation
3.2 Training
4 Experiments
4.1 Statistics of the Synthetic Data
4.2 Model Fine-tuning and Evaluation
4.3 Main Results
4.4 Multilingual Retrieval
5 Analysis
5.1 Is Contrastive Pre-training Necessary?
5.2 Extending to Long Text Embeddings and 5.3 Analysis of Training Hyperparameters
6 Conclusion and References
A Implementation Details
B Test Set Contamination Analysis
C Prompts for Synthetic Data Generation
D Instructions for Training and Evaluation

3 Method

3.1 Synthetic Data Generation

Utilizing synthetic data generated by advanced LLMs such as GPT-4 presents a compelling opportunity, especially in terms of enhancing diversity across a multitude of tasks and languages. Such diversity is essential for developing robust text embeddings that can perform well across different tasks, be it semantic retrieval, textual similarity, or clustering.

To generate diverse synthetic data, we propose a simple taxonomy that categorizes embedding tasks into several groups, and then apply different prompt templates to each group.

Asymmetric Tasks

This category comprises tasks where the query and document are semantically related but are not paraphrases of each other. Depending on the length of the query and document, we further divide asymmetric tasks into four subgroups: short-long match, long-short match, short-short match, and long-long match. For instance, short-long match tasks involve a short query and a long document, which is a typical scenario in commercial search engines. For each subgroup, we design a two-step prompt template that first prompts LLMs to brainstorm a list of tasks, and then generates a concrete example conditioned on the task definition. In Figure 1, we show an example prompt for the short-long match subgroup. The outputs from GPT-4 are mostly coherent and of high quality.
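
The two-step prompting described above can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the prompt wording, the `llm` callable, the `fake_llm` stand-in, and the JSON keys `query` and `document` are all hypothetical, and the actual template is the one shown in the paper's Figure 1.

```python
import json
from typing import Callable, List

# Hypothetical prompt wording; the paper's actual template is in its Figure 1.
BRAINSTORM_PROMPT = (
    "Brainstorm a list of {n} text retrieval tasks where a short query "
    "is matched against a long document. Return a JSON list of strings."
)
EXAMPLE_PROMPT = (
    "You have been assigned a retrieval task: {task}\n"
    "Write one example as a JSON object with the keys "
    '"query" and "document".'
)


def two_step_generate(llm: Callable[[str], str], n_tasks: int = 3) -> List[dict]:
    """Step 1: brainstorm task definitions. Step 2: one example per task."""
    tasks = json.loads(llm(BRAINSTORM_PROMPT.format(n=n_tasks)))
    examples = []
    for task in tasks:
        # The second prompt is conditioned on the task definition from step 1.
        example = json.loads(llm(EXAMPLE_PROMPT.format(task=task)))
        example["task"] = task
        examples.append(example)
    return examples


def fake_llm(prompt: str) -> str:
    """Stand-in for a real LLM call (an assumption), so the sketch runs offline."""
    if prompt.startswith("Brainstorm"):
        return json.dumps([
            "Find a recipe given a dish name",
            "Retrieve a legal clause given a question",
        ])
    return json.dumps({"query": "placeholder query",
                       "document": "placeholder document"})
```

Calling `two_step_generate(fake_llm, n_tasks=2)` returns one record per brainstormed task. Passing the model in as a callable keeps the two steps visible without tying the sketch to any particular API.
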
In our preliminary experiments, we also attempted to generate the task definition and query-document pairs using a single prompt, but the data diversity was not as satisfactory as the proposed two-step approach.

Symmetric Tasks

Symmetric tasks involve queries and documents that have similar semantic meanings but different surface forms. We examine two application scenarios: monolingual semantic textual similarity (STS) and bitext retrieval. We design two distinct prompt templates for each scenario, tailored to their specific objectives. Since the task definition is straightforward, we omit the brainstorming step for symmetric tasks.

To further boost the diversity of the prompts and thus the synthetic data, we incorporate several placeholders in each prompt template, whose values are randomly sampled at runtime. For example, in Figure 1, the value of “{query_length}” is sampled from the set “{less than 5 words, 5-10 words, at least 10 words}”.

To generate multilingual data, we sample the value of “{language}” from the language list of XLM-R [7], giving more weight to high-resource languages. Any generated data that does not conform to the predefined JSON format is discarded during parsing. We also remove duplicates based on exact string matching.

This paper is available on arXiv under the CC0 1.0 DEED license.
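
The placeholder sampling and output filtering described in Section 3.1 could look roughly like the sketch below. Only the placeholder names `{query_length}` and `{language}` and the three query-length values come from the text. The template wording, the short language list and its weights, the `query`/`document` schema, and the choice to deduplicate on the (query, document) pair are illustrative assumptions.

```python
import json
import random
from typing import Dict, Iterable, List, Optional

# Values for {query_length} are quoted from the text.
QUERY_LENGTHS = ["less than 5 words", "5-10 words", "at least 10 words"]
# Illustrative assumption: a tiny stand-in for the XLM-R language list.
LANGUAGES = ["English", "Chinese", "Swahili"]
LANGUAGE_WEIGHTS = [0.6, 0.3, 0.1]  # hypothetical; favors high-resource languages

# Hypothetical template; only the two placeholder names come from the text.
TEMPLATE = "Write a query of {query_length} in {language} as a JSON object."

REQUIRED_KEYS = {"query", "document"}  # hypothetical schema


def fill_template(rng: random.Random) -> str:
    """Sample placeholder values at runtime and fill the prompt template."""
    return TEMPLATE.format(
        query_length=rng.choice(QUERY_LENGTHS),
        language=rng.choices(LANGUAGES, weights=LANGUAGE_WEIGHTS, k=1)[0],
    )


def parse_or_none(raw: str) -> Optional[Dict[str, str]]:
    """Return the parsed record, or None if it does not match the schema."""
    try:
        record = json.loads(raw)
    except json.JSONDecodeError:
        return None
    if not isinstance(record, dict) or set(record) != REQUIRED_KEYS:
        return None
    if not all(isinstance(value, str) for value in record.values()):
        return None
    return record


def filter_and_dedup(raw_outputs: Iterable[str]) -> List[Dict[str, str]]:
    """Drop malformed outputs, then remove exact-string duplicates."""
    seen = set()
    kept = []
    for raw in raw_outputs:
        record = parse_or_none(raw)
        if record is None:
            continue  # does not conform to the expected JSON format
        key = (record["query"], record["document"])
        if key not in seen:
            seen.add(key)
            kept.append(record)
    return kept
```

Seeding `random.Random` makes the sampled prompts reproducible, and `random.choices` with `weights` is one simple way to favor high-resource languages. The paper does not state the weighting scheme it used.
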

Part of HackerNoon's growing list of open-source research papers, promoting free access to academic material.

Improving Text Embeddings with Large Language Models: Synthetic Data Generation
