Improving Text Embeddings with Large Language Models: Instructions for Training and Evaluation

Written by autoencoder | Published 2024/10/10
Tags: multilingual-ai | text-embeddings | synthetic-data-generation | natural-language-processing | contrastive-pre-training | language-models | beir-benchmark | ai-for-information-retrieval

Authors:

(1) Liang Wang, Microsoft Corporation (correspondence to wangliang@microsoft.com);

(2) Nan Yang, Microsoft Corporation (correspondence to nanya@microsoft.com);

(3) Xiaolong Huang, Microsoft Corporation;

(4) Linjun Yang, Microsoft Corporation;

(5) Rangan Majumder, Microsoft Corporation;

(6) Furu Wei, Microsoft Corporation (correspondence to fuwei@microsoft.com).

Table of Links

Abstract and 1 Introduction

2 Related Work

3 Method

3.1 Synthetic Data Generation

3.2 Training

4 Experiments

4.1 Statistics of the Synthetic Data

4.2 Model Fine-tuning and Evaluation

4.3 Main Results

4.4 Multilingual Retrieval

5 Analysis

5.1 Is Contrastive Pre-training Necessary?

5.2 Extending to Long Text Embeddings and 5.3 Analysis of Training Hyperparameters

6 Conclusion and References

A Implementation Details

B Test Set Contamination Analysis

C Prompts for Synthetic Data Generation

D Instructions for Training and Evaluation

D Instructions for Training and Evaluation

We manually write the instructions for the training datasets, which are listed in Table 13. The instructions for the evaluation datasets are listed in Table 14.
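
To make concrete how these task instructions are used, here is a minimal sketch that prepends an instruction to each query using the "Instruct: {task_definition}\nQuery: {query}" template described in the paper; the instruction strings, task keys, and function name below are illustrative placeholders, not the exact wording of Tables 13 and 14.

```python
# Sketch of instruction-formatted queries for training/evaluation.
# The instruction texts below are illustrative; the paper's actual
# per-dataset instructions are given in Tables 13 and 14.

TASK_INSTRUCTIONS = {
    "web_search": "Given a web search query, retrieve relevant passages that answer the query",
    "nli": "Given a premise, retrieve a hypothesis that is entailed by the premise",
}

def build_query(task: str, query: str) -> str:
    """Prepend the task instruction to the query; documents are left unmodified."""
    instruction = TASK_INSTRUCTIONS[task]
    return f"Instruct: {instruction}\nQuery: {query}"

if __name__ == "__main__":
    # Example: format a retrieval query before feeding it to the embedding model.
    print(build_query("web_search", "how do text embedding models use instructions?"))
```

Only the query side is instruction-formatted in this sketch; candidate documents are embedded as-is, so a single document index can serve many tasks.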

This paper is available on arXiv under a CC0 1.0 DEED license.

