Authors:

(1) Liang Wang, Microsoft Corporation, and Correspondence to (wangliang@microsoft.com);

(2) Nan Yang, Microsoft Corporation, and Correspondence to (nanya@microsoft.com);

(3) Xiaolong Huang, Microsoft Corporation;

(4) Linjun Yang, Microsoft Corporation;

(5) Rangan Majumder, Microsoft Corporation;

(6) Furu Wei, Microsoft Corporation and Correspondence to (fuwei@microsoft.com).

Table of Links

Abstract and 1 Introduction
2 Related Work
3 Method
3.1 Synthetic Data Generation
3.2 Training
4 Experiments
4.1 Statistics of the Synthetic Data
4.2 Model Fine-tuning and Evaluation
4.3 Main Results
4.4 Multilingual Retrieval
5 Analysis
5.1 Is Contrastive Pre-training Necessary?
5.2 Extending to Long Text Embeddings and 5.3 Analysis of Training Hyperparameters
6 Conclusion and References
A Implementation Details
B Test Set Contamination Analysis
C Prompts for Synthetic Data Generation
D Instructions for Training and Evaluation

A Implementation Details

The model and dataset release information is available at https://github.com/microsoft/unilm/tree/master/e5.

This paper is available on arxiv under CC0 1.0 DEED license.