Authors:

(1) Liang Wang, Microsoft Corporation, and Correspondence to (wangliang@microsoft.com);

(2) Nan Yang, Microsoft Corporation, and Correspondence to (nanya@microsoft.com);

(3) Xiaolong Huang, Microsoft Corporation;

(4) Linjun Yang, Microsoft Corporation;

(5) Rangan Majumder, Microsoft Corporation;

(6) Furu Wei, Microsoft Corporation and Correspondence to (fuwei@microsoft.com).

Table of Links

Abstract and 1 Introduction

2 Related Work

3 Method

3.1 Synthetic Data Generation

3.2 Training

4 Experiments

4.1 Statistics of the Synthetic Data

4.2 Model Fine-tuning and Evaluation

4.3 Main Results

4.4 Multilingual Retrieval

5 Analysis

5.1 Is Contrastive Pre-training Necessary?

5.2 Extending to Long Text Embeddings and 5.3 Analysis of Training Hyperparameters

6 Conclusion and References

A Implementation Details

B Test Set Contamination Analysis

C Prompts for Synthetic Data Generation

D Instructions for Training and Evaluation

5.2 Extending to Long Text Embeddings

5.3 Analysis of Training Hyperparameters

Table 4 presents the results under different configurations. We notice that the Mistral-7B initialization holds an advantage over LLaMA-2 7B, in line with the findings from the Mistral-7B technical report [19]. The choice of pooling type and LoRA rank does not affect overall performance substantially, so we adhere to the default setting despite the marginal superiority of LoRA rank 8. On the other hand, the way instructions are added has a considerable impact on performance. We conjecture that natural language instructions better inform the model about the embedding task at hand, and thus enable it to generate more discriminative embeddings. Our framework also provides a way to customize the behavior of text embeddings through instructions without the need to fine-tune the model or rebuild the document index, as illustrated in the sketch below.
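To make the instruction mechanism concrete, the following is a minimal sketch of instruction-conditioned query embedding with last-token pooling. It is not the authors' released code: the checkpoint name, the "Instruct: ... Query: ..." prompt template, and the pooling helper are assumptions based on common practice for instruction-tuned embedding models, and the actual prompts used for each task are listed in Appendix D.

```python
# Minimal sketch (assumptions noted above): embed an instruction-prefixed query
# and a plain passage with a decoder-only embedding model, then score them.
import torch
import torch.nn.functional as F
from transformers import AutoTokenizer, AutoModel

MODEL_NAME = "intfloat/e5-mistral-7b-instruct"  # assumed public checkpoint

def last_token_pool(last_hidden_states: torch.Tensor,
                    attention_mask: torch.Tensor) -> torch.Tensor:
    """Return the hidden state of each sequence's final non-padding token."""
    left_padding = (attention_mask[:, -1].sum() == attention_mask.shape[0])
    if left_padding:
        return last_hidden_states[:, -1]
    seq_lens = attention_mask.sum(dim=1) - 1
    batch_idx = torch.arange(last_hidden_states.shape[0],
                             device=last_hidden_states.device)
    return last_hidden_states[batch_idx, seq_lens]

tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
if tokenizer.pad_token is None:
    tokenizer.pad_token = tokenizer.eos_token
model = AutoModel.from_pretrained(MODEL_NAME)
model.eval()

# The task description is a free-form natural language instruction; changing it
# changes the embedding behavior without retraining or re-indexing documents.
task = "Given a web search query, retrieve relevant passages that answer the query"
query = "how much protein should a female eat"
passage = "As a general guideline, the recommended dietary allowance for protein is ..."

# Only the query carries the instruction; documents are embedded as-is.
texts = [f"Instruct: {task}\nQuery: {query}", passage]
batch = tokenizer(texts, padding=True, truncation=True,
                  max_length=512, return_tensors="pt")

with torch.no_grad():
    outputs = model(**batch)

embeddings = last_token_pool(outputs.last_hidden_state, batch["attention_mask"])
embeddings = F.normalize(embeddings, p=2, dim=1)

# Cosine similarity between the query and passage embeddings.
print((embeddings[0] @ embeddings[1]).item())
```

Because the instruction is part of the query text rather than the model weights or the document index, switching tasks (for example, from web search to duplicate-question detection) only requires changing the task string at query time.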
This paper is available on arxiv under CC0 1.0 DEED license.