Authors:
(1) Liang Wang, Microsoft Corporation, and correspondence to ([email protected]);
(2) Nan Yang, Microsoft Corporation, and correspondence to ([email protected]);
(3) Xiaolong Huang, Microsoft Corporation;
(4) Linjun Yang, Microsoft Corporation;
(5) Rangan Majumder, Microsoft Corporation;
(6) Furu Wei, Microsoft Corporation, and correspondence to ([email protected]).
3 Method
4 Experiments
4.1 Statistics of the Synthetic Data
4.2 Model Fine-tuning and Evaluation
5 Analysis
5.1 Is Contrastive Pre-training Necessary?
5.2 Extending to Long Text Embeddings and 5.3 Analysis of Training Hyperparameters
B Test Set Contamination Analysis
C Prompts for Synthetic Data Generation
D Instructions for Training and Evaluation
For asymmetric tasks, we list the four prompt templates in Tables 7, 8, 9, and 10. For symmetric tasks, the prompt templates are available in Tables 11 and 12. To generate multilingual data, we sample the value of “{language}” from the language list of XLM-R [7], assigning higher probability to high-resource languages. When prompting GPT-4/3.5, we set the temperature to 1.0 and the top-p to 1.0, which is higher than the default setting, to encourage more diversity.
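The language sampling and decoding settings described above can be sketched as follows. This is a minimal illustration rather than the authors' implementation: the language names, their weights, the example template, and the helper names (`sample_language`, `fill_template`, `build_request`) are all assumptions, since the text states only that high-resource languages are sampled with higher probability. Only the temperature and top-p values come from the text.

```python
import random

# Hypothetical weights for illustration only. The text says high-resource
# languages get higher probability but does not give the distribution.
LANGUAGE_WEIGHTS = {
    "English": 10.0,
    "Chinese": 5.0,
    "Spanish": 5.0,
    "Swahili": 1.0,
    "Nepali": 1.0,
}

# Decoding settings stated in the text.
SAMPLING_PARAMS = {"temperature": 1.0, "top_p": 1.0}


def sample_language(rng):
    """Draw one language name, favoring the higher-weighted entries."""
    languages = list(LANGUAGE_WEIGHTS)
    weights = [LANGUAGE_WEIGHTS[name] for name in languages]
    return rng.choices(languages, weights=weights, k=1)[0]


def fill_template(template, language):
    """Substitute the {language} placeholder.

    str.replace is used instead of str.format so that any other braces
    in the template (for example a JSON output spec) are left untouched.
    """
    return template.replace("{language}", language)


def build_request(template, rng):
    """Assemble a prompt plus decoding settings for one generation call."""
    language = sample_language(rng)
    request = {"prompt": fill_template(template, language), "language": language}
    request.update(SAMPLING_PARAMS)
    return request


if __name__ == "__main__":
    rng = random.Random(0)
    # Hypothetical template, not one of the paper's prompt tables.
    template = "Write a search query and a relevant passage in {language}."
    print(build_request(template, rng))
```

The resulting dictionary would be passed to whichever LLM client is in use; the client call itself is omitted here because the text does not specify it.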
This paper is available on arXiv under the CC0 1.0 DEED license.