Table of Links

Abstract and 1 Introduction
2 Background and Related Work
2.1 Web Scale Information Retrieval
2.2 Existing Datasets
3 MS MARCO Web Search Dataset and 3.1 Document Preparation
3.2 Query Selection and Labeling
3.3 Dataset Analysis
3.4 New Challenges Raised by MS MARCO Web Search
4 Benchmark Results and 4.1 Environment Setup
4.2 Baseline Methods
4.3 Evaluation Metrics
4.4 Evaluation of Embedding Models and 4.5 Evaluation of ANN Algorithms
4.6 Evaluation of End-to-end Performance
5 Potential Biases and Limitations
6 Future Work and Conclusions, and References

4.2 Baseline Methods

4.2.1 State-of-the-art Embedding Models. We take the following three state-of-the-art multi-lingual models as the initial baseline models:

• DPR [27] is based on a BERT pre-trained model and a dual-encoder architecture, whose embeddings are optimized to maximize the inner-product score between a query and its relevant passage (as sketched below). Negative training examples are selected from BM25-retrieved documents.

• ANCE [54] improves embedding-based retrieval performance by selecting hard negative training examples from the entire document set, using an asynchronously updated approximate nearest neighbor index.

• SimANS [57] avoids overfitting to false negatives by sampling ambiguous negatives rather than the hardest ones.

We hope the MS MARCO Web Search dataset will establish itself as a new standard benchmark for retrieval and attract more baseline models for evaluation.

4.2.2 State-of-the-art ANN Algorithms. For embedding retrieval, we choose the state-of-the-art disk-based ANN algorithms DiskANN [22] and SPANN [11] as the baselines.

DiskANN is the first disk-based ANN algorithm that can serve billion-scale vector search effectively at low resource cost. It adopts a neighborhood-graph solution that stores the graph and the full-precision vectors on disk, while keeping compressed vectors (e.g., via Product Quantization [23]) and some pivot points in memory. During search, a query follows the best-first traversal principle, starting from a fixed entry point in the graph (see the traversal sketch below).

SPANN adopts a hierarchical balanced clustering technique to quickly partition the whole dataset into a large number of posting lists, which are kept on disk; only the centroids of the posting lists are stored in memory. To speed up search, SPANN uses this in-memory index to quickly navigate the query to the closest centroids, then loads the corresponding posting lists from disk into memory for further fine-grained search (see the two-stage sketch below). With techniques such as posting-list expansion under a maximum-length constraint, boundary-vector replication, and query-aware dynamic pruning, it achieves state-of-the-art memory cost, result quality, and search latency on multiple billion-scale datasets.
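To make the dual-encoder scoring above concrete, here is a minimal sketch. The `encode` stub and its dimensions are placeholders standing in for DPR's BERT query and passage towers, not the models' actual code; only the inner-product scoring step reflects the method described above.

```python
import numpy as np

# Minimal dual-encoder scoring sketch (illustrative; DPR's towers are BERT
# encoders, replaced here with a stub so the example runs standalone).
DIM = 768  # typical BERT hidden size

def encode(texts, seed):
    # Stub encoder standing in for a BERT tower; returns one vector per text.
    rng = np.random.default_rng(seed)
    return rng.normal(size=(len(texts), DIM)).astype(np.float32)

queries = ["who wrote hamlet"]
passages = ["Hamlet is a tragedy by William Shakespeare ...",
            "BM25 is a bag-of-words ranking function ..."]

q = encode(queries, seed=1)    # query tower output
p = encode(passages, seed=2)   # passage tower output

# DPR scores a (query, passage) pair by the inner product of their embeddings;
# training maximizes this score for relevant pairs against negatives.
scores = q @ p.T               # (num_queries, num_passages)
print(np.argsort(-scores[0]))  # passages ranked by descending score
```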
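The best-first traversal that DiskANN performs can be sketched as follows. This toy version keeps everything in memory and uses a random graph; the `graph` structure, the candidate-set size `L`, and the `greedy_search` function are illustrative stand-ins, not DiskANN's API, and the PQ-based distance estimation and disk layout are not reproduced.

```python
import numpy as np

# Illustrative best-first traversal over a neighborhood graph, in the spirit
# of DiskANN's search loop (the real system keeps the graph and full-precision
# vectors on disk, with PQ-compressed vectors in memory).
rng = np.random.default_rng(0)
vectors = rng.normal(size=(200, 16)).astype(np.float32)
# Toy graph: each node links to a few random nodes; a real index builds a
# carefully pruned proximity graph instead.
graph = {i: rng.choice(200, size=6, replace=False).tolist() for i in range(200)}

def greedy_search(entry, query, L=16, k=5):
    dist = lambda i: float(((vectors[i] - query) ** 2).sum())
    cands = [(dist(entry), entry)]  # candidate set, kept at size <= L
    expanded = set()
    while True:
        unexpanded = [(d, n) for d, n in cands if n not in expanded]
        if not unexpanded:
            break                   # every candidate in the beam is expanded
        d, node = min(unexpanded)   # best-first: expand the closest candidate
        expanded.add(node)
        known = {n for _, n in cands}
        cands += [(dist(nb), nb) for nb in graph[node] if nb not in known]
        cands = sorted(cands)[:L]   # prune back to the L best candidates
    return [n for _, n in cands[:k]]

print(greedy_search(entry=0, query=rng.normal(size=16).astype(np.float32)))
```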
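Likewise, SPANN's two-stage search can be illustrated with a flat clustering sketch. The `num_lists` and `n_probe` parameters and the `posting_lists` dictionary are assumptions for illustration; SPANN's hierarchical balanced clustering, boundary-vector replication, and query-aware pruning are not reproduced here.

```python
import numpy as np

# Illustrative SPANN-style coarse-to-fine search (a sketch, not SPANN's code).
rng = np.random.default_rng(0)
vectors = rng.normal(size=(5000, 64)).astype(np.float32)

# Offline: pick centroids and assign every vector to its nearest one. Only the
# centroids stay in memory; the posting lists would live on disk.
num_lists = 64
centroids = vectors[rng.choice(len(vectors), num_lists, replace=False)]
d2 = ((vectors ** 2).sum(1, keepdims=True)
      - 2 * vectors @ centroids.T
      + (centroids ** 2).sum(1))    # squared distances, (5000, 64)
assign = d2.argmin(1)
posting_lists = {c: np.where(assign == c)[0] for c in range(num_lists)}

def search(query, n_probe=4, k=10):
    # Stage 1 (in memory): navigate the query to the closest centroids.
    nearest = np.argsort(((centroids - query) ** 2).sum(1))[:n_probe]
    # Stage 2 ("disk"): load only those posting lists for fine-grained search.
    cand = np.concatenate([posting_lists[c] for c in nearest])
    d = ((vectors[cand] - query) ** 2).sum(1)
    return cand[np.argsort(d)[:k]]

print(search(rng.normal(size=64).astype(np.float32)))
```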
4.2.3 End-to-end Retrieval Systems. BM25 [42] is the most widely used ranking function in web information retrieval for estimating the relevance score of a document given a query. It ranks a set of documents under a probabilistic retrieval framework and has been integrated into Elasticsearch to serve a wide range of search scenarios.
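For reference, a standard Okapi BM25 scorer looks like the sketch below. The `k1` and `b` defaults and the tokenized toy corpus are illustrative choices, and this is not Elasticsearch's exact implementation (Lucene applies its own IDF variant and normalizations).

```python
import math
from collections import Counter

# Minimal Okapi BM25 sketch: docs are pre-tokenized lists of terms.
def bm25_scores(query_terms, docs, k1=1.2, b=0.75):
    N = len(docs)
    avgdl = sum(len(d) for d in docs) / N
    # Document frequency of each query term across the collection.
    df = {t: sum(1 for d in docs if t in d) for t in query_terms}
    # Smoothed IDF; the +1 keeps it non-negative for very common terms.
    idf = {t: math.log((N - df[t] + 0.5) / (df[t] + 0.5) + 1)
           for t in query_terms}
    scores = []
    for d in docs:
        tf = Counter(d)
        # Term-frequency saturation (k1) and length normalization (b).
        s = sum(idf[t] * tf[t] * (k1 + 1)
                / (tf[t] + k1 * (1 - b + b * len(d) / avgdl))
                for t in query_terms)
        scores.append(s)
    return scores

docs = [["bm25", "ranking", "function"], ["dual", "encoder", "retrieval"]]
print(bm25_scores(["bm25", "ranking"], docs))
```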
Authors:

(1) Qi Chen, Microsoft, Beijing, China;
(2) Xiubo Geng, Microsoft, Beijing, China;
(3) Corby Rosset, Microsoft, Redmond, United States;
(4) Carolyn Buractaon, Microsoft, Redmond, United States;
(5) Jingwen Lu, Microsoft, Redmond, United States;
(6) Tao Shen, University of Technology Sydney, Sydney, Australia and the work was done at Microsoft;
(7) Kun Zhou, Microsoft, Beijing, China;
(8) Chenyan Xiong, Carnegie Mellon University, Pittsburgh, United States and the work was done at Microsoft;
(9) Yeyun Gong, Microsoft, Beijing, China;
(10) Paul Bennett, Spotify, New York, United States and the work was done at Microsoft;
(11) Nick Craswell, Microsoft, Redmond, United States;
(12) Xing Xie, Microsoft, Beijing, China;
(13) Fan Yang, Microsoft, Beijing, China;
(14) Bryan Tower, Microsoft, Redmond, United States;
(15) Nikhil Rao, Microsoft, Mountain View, United States;
(16) Anlei Dong, Microsoft, Mountain View, United States;
(17) Wenqi Jiang, ETH Zürich, Zürich, Switzerland;
(18) Zheng Liu, Microsoft, Beijing, China;
(19) Mingqin Li, Microsoft, Redmond, United States;
(20) Chuanjie Liu, Microsoft, Beijing, China;
(21) Zengzhong Li, Microsoft, Redmond, United States;
(22) Rangan Majumder, Microsoft, Redmond, United States;
(23) Jennifer Neville, Microsoft, Redmond, United States;
(24) Andy Oakley, Microsoft, Redmond, United States;
(25) Knut Magne Risvik, Microsoft, Oslo, Norway;
(26) Harsha Vardhan Simhadri, Microsoft, Bengaluru, India;
(27) Manik Varma, Microsoft, Bengaluru, India;
(28) Yujing Wang, Microsoft, Beijing, China;
(29) Linjun Yang, Microsoft, Redmond, United States;
(30) Mao Yang, Microsoft, Beijing, China;
(31) Ce Zhang, ETH Zürich, Zürich, Switzerland and the work was done at Microsoft.

This paper is available on arxiv under CC BY 4.0 DEED license.