Improving Text Embeddings with Large Language Models: Test Set Contamination Analysis

by Auto Encoder: How to Ignore the Signal Noise
October 9th, 2024
Too Long; Didn't Read

This paper introduces a novel method for generating high-quality text embeddings using synthetic data, achieving state-of-the-art results with minimal training.
STORY’S CREDIBILITY

Academic Research Paper

Part of HackerNoon's growing list of open-source research papers, promoting free access to academic material.

Authors:

(1) Liang Wang, Microsoft Corporation (correspondence: wangliang@microsoft.com);

(2) Nan Yang, Microsoft Corporation (correspondence: nanya@microsoft.com);

(3) Xiaolong Huang, Microsoft Corporation;

(4) Linjun Yang, Microsoft Corporation;

(5) Rangan Majumder, Microsoft Corporation;

(6) Furu Wei, Microsoft Corporation (correspondence: fuwei@microsoft.com).

Abstract and 1 Introduction

2 Related Work

3 Method

3.1 Synthetic Data Generation

3.2 Training

4 Experiments

4.1 Statistics of the Synthetic Data

4.2 Model Fine-tuning and Evaluation

4.3 Main Results

4.4 Multilingual Retrieval

5 Analysis

5.1 Is Contrastive Pre-training Necessary?

5.2 Extending to Long Text Embeddings and 5.3 Analysis of Training Hyperparameters

6 Conclusion and References

A Implementation Details

B Test Set Contamination Analysis

C Prompts for Synthetic Data Generation

D Instructions for Training and Evaluation

B Test Set Contamination Analysis

To assess test set contamination across all datasets in the MTEB benchmark, we perform a string-match-based analysis between each test set and our training set, disregarding differences in character case and spacing. We categorize the train-test overlaps into three types:


• Low-entropy texts. These are texts such as “i need a coffee” and “what does that mean”. We do not consider them contamination because they are common expressions that can occur in many contexts.


• Question overlap. We identify 4 test set questions in the DBPedia dataset that also appear in the TriviaQA training set. Because they make up only a minor portion of the test set, their impact on overall performance is insignificant.


• Retrieval corpus overlap. Several retrieval datasets share the same retrieval corpus. For instance, the DBPedia, NQ, and TriviaQA datasets all use Wikipedia passages, even though their query sets are different. This is a standard evaluation practice in the field of information retrieval, and we do not regard it as contamination.


In summary, we did not detect substantial contamination risks that could alter the main findings of this paper.
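The string-match check described above can be sketched as follows. This is a minimal illustration, not the authors' code: the function names `normalize` and `find_overlaps` are hypothetical, and the sketch assumes that "disregarding differences in character case and spacing" means lowercasing and removing all whitespace before comparing (the text does not say whether whitespace is removed or merely collapsed).

```python
import re


def normalize(text: str) -> str:
    """Lowercase and drop all whitespace (assumed normalization, see lead-in)."""
    return re.sub(r"\s+", "", text.lower())


def find_overlaps(train_texts, test_texts):
    """Return the test texts whose normalized form also occurs in the training set."""
    train_keys = {normalize(t) for t in train_texts}
    return [t for t in test_texts if normalize(t) in train_keys]


# Toy data for illustration only; not taken from MTEB or the training set.
train = ["I need a coffee", "Who wrote  Hamlet?"]
test = ["i need a  coffee", "who wrote hamlet?", "What is nDCG?"]

overlaps = find_overlaps(train, test)
print(overlaps)
```

An exact match found this way still has to be sorted by hand into one of the three categories above; the sketch only surfaces candidates.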


Another aspect to consider is possible test set contamination in the training data of Mistral-7B and GPT-4. Because the training data of these models is not publicly accessible, it is difficult to estimate the degree of such contamination. Given their widespread use in the research community, we believe the comparison remains valid as long as other works also employ these models.


Table 5: nDCG@10 and Recall@100 on the dev set of the MIRACL dataset for all 18 languages.




Table 7: Prompt template for the short-long matching subgroup. For placeholders, “{query_type}” ∈ {extremely long-tail, long-tail, common}, “{query_length}” ∈ {less than 5 words, 5 to 15 words, at least 10 words}, “{difficulty}” ∈ {high school, college, PhD}, “{clarity}” ∈ {clear, understandable with some effort, ambiguous}, “{num_words}” ∈ {50, 100, 200, 300, 400, 500}.
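As a rough illustration of how the placeholder sets in the Table 7 caption could be used, the sketch below draws one value per placeholder and substitutes it into a template. The value sets are copied from the caption. Everything else is an assumption: the caption does not say how values are chosen, so uniform random sampling is assumed here, and `TEMPLATE` is a hypothetical stand-in because the actual prompt text is not reproduced in this section.

```python
import random

# Value sets copied from the Table 7 caption.
PLACEHOLDER_VALUES = {
    "query_type": ["extremely long-tail", "long-tail", "common"],
    "query_length": ["less than 5 words", "5 to 15 words", "at least 10 words"],
    "difficulty": ["high school", "college", "PhD"],
    "clarity": ["clear", "understandable with some effort", "ambiguous"],
    "num_words": [50, 100, 200, 300, 400, 500],
}

# Hypothetical stand-in; the real prompt template is not shown in this section.
TEMPLATE = (
    "query_type={query_type}; query_length={query_length}; "
    "difficulty={difficulty}; clarity={clarity}; num_words={num_words}"
)


def sample_placeholders(rng: random.Random) -> dict:
    """Pick one value per placeholder (uniform sampling is an assumption)."""
    return {name: rng.choice(values) for name, values in PLACEHOLDER_VALUES.items()}


rng = random.Random(0)  # fixed seed so the sketch is repeatable
values = sample_placeholders(rng)
prompt = TEMPLATE.format(**values)
print(prompt)
```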


This paper is available on arXiv under the CC0 1.0 Deed license.

