This paper is available on arXiv under a CC 4.0 license.
Authors:
(1) Michael Günther, michael.guenther@jina.ai;
(2) Jackmin Ong, jackmin.ong@jina.ai;
(3) Isabelle Mohr, isabelle.mohr@jina.ai;
(4) Alaeddine Abdessalem, alaeddine.abdessalem@jina.ai;
(5) Tanguy Abel, tanguy.abel@jina.ai;
(6) Mohammad Kalim Akram, kalim.akram@jina.ai;
(7) Susana Guzman, susana.guzman@jina.ai;
(8) Georgios Mastrapas, georgios.mastrapas@jina.ai;
(9) Saba Sturua, saba.sturua@jina.ai;
(10) Bo Wang, bo.wang@jina.ai;
(11) Maximilian Werk, maximilian.werk@jina.ai;
(12) Nan Wang, nan.wang@jina.ai;
(13) Han Xiao, han.xiao@jina.ai.
Embedding training has undergone significant evolution, transitioning from foundational techniques such as Latent Semantic Indexing (LSI) [Deerwester et al., 1990] and Latent Dirichlet Allocation (LDA) [Blei et al., 2001] to sophisticated pre-trained models like Sentence-BERT [Reimers and Gurevych, 2019]. A notable shift in recent work is the emphasis on unsupervised contrastive learning, as showcased by [Gao et al., 2022, Wang et al., 2022]. Models like Condenser [Gao and Callan, 2021] and RetroMAE [Xiao et al., 2022] have introduced specialized architectures and pre-training methods designed explicitly for dense encoding and retrieval.
The E5 [Wang et al., 2022], Jina Embeddings v1 [Günther et al., 2023], and GTE [Li et al., 2023] collections of embedding models represent another leap forward. These models adopt a holistic framework for effective training across a wide range of tasks, built on multi-stage contrastive training. An initial unsupervised phase trains on a vast collection of weakly related text pairs sourced from public data, improving the model's domain generalization. A subsequent supervised fine-tuning stage employs a curated set of annotated text triples covering diverse tasks. Together, these sequential stages yield state-of-the-art results on the MTEB benchmark [Muennighoff et al., 2023].
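The contrastive pair-training stage described above is typically implemented with an InfoNCE-style objective over in-batch negatives. The following PyTorch sketch illustrates the general idea under that assumption; the function name, temperature value, and batch layout are illustrative and not taken from any of the cited models' codebases.

```python
import torch
import torch.nn.functional as F

def info_nce_loss(query_emb: torch.Tensor,
                  passage_emb: torch.Tensor,
                  temperature: float = 0.05) -> torch.Tensor:
    """In-batch-negative contrastive loss for (query, passage) pairs.

    query_emb, passage_emb: [batch, dim] embeddings of matched pairs;
    every non-matching passage in the batch serves as a negative.
    """
    # Cosine similarity between every query and every passage in the batch.
    q = F.normalize(query_emb, dim=-1)
    p = F.normalize(passage_emb, dim=-1)
    scores = q @ p.T / temperature          # [batch, batch]

    # The i-th query should rank the i-th passage highest.
    labels = torch.arange(scores.size(0), device=scores.device)
    return F.cross_entropy(scores, labels)

# Example: 8 pairs of 768-dimensional embeddings (random placeholders).
loss = info_nce_loss(torch.randn(8, 768), torch.randn(8, 768))
```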
Yet, despite these advances, a notable limitation persists: the 512-token cap on input sequences inherited from backbone models such as BERT, which is insufficient for encoding documents that often run longer than a page. ALiBi [Press et al., 2022] emerges as a promising solution: it forgoes conventional positional embeddings in favor of linear biases added to the attention scores, facilitating training on sequences exceeding 2048 tokens. However, ALiBi has typically been applied to generative models, whose inherently unidirectional (causal) attention makes the standard formulation less suitable for embedding tasks.
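Concretely, ALiBi adds a fixed, head-specific penalty that grows linearly with the distance between query and key positions to the pre-softmax attention scores, in place of positional embeddings. The sketch below computes such a bias matrix; the symmetric |i - j| distance shown here is one plausible way to adapt the idea to bidirectional encoders and is an assumption, not the exact formulation of any cited work.

```python
import torch

def alibi_bias(seq_len: int, num_heads: int) -> torch.Tensor:
    """Head-specific linear distance penalties, shape [heads, seq, seq].

    Added directly to the pre-softmax attention scores instead of using
    positional embeddings. Slopes follow the geometric schedule of
    Press et al. (2022); the symmetric |i - j| distance is an assumed
    adaptation for bidirectional (encoder-style) attention.
    """
    # Geometric sequence of slopes, one per attention head.
    slopes = torch.tensor([2 ** (-8 * (h + 1) / num_heads)
                           for h in range(num_heads)])
    # Relative distance |i - j| between all pairs of token positions.
    positions = torch.arange(seq_len)
    distance = (positions[None, :] - positions[:, None]).abs()   # [seq, seq]
    # Penalty grows linearly with distance, scaled per head.
    return -slopes[:, None, None] * distance[None, :, :]

# Usage sketch inside an attention layer (q, k of shape [heads, seq, d_head]):
# attention_scores = q @ k.transpose(-2, -1) / d_head ** 0.5 + alibi_bias(seq_len, num_heads)
```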
Effective evaluation remains paramount for embedding models, ensuring they meet the diverse demands of real-world applications. The BEIR benchmark [Thakur et al., 2021] stands out, offering evaluations across a range of retrieval tasks and settings. Similarly, the MTEB benchmark [Muennighoff et al., 2023] highlights the broad applicability of text embeddings, spanning a variety of tasks and languages. However, a notable gap in both benchmarks is their limited coverage of long-document encoding, a critical aspect of comprehensive embedding evaluation.
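For reference, MTEB ships with an open-source evaluation harness (the `mteb` Python package). A minimal usage sketch, assuming a SentenceTransformer-compatible model and a placeholder checkpoint name, might look as follows; the exact API can differ between package versions.

```python
from mteb import MTEB
from sentence_transformers import SentenceTransformer

# Any model exposing encode(list_of_texts) -> embeddings works with the harness;
# the checkpoint below is only a placeholder, not one of the models discussed here.
model = SentenceTransformer("sentence-transformers/all-MiniLM-L6-v2")

# Evaluate on a single retrieval task from the benchmark and write results to disk.
evaluation = MTEB(tasks=["SciFact"])
evaluation.run(model, output_folder="results")
```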