
JINA EMBEDDINGS 2: 8192-Token General-Purpose Text Embeddings for Long Documents: Related Work


Too Long; Didn't Read

Text embedding models have emerged as powerful tools for transforming sentences into fixed-size feature vectors that encapsulate semantic information.

This paper is available on arXiv under a CC 4.0 license.

Authors:

(1) Michael Günther, michael.guenther@jina.ai;

(2) Jackmin Ong, jackmin.ong@jina.ai;

(3) Isabelle Mohr, isabelle.mohr@jina.ai;

(4) Alaeddine Abdessalem, alaeddine.abdessalem@jina.ai;

(5) Tanguy Abel, tanguy.abel@jina.ai;

(6) Mohammad Kalim Akram, kalim.akram@jina.ai;

(7) Susana Guzman, susana.guzman@jina.ai;

(8) Georgios Mastrapas, georgios.mastrapas@jina.ai;

(9) Saba Sturua, saba.sturua@jina.ai;

(10) Bo Wang, bo.wang@jina.ai;

(11) Maximilian Werk, maximilian.werk@jina.ai;

(12) Nan Wang, nan.wang@jina.ai;

(13) Han Xiao, han.xiao@jina.ai.

Embedding training has undergone significant evolution, transitioning from foundational techniques such as Latent Semantic Indexing (LSI) [Deerwester et al., 1990] and Latent Dirichlet Allocation (LDA) [Blei et al., 2001] to sophisticated pre-trained models like Sentence-BERT [Reimers and Gurevych, 2019]. A notable shift in recent advancements is the emphasis on unsupervised contrastive learning, as showcased by works such as [Gao et al., 2022, Wang et al., 2022]. Pioneering models like Condenser [Gao and Callan, 2021] and RetroMAE [Xiao et al., 2022] have introduced specialized architectures and pre-training methods explicitly designed for dense encoding and retrieval.


The E5 [Wang et al., 2022], Jina Embeddings v1 [Günther et al., 2023], and GTE [Li et al., 2023] collections of embedding models represent another leap forward. These models propose a holistic framework tailored for effective training across a wide range of tasks. The framework adopts a multi-stage contrastive training approach: an initial phase performs unsupervised training on a vast collection of weakly related text pairs sourced from public data, improving the model's domain generalization, and a subsequent supervised fine-tuning stage employs a curated set of annotated text triples representing diverse tasks. Together, these sequential stages yield state-of-the-art results on the MTEB benchmark [Muennighoff et al., 2023].
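
The two stages can be summarized by their training objectives. The sketch below is a hypothetical, simplified PyTorch rendering (function names and the temperature value are illustrative, not the authors' implementation): stage one applies an InfoNCE-style loss over weakly related pairs with in-batch negatives, while stage two appends one annotated hard negative per triple.

# Hypothetical sketch of the two contrastive objectives (illustrative names
# and temperature; not the authors' implementation).
import torch
import torch.nn.functional as F


def pairwise_infonce(query_emb, pos_emb, temperature=0.05):
    # Stage 1: weakly supervised pairs with in-batch negatives.
    # query_emb, pos_emb: (batch, dim) L2-normalized embeddings; row i of
    # pos_emb is the positive for row i of query_emb, all other rows act
    # as negatives.
    scores = query_emb @ pos_emb.T / temperature
    labels = torch.arange(scores.size(0), device=scores.device)
    return F.cross_entropy(scores, labels)


def triplet_infonce(query_emb, pos_emb, neg_emb, temperature=0.05):
    # Stage 2: annotated (query, positive, hard negative) triples; the hard
    # negative is appended as one extra column of the score matrix.
    in_batch = query_emb @ pos_emb.T                        # (batch, batch)
    hard_neg = (query_emb * neg_emb).sum(-1, keepdim=True)  # (batch, 1)
    scores = torch.cat([in_batch, hard_neg], dim=-1) / temperature
    labels = torch.arange(query_emb.size(0), device=query_emb.device)
    return F.cross_entropy(scores, labels)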


Yet, despite such advancements, a glaring limitation persists: the 512-token cap on input sequences inherited from foundational models like BERT. This cap is insufficient for encoding lengthy documents, which often exceed a page. ALiBi [Press et al., 2022] emerges as a promising solution: instead of conventional positional embeddings, it adds a linear, distance-dependent bias to the attention scores, allowing models to handle sequences exceeding 2048 tokens. Notably, its typical application has centered on generative models, whose attention is inherently unidirectional, rendering the standard formulation less suitable for embedding tasks.
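
To make the mechanism concrete, the following is an illustrative re-implementation of the ALiBi bias. The slope schedule follows Press et al. for power-of-two head counts; the function names and the symmetric, bidirectional variant included for comparison are simplifications for exposition, not the original code.

# Illustrative re-implementation of the ALiBi attention bias; simplified,
# assumes the head count is a power of two.
import torch


def alibi_slopes(num_heads):
    # Geometric slope schedule from Press et al.: 2^(-8/n), 2^(-16/n), ...
    start = 2.0 ** (-8.0 / num_heads)
    return torch.tensor([start ** (i + 1) for i in range(num_heads)])


def alibi_bias(seq_len, num_heads, causal=True):
    # Returns a (num_heads, seq_len, seq_len) bias added to the attention
    # logits before the softmax; it penalizes attention linearly in the
    # query-key distance.
    pos = torch.arange(seq_len)
    if causal:
        # Generative setting: only keys at or before the query are biased;
        # future positions are handled by the usual causal mask.
        dist = (pos[None, :] - pos[:, None]).clamp(max=0)
    else:
        # Bidirectional variant (as an encoder would need): symmetric distance.
        dist = -(pos[None, :] - pos[:, None]).abs()
    slopes = alibi_slopes(num_heads)          # (num_heads,)
    return slopes[:, None, None] * dist       # (num_heads, seq_len, seq_len)


# Usage: scores = q @ k.transpose(-2, -1) / head_dim ** 0.5
#        scores = scores + alibi_bias(seq_len, num_heads)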


Effective evaluation remains paramount for embedding models, ensuring they meet the diverse demands of real-world applications. The BEIR benchmark [Thakur et al., 2021] stands out, offering evaluations across a diverse set of retrieval tasks and settings. Similarly, the MTEB benchmark [Muennighoff et al., 2023] highlights the extensive applicability of text embeddings, spanning a variety of tasks and languages. However, a notable gap in both benchmarks is their limited focus on encoding long documents, a critical aspect of comprehensive embedding evaluation.
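
For reference, evaluating an embedding model on MTEB typically takes only a few lines with the open-source mteb package. The snippet below is a sketch assuming the package's documented interface; the exact API may differ between versions, and the model and task names are placeholders.

# Sketch of an MTEB evaluation run; model and task names are placeholders.
from mteb import MTEB
from sentence_transformers import SentenceTransformer

# Any encoder exposing an `encode` method works.
model = SentenceTransformer("sentence-transformers/all-MiniLM-L6-v2")

# Evaluate on a single retrieval task; pass more task names to extend the run.
evaluation = MTEB(tasks=["SciFact"])
results = evaluation.run(model, output_folder="results/all-MiniLM-L6-v2")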