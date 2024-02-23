
    JINA EMBEDDINGS 2: 8192-Token General-Purpose Text Embeddings for Long Documents: Related Work

    by EScholar: Electronic Academic Papers for Scholars, February 23rd, 2024
    Text embedding models have emerged as powerful tools for transforming sentences into fixed-size feature vectors that encapsulate semantic information.
    This paper is available on arXiv under a CC 4.0 license.

    Embedding training has evolved considerably, from foundational techniques such as Latent Semantic Indexing (LSI) [Deerwester et al., 1990] and Latent Dirichlet Allocation (LDA) [Blei et al., 2001] to pre-trained models like Sentence-BERT [Reimers and Gurevych, 2019]. A notable shift in recent work is the emphasis on unsupervised contrastive learning, as showcased by [Gao et al., 2022, Wang et al., 2022]. Models like Condenser [Gao and Callan, 2021] and RetroMAE [Xiao et al., 2022] have introduced specialized architectures and pretraining methods explicitly designed for dense encoding and retrieval.


    The E5 [Wang et al., 2022], Jina Embeddings v1 [Günther et al., 2023], and GTE [Li et al., 2023] collections of embedding models represent another leap forward. These models propose a holistic framework for effective training across a wide range of tasks, built on a multi-stage contrastive training approach. An initial phase performs unsupervised training on a vast collection of weak pairs sourced from public data, which improves the model’s domain generalization. A supervised fine-tuning stage then uses a curated set of annotated text triples representing diverse tasks. Together, these sequential stages yield state-of-the-art results on the MTEB benchmark.
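
    The two stages described above can be sketched with two small loss functions. This is a minimal illustration under stated assumptions, not the objective used by any of the cited models: the function names, the temperature of 0.05, the use of in-batch negatives for weak pairs, and the reading of a "triple" as (anchor, positive, negative) are all choices made here for the example.

```python
import numpy as np


def l2_normalize(x):
    """Scale each row to unit length so dot products become cosine similarities."""
    return x / np.linalg.norm(x, axis=-1, keepdims=True)


def info_nce_loss(queries, targets, temperature=0.05):
    """Stage 1 sketch (assumed form): in-batch contrastive loss over weak pairs.

    Row i of `targets` is the positive for row i of `queries`;
    every other row in the batch serves as a negative.
    """
    q = l2_normalize(queries)
    t = l2_normalize(targets)
    logits = q @ t.T / temperature                       # (batch, batch)
    logits = logits - logits.max(axis=1, keepdims=True)  # numerical stability
    log_probs = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    return float(-np.mean(np.diag(log_probs)))


def triplet_nce_loss(anchors, positives, negatives, temperature=0.05):
    """Stage 2 sketch (assumed form): each anchor has one annotated positive
    and one annotated negative, scored with a two-way softmax."""
    a = l2_normalize(anchors)
    p = l2_normalize(positives)
    n = l2_normalize(negatives)
    pos = np.sum(a * p, axis=1) / temperature
    neg = np.sum(a * n, axis=1) / temperature
    # -log(exp(pos) / (exp(pos) + exp(neg))) == log(1 + exp(neg - pos))
    return float(np.mean(np.logaddexp(0.0, neg - pos)))


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    emb = rng.normal(size=(8, 16))  # stand-in for encoder outputs
    print("aligned pairs:   ", info_nce_loss(emb, emb))
    print("misaligned pairs:", info_nce_loss(emb, np.roll(emb, 1, axis=0)))
    print("good triples:    ", triplet_nce_loss(emb, emb, -emb))
```

    In a real pipeline the random arrays would be replaced by encoder outputs, and both losses would be minimized with respect to the encoder's parameters, first on the weak pairs and then on the annotated triples.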


    Yet, despite such advancements, a glaring limitation persists: the 512-token constraint on input sequences, inherited from foundational models like BERT. This cap is insufficient for encoding lengthy documents, which often exceed a page. ALiBi [Press et al., 2022] is a promising solution, a technique that sidesteps conventional positional embeddings and facilitates training on sequences exceeding 2048 tokens. However, it has typically been applied to generative models, where the bias is inherently unidirectional, which makes it less suitable for embedding tasks.
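
    To make the unidirectional bias concrete, the sketch below builds the additive attention bias that ALiBi uses in place of positional embeddings, in its causal form: each head penalizes a key linearly in its distance behind the query and masks all future positions. The slopes follow the geometric sequence from Press et al. for a power-of-two head count. The function names are hypothetical, and `symmetric_alibi_bias` is an assumption added here purely for contrast, showing one way a bias could treat both directions equally; the text above does not describe it.

```python
import numpy as np


def alibi_slopes(num_heads):
    """Per-head slopes: a geometric sequence starting at 2^(-8/num_heads).

    Assumes num_heads is a power of two, e.g. 8 heads give 1/2, 1/4, ..., 1/256.
    """
    start = 2.0 ** (-8.0 / num_heads)
    return np.array([start ** (h + 1) for h in range(num_heads)])


def causal_alibi_bias(seq_len, num_heads):
    """Unidirectional bias of shape (heads, query, key), added to attention scores."""
    pos = np.arange(seq_len)
    distance = pos[:, None] - pos[None, :]            # i - j; negative for future keys
    slopes = alibi_slopes(num_heads)[:, None, None]
    bias = -slopes * distance                         # linear penalty on past keys
    return np.where(distance < 0, -np.inf, bias)      # future keys are masked out


def symmetric_alibi_bias(seq_len, num_heads):
    """Illustrative bidirectional variant (an assumption of this sketch):
    penalize by absolute distance |i - j| and mask nothing."""
    pos = np.arange(seq_len)
    distance = np.abs(pos[:, None] - pos[None, :])
    slopes = alibi_slopes(num_heads)[:, None, None]
    return -slopes * distance


if __name__ == "__main__":
    print(causal_alibi_bias(4, 8)[0])     # lower-triangular penalties, -inf above
    print(symmetric_alibi_bias(4, 8)[0])  # same penalties mirrored on both sides
```

    In the causal matrix every token can attend only to earlier tokens, so the representation of a token never reflects what follows it. That suits left-to-right generation but not an encoder that should summarize a whole document.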


    Effective evaluation remains paramount for embedding models, ensuring they meet the diverse demands of real-world applications. The BEIR benchmark [Thakur et al., 2021] stands out, offering evaluations across a set of retrieval tasks and settings. Similarly, the MTEB benchmark [Muennighoff et al., 2023] highlights the extensive applicability of text embeddings, spanning a variety of tasks and languages. However, a notable gap in both benchmarks is their limited focus on encoding long documents, a critical aspect for comprehensive embedding evaluation.

