
    JINA EMBEDDINGS 2: 8192-Token General-Purpose Text Embeddings for Long Documents: Training Process

    by EScholar: Electronic Academic Papers for Scholars, February 23rd, 2024

    Too Long; Didn't Read

    Text embedding models have emerged as powerful tools for transforming sentences into fixed-size feature vectors that encapsulate semantic information.

    This paper is available on arXiv under a CC 4.0 license.

    Authors:

    (1) Michael Günther, michael.guenther@jina.ai;

    (2) Jackmin Ong, jackmin.ong@jina.ai;

    (3) Isabelle Mohr, isabelle.mohr@jina.ai;

    (4) Alaeddine Abdessalem, alaeddine.abdessalem@jina.ai;

    (5) Tanguy Abel, tanguy.abel@jina.ai;

    (6) Mohammad Kalim Akram, kalim.akram@jina.ai;

    (7) Susana Guzman, susana.guzman@jina.ai;

    (8) Georgios Mastrapas, georgios.mastrapas@jina.ai;

    (9) Saba Sturua, saba.sturua@jina.ai;

    (10) Bo Wang, bo.wang@jina.ai;

    (11) Maximilian Werk, maximilian.werk@jina.ai;

    (12) Nan Wang, nan.wang@jina.ai;

    (13) Han Xiao, han.xiao@jina.ai.

    3 Training Process Overview

    The training process for Jina Embeddings v2 is divided into three stages:


    I Pre-training the Backbone: For the backbone pre-training, we design a modified BERT model capable of encoding documents with up to 8192 tokens. This model is trained on a full-text corpus using a masked language modeling objective.


    II First Fine-tuning with Text Pairs: To encode a text passage into a single vector representation, the model is fine-tuned in an unsupervised manner on text pairs.


    III Second Fine-tuning with Hard Negatives: The model is further fine-tuned using text pairs complemented with hard negatives. This stage is crucial for enabling the model to better distinguish between relevant passages and related, but irrelevant text passages.


    Table 1: Architecture specifications for the Jina BERT models of varying sizes. The number of attention heads is selected to ensure a consistent head dimension of 64.
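The Table 1 caption fixes the head dimension at 64, which means the number of attention heads follows directly from the hidden size. The sketch below illustrates that arithmetic only. The function name `num_attention_heads` and the hidden sizes in the loop are assumptions chosen for illustration; they are not taken from the table, whose body is not reproduced here.

```python
HEAD_DIM = 64  # head dimension stated in the Table 1 caption


def num_attention_heads(hidden_size: int, head_dim: int = HEAD_DIM) -> int:
    """Return the head count that gives every head `head_dim` dimensions."""
    if hidden_size % head_dim != 0:
        raise ValueError("hidden_size must be a multiple of head_dim")
    return hidden_size // head_dim


# Hypothetical hidden sizes, used only to show the relationship.
for hidden_size in (512, 768, 1024):
    print(hidden_size, num_attention_heads(hidden_size))
```

Keeping the head dimension constant means that larger models gain more heads rather than wider heads.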


    While both stages II and III are geared towards training the models for embedding tasks, the latter is especially critical for improving the model’s performance in retrieval and classification tasks (refer to Section 6.2).
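To make the difference between stages II and III concrete, the following is a minimal sketch of a contrastive objective with and without hard negatives. This section does not name the loss function, so the InfoNCE-style formulation, the cosine similarity, the temperature value, and the names `cosine` and `info_nce` are all assumptions made for illustration, not the authors' implementation.

```python
import math
from typing import List, Optional, Sequence

Vector = Sequence[float]


def cosine(u: Vector, v: Vector) -> float:
    """Cosine similarity of two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def info_nce(
    queries: List[Vector],
    positives: List[Vector],
    hard_negatives: Optional[List[List[Vector]]] = None,
    temperature: float = 0.05,  # assumed value, not from the source
) -> float:
    """Mean InfoNCE-style loss over a batch of text-pair embeddings.

    For query i, positives[i] is the target and every other positive acts
    as an in-batch negative. If hard_negatives is given, hard_negatives[i]
    adds extra negative candidates for query i only.
    """
    total = 0.0
    for i, query in enumerate(queries):
        candidates = list(positives)
        if hard_negatives is not None:
            candidates.extend(hard_negatives[i])
        logits = [cosine(query, c) / temperature for c in candidates]
        # Numerically stable log-sum-exp over all candidates.
        peak = max(logits)
        log_z = peak + math.log(sum(math.exp(x - peak) for x in logits))
        total += log_z - logits[i]  # negative log-probability of the target
    return total / len(queries)


# Toy 2-d embeddings, invented for this example.
queries = [[1.0, 0.0], [0.0, 1.0]]
positives = [[0.9, 0.1], [0.1, 0.9]]
hard = [[[0.8, 0.3]], [[0.3, 0.8]]]  # related but not the matching passage

loss_pairs = info_nce(queries, positives)        # stage II style
loss_hard = info_nce(queries, positives, hard)   # stage III style
print(loss_pairs, loss_hard)
```

In this sketch a hard negative sits close to the query in embedding space, so it contributes a large term to the normalizer and raises the loss until the model learns to separate it from the true positive. That is one way to read why stage III matters most for retrieval.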
