This paper is available on arXiv under a CC 4.0 license.
Authors:
(1) Michael Günther, michael.guenther@jina.ai;
(2) Jackmin Ong, jackmin.ong@jina.ai;
(3) Isabelle Mohr, isabelle.mohr@jina.ai;
(4) Alaeddine Abdessalem, alaeddine.abdessalem@jina.ai;
(5) Tanguy Abel, tanguy.abel@jina.ai;
(6) Mohammad Kalim Akram, kalim.akram@jina.ai;
(7) Susana Guzman, susana.guzman@jina.ai;
(8) Georgios Mastrapas, georgios.mastrapas@jina.ai;
(9) Saba Sturua, saba.sturua@jina.ai;
(10) Bo Wang, bo.wang@jina.ai;
(11) Maximilian Werk, maximilian.werk@jina.ai;
(12) Nan Wang, nan.wang@jina.ai;
(13) Han Xiao, han.xiao@jina.ai.
The training process for Jina Embeddings v2 is divided into three stages:
I Pre-training the Backbone: For the backbone pre-training, we design a modified BERT model capable of encoding documents with up to 8192 tokens. This model is trained on a full-text corpus using a masked language modeling objective (see the first sketch after this list).
II First Fine-tuning with Text Pairs: To encode a text passage into a single vector representation, the model is fine-tuned in an unsupervised manner on text pairs (see the second sketch below).
III Second Fine-tuning with Hard Negatives: The model is further fine-tuned using text pairs complemented with hard negatives (sketched after the next paragraph). This stage is crucial for enabling the model to better distinguish between relevant passages and related but irrelevant text passages.
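As a rough illustration of the Stage I objective, the sketch below masks a random subset of tokens and trains the encoder to recover them. This is a minimal PyTorch sketch rather than the actual training code: the masking scheme, the 15% masking probability, and the `encoder` interface (returning per-token vocabulary logits) are assumptions made for illustration.

```python
# Minimal sketch of one masked-language-modeling step (Stage I).
# The encoder interface and masking scheme are illustrative assumptions,
# not the exact Jina Embeddings v2 configuration.
import torch
import torch.nn.functional as F

def mlm_step(token_ids, encoder, vocab_size, mask_token_id, mask_prob=0.15):
    """Mask random tokens, predict the originals, return the MLM loss."""
    labels = token_ids.clone()
    # Choose positions to mask (simple Bernoulli masking for illustration).
    mask = torch.rand(token_ids.shape, device=token_ids.device) < mask_prob
    labels[~mask] = -100                 # ignore unmasked positions in the loss
    corrupted = token_ids.clone()
    corrupted[mask] = mask_token_id      # replace selected tokens with [MASK]
    logits = encoder(corrupted)          # assumed shape: (batch, seq_len, vocab_size)
    return F.cross_entropy(
        logits.reshape(-1, vocab_size), labels.reshape(-1), ignore_index=-100
    )
```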
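Stage II can be pictured as follows: the encoder's token embeddings are pooled into a single vector per text, and each (query, passage) pair is trained with an InfoNCE-style contrastive loss in which the other passages in the batch act as negatives. The mean pooling, the temperature value, and the symmetric formulation in this sketch are assumptions for illustration, not necessarily the paper's exact objective.

```python
# Hedged sketch of Stage II pair fine-tuning with in-batch negatives.
import torch
import torch.nn.functional as F

def mean_pool(token_embeddings, attention_mask):
    """Average token embeddings, ignoring padding positions."""
    mask = attention_mask.unsqueeze(-1).float()
    return (token_embeddings * mask).sum(dim=1) / mask.sum(dim=1).clamp(min=1e-9)

def pair_contrastive_loss(query_emb, passage_emb, temperature=0.05):
    """InfoNCE over a batch of pairs; other in-batch passages serve as negatives."""
    q = F.normalize(query_emb, dim=-1)
    p = F.normalize(passage_emb, dim=-1)
    logits = q @ p.T / temperature       # (batch, batch) cosine-similarity matrix
    targets = torch.arange(q.size(0), device=q.device)
    # Symmetric loss: match queries to passages and passages to queries.
    return 0.5 * (F.cross_entropy(logits, targets) + F.cross_entropy(logits.T, targets))
```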
While both stages II and III are geared towards training the models for embedding tasks, the latter is especially critical for improving the model’s performance in retrieval and classification tasks (refer to Section 6.2).
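Stage III extends the same contrastive idea: each query is scored against its positive passage, a set of mined hard negatives, and the remaining in-batch passages, which pushes the model to separate relevant passages from merely related ones. The sketch below is an illustration under assumptions; the number of hard negatives, the temperature, and the exact composition of the candidate pool are not taken from the paper.

```python
# Hedged sketch of a Stage III objective with explicit hard negatives.
import torch
import torch.nn.functional as F

def hard_negative_loss(query_emb, pos_emb, neg_emb, temperature=0.05):
    """
    query_emb: (B, D) query embeddings
    pos_emb:   (B, D) matching passage embeddings
    neg_emb:   (B, N, D) mined hard negatives per query
    """
    q = F.normalize(query_emb, dim=-1)
    pos = F.normalize(pos_emb, dim=-1)
    neg = F.normalize(neg_emb, dim=-1)
    pos_scores = (q * pos).sum(dim=-1, keepdim=True) / temperature     # (B, 1)
    neg_scores = torch.einsum("bd,bnd->bn", q, neg) / temperature      # (B, N) hard negatives
    in_batch = q @ pos.T / temperature                                 # (B, B) other positives as negatives
    # Column 0 holds the true positive; everything else is treated as a negative.
    logits = torch.cat([pos_scores, neg_scores, in_batch], dim=1)
    # Mask the diagonal of the in-batch block so the positive is not counted twice.
    B, N = q.size(0), neg_emb.size(1)
    diag_mask = torch.zeros_like(logits, dtype=torch.bool)
    diag_mask[:, 1 + N:] = torch.eye(B, dtype=torch.bool, device=q.device)
    logits = logits.masked_fill(diag_mask, float("-inf"))
    targets = torch.zeros(B, dtype=torch.long, device=q.device)
    return F.cross_entropy(logits, targets)
```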