JINA EMBEDDINGS 2: 8192-Token General-Purpose Text Embeddings for Long Documents: Training Process

cover
23 Feb 2024

This paper is available on arxiv under CC 4.0 license.

Authors:

(1) Michael Günther, michael.guenther;

(2) Jackmin Ong, jackmin.ong;

(3) Isabelle Mohr, isabelle.mohr;

(4) Alaeddine Abdessalem, alaeddine.abdessalem;

(5) Tanguy Abel, tanguy.abel;

(6) Mohammad Kalim Akram, kalim.akram;

(7) Susana Guzman, susana.guzman;

(8) Georgios Mastrapas, georgios.mastrapas;

(9) Saba Sturua, saba.sturua;

(10) Bo Wang, bo.wang;

(11) Maximilian Werk, maximilian.werk;

(12) Nan Wang, nan.wang;

(13) Han Xiao, han.xiao}@jina.ai.

3 Training Process Overview

The training process for Jina Embeddings v2 is divided into three stages:

I Pre-training the Backbone: For the backbone pre-training, we design a modified BERT model capable of encoding documents with up to 8192 tokens. This model is trained on a full-text corpus using a masked language modeling objective.

II First Fine-tuning with Text Pairs: To encode a text passage into a single vector representation, the model is fine-tuned in an unsupervised manner on text pairs.

III Second Fine-tuning with Hard Negatives: The model is further fine-tuned using text pairs complemented with hard negatives. This

Table 1: Architecture specifications for the Jina BERT models of varying sizes. The number of attention heads is selected to ensure a consistent head dimension of 64.

stage is crucial for enabling the model to better distinguish between relevant passages and related, but irrelevant text passages.

While both stages II and III are geared towards training the models for embedding tasks, the latter is especially critical for improving the model’s performance in retrieval and classification tasks (refer to Section 6.2).