JINA EMBEDDINGS 2: 8192-Token General-Purpose Text Embeddings: Backbone Pre-training

23 Feb 2024

This paper is available on arXiv under a CC 4.0 license.

Authors:

(1) Michael Günther, michael.guenther@jina.ai;

(2) Jackmin Ong, jackmin.ong@jina.ai;

(3) Isabelle Mohr, isabelle.mohr@jina.ai;

(4) Alaeddine Abdessalem, alaeddine.abdessalem@jina.ai;

(5) Tanguy Abel, tanguy.abel@jina.ai;

(6) Mohammad Kalim Akram, kalim.akram@jina.ai;

(7) Susana Guzman, susana.guzman@jina.ai;

(8) Georgios Mastrapas, georgios.mastrapas@jina.ai;

(9) Saba Sturua, saba.sturua@jina.ai;

(10) Bo Wang, bo.wang@jina.ai;

(11) Maximilian Werk, maximilian.werk@jina.ai;

(12) Nan Wang, nan.wang@jina.ai;

(13) Han Xiao, han.xiao@jina.ai.

4 Backbone Pre-training

For the backbone training, we introduce a novel transformer model. Although its architecture is akin to the BERT model proposed by [Devlin et al., 2019], we implement several modifications to enhance its ability to encode extended text sequences and to generally bolster its language modeling capabilities. For the training process, we largely adopt the approach described in [Liu et al., 2019a], incorporating additional performance optimizations.

4.1 Model Architecture

Attention with Linear Biases: For the self-attention mechanism within the attention blocks, we adopt the Attention with Linear Biases (ALiBi) approach [Press et al., 2022]. ALiBi forgoes positional embeddings. Instead, it encodes positional information directly within the self-attention layer by adding a constant bias term to the attention score matrix of each layer, so that nearby tokens attend to each other more strongly. The original implementation was designed for causal language modeling and applied biases only in the causal direction, which is not compatible with the bidirectional self-attention of our encoder model. We therefore employ the symmetric encoder variant, in which the attention biases are mirrored so that they are consistent in both directions [3]. Figure 1 depicts the computation of attention scores within the multi-head attention heads. Each head’s scaling value m_i, out of the total n heads, is derived using Equation (1).
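The symmetric bias can be illustrated with a minimal sketch. Equation (1) is not reproduced in this section, so the slope schedule below is an assumption: it uses the geometric sequence from the ALiBi paper, m_i = 2^(-8i/n), which applies when the number of heads n is a power of two. The function names are hypothetical and not taken from the authors' code.

```python
import math

def alibi_slopes(n_heads):
    """Per-head slopes m_1..m_n, ASSUMING n_heads is a power of two.

    Uses the geometric sequence from the ALiBi paper, m_i = 2^(-8 i / n).
    This is a stand-in for the paper's Equation (1), which is not shown here.
    """
    if n_heads <= 0 or n_heads & (n_heads - 1):
        raise ValueError("this sketch only handles power-of-two head counts")
    return [2.0 ** (-8.0 * i / n_heads) for i in range(1, n_heads + 1)]

def symmetric_alibi_bias(seq_len, slope):
    """Bias for one head: -slope * |i - j|, identical in both directions."""
    return [[-slope * abs(i - j) for j in range(seq_len)] for i in range(seq_len)]

def softmax(row):
    peak = max(row)  # subtract the maximum for numerical stability
    exps = [math.exp(x - peak) for x in row]
    total = sum(exps)
    return [e / total for e in exps]

def biased_attention_weights(scores, slope):
    """Add the bias to raw attention scores, then apply a row-wise softmax."""
    bias = symmetric_alibi_bias(len(scores), slope)
    return [softmax([s + b for s, b in zip(s_row, b_row)])
            for s_row, b_row in zip(scores, bias)]

slopes = alibi_slopes(8)
# With identical raw scores, the bias alone decides the attention pattern.
uniform_scores = [[0.0] * 4 for _ in range(4)]
weights = biased_attention_weights(uniform_scores, slopes[0])
```

With uniform raw scores, each row of `weights` peaks on the diagonal and falls off with distance on both sides, which is the mirrored behavior described above.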

Gated Linear Units: For the feedforward sublayers within the attention blocks, we adopt Gated Linear Units (GLU), originally introduced in [Dauphin et al., 2016]. GLUs have been shown to improve performance when incorporated into transformers [Shazeer, 2020]. For the small and base models, we employ the GEGLU variant, which uses the GELU activation function for the GLU. For the large model, we instead use the ReGLU variant with the ReLU activation function, because we observed that training the large model with GEGLU was unstable despite its promising initial MLM accuracy.
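The difference between the two variants comes down to the activation applied to the gate. The following sketch assumes the bias-free formulation from [Shazeer, 2020], FFN(x) = (act(xW) ⊗ xV) W2, and uses plain Python lists in place of tensors. The weight names and the toy identity matrices are illustrative only.

```python
import math

def gelu(x):
    """Exact GELU: x * Phi(x), with Phi the standard normal CDF."""
    return 0.5 * x * (1.0 + math.erf(x / math.sqrt(2.0)))

def relu(x):
    return max(0.0, x)

def matvec(x, w):
    """Row vector x (length d_in) times matrix w (d_in x d_out)."""
    return [sum(xi * w[i][j] for i, xi in enumerate(x)) for j in range(len(w[0]))]

def glu_ffn(x, w_gate, w_value, w_out, activation):
    """Gated feedforward sublayer: (activation(x W) * (x V)) W2, biases omitted."""
    gate = [activation(g) for g in matvec(x, w_gate)]
    value = matvec(x, w_value)
    hidden = [g * v for g, v in zip(gate, value)]  # element-wise gating
    return matvec(hidden, w_out)

# Toy weights (identity matrices) so the gating effect is easy to read off.
identity = [[1.0, 0.0], [0.0, 1.0]]
x = [1.0, -2.0]
geglu_out = glu_ffn(x, identity, identity, identity, gelu)  # GEGLU: GELU gate
reglu_out = glu_ffn(x, identity, identity, identity, relu)  # ReGLU: ReLU gate
```

In this toy case ReGLU zeroes the second component entirely because its gate input is negative, while GEGLU lets a small signal through.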

Layer Normalization: For Layer Normalization [?], we follow the post-layer normalization approach from [Vaswani et al., 2017] in our attention blocks. Preliminary tests with pre-layer normalization, as described in [Shoeybi et al., 2019] and [Nguyen and Salazar, 2019], did not improve training stability or performance, so we opted not to integrate it into our model.
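The two placements differ only in where normalization sits relative to the residual connection. The sketch below is a simplified illustration: the learned gain and bias of layer normalization are omitted, and `sublayer` is a hypothetical stand-in for an attention or feedforward sublayer.

```python
import math

def layer_norm(x, eps=1e-12):
    """Normalize a vector to zero mean and unit variance (no learned gain/bias)."""
    mean = sum(x) / len(x)
    var = sum((v - mean) ** 2 for v in x) / len(x)
    return [(v - mean) / math.sqrt(var + eps) for v in x]

def post_ln_block(x, sublayer):
    """Post-LN: normalize AFTER adding the residual, LayerNorm(x + f(x))."""
    return layer_norm([a + b for a, b in zip(x, sublayer(x))])

def pre_ln_block(x, sublayer):
    """Pre-LN: normalize the sublayer INPUT only, x + f(LayerNorm(x))."""
    return [a + b for a, b in zip(x, sublayer(layer_norm(x)))]

def double(v):
    """Hypothetical sublayer used only for this illustration."""
    return [2.0 * t for t in v]

x = [1.0, 2.0, 3.0, 4.0]
post_out = post_ln_block(x, double)  # output itself is normalized
pre_out = pre_ln_block(x, double)    # residual stream keeps its original scale
```

With post-layer normalization the block output always has zero mean and unit variance, whereas pre-layer normalization leaves the residual stream unnormalized.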

4.2 Full-Text Datasets for Pre-training

For the pre-training phase, we use the English “Colossal, Cleaned, Common Crawl (C4)” dataset [4], which comprises approximately 365 million text documents harvested from the web, totaling around 170 billion tokens. As described in [Raffel et al., 2020], C4 is a cleaned version of Common Crawl that applies heuristics for cleanup and language detection and retains only English content. As a result, our models are monolingual and tailored exclusively to English texts. The cleaning process also removes webpages hosting inappropriate content. We reserve 1% of the dataset for evaluating validation loss and the accuracy of the masked language modeling (MLM) task.

Figure 1: With ALiBi attention, a linear bias is incorporated into each attention score preceding the softmax operation. Each attention head employs a distinct constant scalar, m, which diversifies its computation. Our model

4.3 Training Algorithm

Our model’s pre-training uses the masked language modeling objective and excludes the next sentence prediction (NSP) task due to its perceived limited contribution to downstream task performance [Liu et al., 2019a]. We randomly mask 30% of the input tokens, employing whole word masking [Devlin et al., 2019], and train the models to predict these masked tokens. Of the masked tokens, 80% are substituted with the [MASK] token, 10% with a random token, and the remaining 10% stay unaltered.
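The 30% selection and the 80/10/10 replacement rule can be sketched as follows. This is a simplified, token-level illustration: it does not implement whole word masking, it selects exactly round(0.30 · n) positions per sequence (an assumption about how "30%" is applied), and `MASK_ID` and `VOCAB_SIZE` are hypothetical placeholder values rather than the model's actual tokenizer settings.

```python
import random

MASK_ID = 103        # hypothetical id of the [MASK] token
VOCAB_SIZE = 30522   # hypothetical vocabulary size

def mask_tokens(token_ids, rng, mask_prob=0.30):
    """Return (inputs, labels) for masked language modeling.

    labels[i] holds the original token at selected positions and None
    elsewhere, so only selected positions contribute to the loss.
    """
    inputs = list(token_ids)
    labels = [None] * len(token_ids)
    n_to_mask = round(mask_prob * len(token_ids))
    for pos in rng.sample(range(len(token_ids)), n_to_mask):
        labels[pos] = token_ids[pos]
        roll = rng.random()
        if roll < 0.8:
            inputs[pos] = MASK_ID                    # 80%: replace with [MASK]
        elif roll < 0.9:
            inputs[pos] = rng.randrange(VOCAB_SIZE)  # 10%: random token
        # remaining 10%: keep the original token unchanged
    return inputs, labels

rng = random.Random(0)
tokens = list(range(1000, 1100))  # 100 dummy token ids
inputs, labels = mask_tokens(tokens, rng)
```

Positions that keep their original token are still prediction targets, which is why the labels are tracked separately from the modified inputs.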

Given our model’s reliance on ALiBi attention [Press et al., 2022], training position embeddings becomes unnecessary. This allows us to pre-train more efficiently on shorter sequences and adapt to longer sequences in subsequent tasks. Throughout our pre-training, we operate on sequences capped at 512 tokens in length. Diverging from the methods in [Devlin et al., 2019] and [Liu et al., 2019a], our sequences originate from individual documents without any multi-document packing. Furthermore, we refrain from sampling multiple sequences from a single document. For each document, we consider only its initial 512 tokens and truncate any excess. Our global batch size is fixed at 4096, but because sequence lengths vary, the number of masked tokens that enter the loss calculation differs from batch to batch.

Optimizer: Mirroring the optimization strategy of RoBERTa [Liu et al., 2019a], we employ the AdamW algorithm [Loshchilov and Hutter, 2017] with β1 = 0.9, β2 = 0.98, ϵ = 1e−6, a weight decay of 0.01, dropout of 0.1, and attention dropout of 0.1. Our learning rate schedule is linear, starting at 0 and peaking at a rate of η after 10,000 steps. The values of η are 1e−3, 6e−4, and 4e−4 for the small, base, and large models, respectively. A linear decay to zero ensues after reaching the 100,000 steps threshold.
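One possible reading of this schedule is sketched below: linear warmup to η over 10,000 steps, a constant rate until step 100,000, then a linear decay to zero. The text does not state when the decay finishes, so `total_steps` is a required parameter and the value used in the example is purely hypothetical. If the decay is instead meant to end at step 100,000, the same function covers that reading with `decay_start=10_000` and `total_steps=100_000`.

```python
def learning_rate(step, peak_lr, total_steps,
                  warmup_steps=10_000, decay_start=100_000):
    """Piecewise-linear schedule: warmup, optional plateau, decay to zero.

    total_steps (where the rate reaches zero) is NOT given in the text and
    must be supplied by the caller.
    """
    if total_steps <= decay_start:
        raise ValueError("total_steps must be greater than decay_start")
    if step < warmup_steps:
        return peak_lr * step / warmup_steps   # linear warmup from 0
    if step < decay_start:
        return peak_lr                         # hold the peak rate
    remaining = max(0, total_steps - step)
    return peak_lr * remaining / (total_steps - decay_start)  # linear decay

# Peak rates from the text, keyed by model size.
PEAK_LR = {"small": 1e-3, "base": 6e-4, "large": 4e-4}

HYPOTHETICAL_TOTAL_STEPS = 200_000  # assumption for illustration only
base_lr_midway = learning_rate(50_000, PEAK_LR["base"], HYPOTHETICAL_TOTAL_STEPS)
```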

Mixed precision training: We resort to FP16 dynamic mixed precision [Micikevicius et al., 2018] for pre-training our models, facilitated by the DeepSpeed software package [Rasley et al., 2020]. Our preliminary tests using BF16 revealed unsatisfactory performance metrics, both in MLM accuracy and the downstream GLUE tasks.
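The idea behind dynamic loss scaling in FP16 training can be shown with a small, framework-free sketch. This is not DeepSpeed's implementation: the class name, default values, and update rule are illustrative assumptions that follow the general scheme of halving the scale on overflow and doubling it after a run of overflow-free steps.

```python
import math

class DynamicLossScaler:
    """Illustrative dynamic loss scaler (hypothetical, not DeepSpeed's API)."""

    def __init__(self, init_scale=2.0 ** 16, growth_interval=1000, factor=2.0):
        self.scale = init_scale
        self.growth_interval = growth_interval
        self.factor = factor
        self.good_steps = 0

    def update(self, grads):
        """Inspect gradients; return True if the optimizer step should run."""
        overflow = any(math.isinf(g) or math.isnan(g) for g in grads)
        if overflow:
            # Skip this step and shrink the scale to avoid further overflow.
            self.scale = max(1.0, self.scale / self.factor)
            self.good_steps = 0
            return False
        self.good_steps += 1
        if self.good_steps >= self.growth_interval:
            # A long stable run: try a larger scale to preserve small gradients.
            self.scale *= self.factor
            self.good_steps = 0
        return True

scaler = DynamicLossScaler(init_scale=1024.0, growth_interval=2)
step_applied = scaler.update([0.1, float("inf")])  # overflow: step is skipped
scale_after_overflow = scaler.scale
scaler.update([0.1, 0.2])
scaler.update([0.3, 0.4])                          # two clean steps: scale grows
scale_after_recovery = scaler.scale
```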


[3] https://github.com/ofirpress/attention_with_linear_biases/issues/5

[4] https://huggingface.co/datasets/c4