This paper is available on arXiv under the CC 4.0 license.
Authors:
(1) Michael Günther, michael.guenther;
(2) Jackmin Ong, jackmin.ong;
(3) Isabelle Mohr, isabelle.mohr;
(4) Alaeddine Abdessalem, alaeddine.abdessalem;
(5) Tanguy Abel, tanguy.abel;
(6) Mohammad Kalim Akram, kalim.akram;
(7) Susana Guzman, susana.guzman;
(8) Georgios Mastrapas, georgios.mastrapas;
(9) Saba Sturua, saba.sturua;
(10) Bo Wang, bo.wang;
(11) Maximilian Werk, maximilian.werk;
(12) Nan Wang, nan.wang;
(13) Han Xiao, han.xiao}@jina.ai.
6 Evaluation
To evaluate the efficacy of our approach, we begin with a comprehensive analysis of our pre-trained backbone models in Section 6.1, followed by an in-depth assessment of our embedding models in Section 6.2. We also conduct experiments on how encoding longer sequences affects the performance of the embeddings, presented in Section 6.2.2.
6.1 Evaluation of Jina BERT
Following previous work [Liu et al., 2019b], we evaluate our pretrained models on the GLUE benchmark [Wang et al., 2018]. General Language Understanding Evaluation (GLUE) is a collection of nine datasets for evaluating natural language understanding systems. Six tasks are framed as either single-sentence classification or sentence-pair classification tasks. The GLUE organizers provide training, development, and test data splits, as well as a submission server and leaderboard.[5] The test split does not contain labels, and the submission server allows participants to evaluate and compare their systems against the private labels of the test split.
For the Jina BERT training described in Section 4, we fine-tune the pre-trained models on the corresponding single-task training data using several hyperparameter settings and, for each task, pick the best fine-tuning hyperparameters on the development set.
Following the methodology of [Phang et al., 2018], for RTE, STS, and MRPC, we fine-tune starting from the MNLI single-task model, rather than the baseline pretrained Jina BERT models. As in the BERT paper [Devlin et al., 2019], our fine-tuning procedure relies on representing the input sequence and using the final hidden vector C ∈ ℝ^H corresponding to the first input token ([CLS]) as the aggregate representation.
We train for 10 epochs with batch sizes {16, 32} and learning rates {1e−5, 2e−5, 3e−5}. For each task, the best fine-tuned model on the development set is used for the test set.
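The selection procedure above amounts to a small grid search over six configurations per task. The following is a minimal sketch of that loop, not the authors' training code: `dev_score_fn` is a hypothetical callable standing in for "fine-tune with this configuration and return the development-set score", and the dictionary keys are illustrative names.

```python
from itertools import product

# Grid stated in the text: 10 epochs, batch sizes {16, 32},
# learning rates {1e-5, 2e-5, 3e-5}.
BATCH_SIZES = [16, 32]
LEARNING_RATES = [1e-5, 2e-5, 3e-5]
NUM_EPOCHS = 10


def select_best_config(dev_score_fn):
    """Return (best_config, best_score) over the full grid.

    dev_score_fn is a hypothetical callable: it is assumed to fine-tune
    a model with the given config and return its development-set score
    (higher is better).
    """
    best_config, best_score = None, float("-inf")
    for batch_size, learning_rate in product(BATCH_SIZES, LEARNING_RATES):
        config = {
            "batch_size": batch_size,
            "learning_rate": learning_rate,
            "epochs": NUM_EPOCHS,
        }
        score = dev_score_fn(config)
        if score > best_score:
            best_config, best_score = config, score
    return best_config, best_score
```

Only the configuration selected this way would then be submitted for evaluation on the test split, so the test labels play no role in hyperparameter selection.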
In Table 2, we report the results of the best performing models on the test sets after submission to the GLUE benchmark server.
Furthermore, we evaluate Jina BERT models on long documents by computing the accuracy of the MLM task at varying sequence lengths. The accuracy of masked language modeling is computed on 50,000 samples from the C4 validation set where, for each chosen sequence length, each sample document is tokenized and truncated to fit the sequence length. We compare Jina BERT to RoBERTa and BERT models in Figure 2. The figure shows that, even though Jina BERT models were trained with a sequence length of 512, the MLM accuracy does not drop when we extrapolate to a sequence length of 8192. The other BERT and RoBERTa models use absolute positional embeddings trained with a sequence length of 512, so their MLM accuracy cannot be computed beyond 512 tokens. The figure demonstrates ALiBi’s effectiveness in maintaining MLM performance during inference for long documents.
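As an illustration of the metric, MLM accuracy can be read as the fraction of masked positions whose predicted token matches the original token. The sketch below makes one assumption that the text does not state: unmasked positions are marked with the label value -100, a common convention in PyTorch-based training code. The function name is illustrative.

```python
def mlm_accuracy(predicted_ids, label_ids, ignore_index=-100):
    """Fraction of masked positions that were predicted correctly.

    predicted_ids: predicted token id for every position.
    label_ids: original token id at masked positions and ignore_index
        everywhere else (the -100 marker is an assumed convention).
    """
    correct = 0
    total = 0
    for predicted, label in zip(predicted_ids, label_ids):
        if label == ignore_index:
            continue  # position was not masked, so it is not scored
        total += 1
        correct += int(predicted == label)
    return correct / total if total else 0.0
```

To evaluate at a given sequence length, both lists would first be cut to that length (for example `predicted_ids[:512]` and `label_ids[:512]`), so only masked positions inside the truncated window are scored.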
6.2 Evaluation of Jina Embeddings v2
To comprehensively evaluate our embedding models, we employ the Massive Text Embedding Benchmark (MTEB) [Muennighoff et al., 2023]. Our choice of MTEB is motivated by its unparalleled breadth, distinguishing it among embedding benchmarks. Rather than focusing on a single task and dataset, MTEB covers an expansive set of 8 tasks, encompassing a rich collection of 58 datasets across 112 languages. This expansive benchmark allows us to scrutinize our model’s adaptability across diverse applications and languages and benchmark it against other top-performing models.
However, a limitation of the MTEB benchmark is its omission of very long texts, which are essential for evaluating our model’s prowess in handling 8192 sequence lengths. Consequently, we introduce new retrieval and clustering tasks featuring extended documents, and we detail the performance of our model against its peers in Section 6.2.2.
Clustering: The goal here is to aptly group a collection of sentences or paragraphs. Within the MTEB benchmark suite, a mini-batch k-means model is employed, operating with a batch size of 32. Here, k represents the number of unique labels in the dataset. Model performance is evaluated using the V measure, a metric insensitive to cluster label permutations, guaranteeing that assessments are independent of label configurations.
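To make the permutation-insensitivity of the V measure concrete, the sketch below computes it from scratch as the harmonic mean of homogeneity and completeness, both defined through conditional entropies. It is an illustration in plain Python, not the MTEB implementation; the k-means step itself is omitted and the cluster assignments are taken as given.

```python
from collections import Counter
from math import log


def entropy(labels):
    """Shannon entropy (natural log) of a label sequence."""
    n = len(labels)
    return -sum((c / n) * log(c / n) for c in Counter(labels).values())


def conditional_entropy(targets, given):
    """H(targets | given) for two aligned label sequences."""
    n = len(targets)
    joint = Counter(zip(given, targets))
    given_counts = Counter(given)
    return -sum(
        (c / n) * log(c / given_counts[g]) for (g, _), c in joint.items()
    )


def v_measure(true_labels, cluster_ids):
    """Harmonic mean of homogeneity and completeness."""
    h_true = entropy(true_labels)
    h_clusters = entropy(cluster_ids)
    homogeneity = (
        1.0 if h_true == 0
        else 1.0 - conditional_entropy(true_labels, cluster_ids) / h_true
    )
    completeness = (
        1.0 if h_clusters == 0
        else 1.0 - conditional_entropy(cluster_ids, true_labels) / h_clusters
    )
    if homogeneity + completeness == 0:
        return 0.0
    return 2 * homogeneity * completeness / (homogeneity + completeness)
```

With true labels `["a", "a", "b", "b"]`, the assignments `[0, 0, 1, 1]` and `[1, 1, 0, 0]` receive the same score, because only the grouping matters and not the cluster ids. In practice a library routine such as scikit-learn's `v_measure_score` would normally be used instead of a hand-written version.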
We incorporate two new clustering tasks featuring extended documents within the MTEB clustering task subset. The inaugural task, named PatentClustering, draws from the BigPatent[6] dataset [Sharma et al., 2019], challenging the k-means model to organize patents by their respective categories. Patent documents average 6,376 tokens, spanning a range from a brief 569 tokens to an extensive 218,434 tokens. Our second task, titled WikiCitiesClustering, sources from the English subset of the refined Wikipedia dump [Foundation, 2022], available as a dataset on Hugging Face[7]. For this task, we curate a roster of nations from Wikidata and extract Wikipedia articles of their cities from the refined dataset. The objective is to group cities by their parent country. On average, articles consist of 2,031 tokens, with the length varying between a succinct 21 tokens to a comprehensive 20,179 tokens.
Retrieval: This task entails a dataset comprising a corpus, a set of queries, and associated mappings connecting each query to pertinent corpus documents. The mission is to discern relevant documents for a specific query. Both queries and corpus documents undergo encoding, after which their similarity scores are derived using cosine similarity. Subsequently, metrics like nDCG@10 (which serves as the primary metric), MRR@k, MAP@k, precision@k, and recall@k are computed for diverse k values. This task is inspired by datasets and evaluation methods presented by BEIR [Thakur et al., 2021].
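The retrieval pipeline described above (encode, rank by cosine similarity, score with nDCG@10) can be sketched in a few lines. This is a simplified illustration rather than the MTEB or BEIR code: embeddings are plain lists of floats assumed to be non-zero, the function names are illustrative, and the gain is taken to be the relevance grade itself, which is one of several nDCG conventions.

```python
from math import log2, sqrt


def cosine_similarity(u, v):
    """Cosine similarity of two non-zero vectors of equal length."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = sqrt(sum(a * a for a in u))
    norm_v = sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def rank_documents(query_vec, doc_vecs):
    """Return document ids sorted by descending similarity to the query.

    doc_vecs maps a document id to its embedding.
    """
    return sorted(
        doc_vecs,
        key=lambda doc_id: cosine_similarity(query_vec, doc_vecs[doc_id]),
        reverse=True,
    )


def ndcg_at_k(ranked_ids, relevance, k=10):
    """nDCG@k with linear gain.

    relevance maps a document id to its relevance grade; ids that are
    missing from the mapping count as non-relevant (grade 0).
    """
    dcg = sum(
        relevance.get(doc_id, 0) / log2(rank + 2)
        for rank, doc_id in enumerate(ranked_ids[:k])
    )
    ideal_grades = sorted(relevance.values(), reverse=True)[:k]
    idcg = sum(grade / log2(rank + 2) for rank, grade in enumerate(ideal_grades))
    return dcg / idcg if idcg > 0 else 0.0
```

The benchmark score for a dataset would then be the mean of `ndcg_at_k` over all queries; the other listed metrics (MRR@k, MAP@k, precision@k, recall@k) can be computed from the same ranked lists.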
To expand the scope of the MTEB, we introduce a new retrieval task named NarrativeQA, derived from the narrativeqa[8] dataset. This dataset boasts realistic QA instances, curated from literature (encompassing both fiction and non-fiction) and film scripts. The corpus averages 74,843 tokens per document, with the lengthiest document tallying up to 454,746 tokens, and the most concise one comprising 4,550 tokens.
6.2.1 Results on MTEB
The evaluation of embedding models within the MTEB benchmark, as illustrated in Table 3, reveals significant contrasts between Jina’s text embedding models, namely jina-small-v2 and jina-base-v2, and other contemporary models. These differences are especially pronounced in tasks showing marked performance disparities, such as Classification (CF) and Retrieval (RT).
In Classification (CF), the jina-base-v2 model, equipped with 137 million parameters, emerges as a leading performer. It records superior scores, outpacing most competing models, underscoring its efficacy in text classification. Conversely, the jina-small-v2 model, equipped with a modest 33 million parameters, trails behind some other models in this task. This underscores the pivotal role model size plays in certain downstream tasks, with more extensive architectures yielding potential benefits.
For the Retrieval (RT) task, jina-small-v2 showcases formidable performance, signaling its adeptness for information retrieval. It ranks amidst top-tier models, indicating its prowess in retrieval-centric tasks. Similarly, jina-base-v2 excels, registering a slightly superior score, reaffirming its formidable retrieval aptitude. Both models underscore their credibility in tasks necessitating adept information retrieval. Given that models all-MiniLM-L6-v2 and all-mpnet-base-v2 omit the second-stage finetuning which jina-small-v2 and jina-base-v2 undergo, it’s foreseeable that our models would excel in these tasks.
In conclusion, both the base and small text embedding models display commendable performance within the MTEB benchmark. Their standout performance, relative to other models in tasks like Classification and Retrieval, suggests model size’s influential role in specific text processing endeavors. Both models reaffirm their potency in retrieval, marking them as pivotal tools for a plethora of natural language processing tasks.
6.2.2 Impact of Maximum Sequence Length
As delineated in Section 6.1, the pre-training generalizes across extended sequence lengths. Consequently, the MLM accuracy for long sequences, spanning up to 8192 tokens, mirrors that of shorter sequences, despite the exclusive training on abbreviated text sequences. During finetuning, our models train solely on texts not exceeding 512 tokens, yet they cater to texts reaching 8192 tokens for the MTEB evaluation detailed in Section 6.2.
To discern how sequence length impacts the accuracy of downstream tasks, we executed long document clustering and retrieval tasks, modulating the tokenizer’s maximum sequence length. This allows us to gauge the models’ performance on variable sequence lengths through truncation. Since a majority of the extant tasks in the MTEB feature documents under 512 tokens, we resort to our three novel datasets elucidated in Section 6.2, accessible on Hugging Face[10]. Furthermore, we employ the SciFact dataset [Wadden et al., 2020], given its substantial count of texts exceeding 512 tokens.
Figure 3 depicts the nDCG@10 retrieval and the V measure scores for the jina-base-v2 alongside four other renowned embedding models. Given that only jina-base-v2 and OpenAI’s text-embedding-ada-002 support an 8k sequence length, results reported for an 8191 sequence length for other models are truncated to their intrinsic maximum, typically 512. Generally, Figure 3 suggests that elongated sequence lengths contribute to enhanced outcomes. This assertion is particularly true for the NarrativeQA task, where extending the sequence length substantially bolsters performance. Due to the inherent nature of the dataset, models limited to the text’s commencement frequently underperform.
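The truncation rule used for this comparison can be stated compactly: each model sees at most the smaller of the requested sequence length and its own maximum. The sketch below illustrates that rule under assumed names; the helper functions are hypothetical, and the limits 512, 8191, and 8192 are the ones mentioned in the text and footnotes.

```python
def effective_max_length(requested_length, model_max_length):
    """Length actually used: the request, capped at the model's limit."""
    return min(requested_length, model_max_length)


def truncate_tokens(token_ids, requested_length, model_max_length):
    """Keep only the leading tokens that fit the effective length."""
    limit = effective_max_length(requested_length, model_max_length)
    return token_ids[:limit]


# A 10,000-token document evaluated at a requested length of 8191:
document = list(range(10_000))
short_context = truncate_tokens(document, 8191, model_max_length=512)
long_context = truncate_tokens(document, 8191, model_max_length=8192)
```

Here `short_context` keeps 512 tokens and `long_context` keeps 8191, so a model limited to 512 tokens produces the same embedding for every requested length of 512 or more.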
On the BigPatent clustering task, larger sequence lengths also result in better performance. However, on the WikiCities clustering task, longer sequence lengths seem to slightly diminish the models’ performance in most instances. This suggests that an increase in sequence length doesn’t always yield better outcomes. One explanation for this observation is that the initial paragraph of a Wikipedia article about a city typically mentions the country the city is in. Information towards the middle and end of the articles is often less pertinent for identifying the country and might alter the attributes that influence the clustering of the city embeddings.
[5] https://gluebenchmark.com
[6] https://huggingface.co/datasets/big_patent
[7] https://huggingface.co/datasets/wikipedia
[8] https://huggingface.co/datasets/narrativeqa
[9] For e5-base-v2, we abstained from employing specific prefixes like “query: ”, which might result in varied evaluation outcomes. text-embedding-ada-002 caps its encoding at 8191 tokens, not 8192.
[10] Our datasets are available at https://huggingface.co/jinaai