Text embeddings are useful features in many applications such as semantic search and computing text similarity. Previous work typically trains models customized for different use cases, varying in dataset choice, training objective, and model architecture. In this work, we show that contrastive pre-training on unsupervised data at scale leads to high-quality vector representations of text and code. The same unsupervised text embeddings that achieve new state-of-the-art results in linear-probe classification also display impressive semantic search capabilities, sometimes even performing competitively with fine-tuned models. On linear-probe classification accuracy averaged over 7 tasks, our best unsupervised model achieves relative improvements of 4% and 1.8% over the previous best unsupervised and supervised text embedding models, respectively. When evaluated on large-scale semantic search, the same text embeddings attain relative improvements of 23.4%, 14.7%, and 10.6% over previous best unsupervised methods on the MSMARCO, Natural Questions, and TriviaQA benchmarks, respectively. As with text embeddings, we train code embedding models on (text, code) pairs, obtaining a 20.8% relative improvement over the prior best result on code search.
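
The contrastive pre-training objective described above can be sketched as an InfoNCE-style loss with in-batch negatives: paired inputs are embedded, cosine similarities between all pairs in a batch are scaled by a temperature, and each example's true pair must score above every other in-batch candidate. The snippet below is a minimal illustration, not the authors' released code; the encoder is omitted, and the temperature value and symmetric averaging are placeholder assumptions.

```python
import torch
import torch.nn.functional as F

def in_batch_contrastive_loss(query_emb: torch.Tensor,
                              doc_emb: torch.Tensor,
                              temperature: float = 0.05) -> torch.Tensor:
    """Symmetric in-batch contrastive loss over (text, positive) pairs.

    query_emb, doc_emb: [batch, dim] embeddings of paired inputs.
    Every other example in the batch serves as a negative.
    The 0.05 temperature is a placeholder, not the paper's value.
    """
    # Cosine similarity via dot products of L2-normalized embeddings.
    q = F.normalize(query_emb, dim=-1)
    d = F.normalize(doc_emb, dim=-1)
    logits = q @ d.t() / temperature                  # [batch, batch]
    labels = torch.arange(q.size(0), device=q.device)
    # Matching pairs sit on the diagonal; score both directions.
    return (F.cross_entropy(logits, labels) +
            F.cross_entropy(logits.t(), labels)) / 2
```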


Results from the Paper


| Task | Dataset | Model | Metric | Value | Global Rank |
|---|---|---|---|---|---|
| Zero-shot Text Search | BEIR | cpt-text XL | Avg. Accuracy | 52.8 | #4 |
| Zero-shot Text Search | BEIR | BM25 (Robertson, 2009) | Avg. Accuracy | 47.6 | #13 |
| Zero-shot Text Search | BEIR | Contriever (Izacard et al., 2021) | Avg. Accuracy | 50.2 | #12 |
| Zero-shot Text Search | BEIR | Contriever (Izacard et al., 2021)-unsupervised | Avg. Accuracy | 40.9 | #18 |
| Zero-shot Text Search | BEIR | cpt-text L | Avg. Accuracy | 44.2 | #16 |
| Code Search | CodeSearchNet | cpt-code M | Overall | 93.5 | #1 |
| Code Search | CodeSearchNet | cpt-code M | Go | 97.5 | #2 |
| Code Search | CodeSearchNet | cpt-code M | Ruby | 85.5 | #2 |
| Code Search | CodeSearchNet | cpt-code M | Python | 99.9 | #1 |
| Code Search | CodeSearchNet | cpt-code M | Java | 94.4 | #1 |
| Code Search | CodeSearchNet | cpt-code M | JS | 86.5 | #1 |
| Code Search | CodeSearchNet | cpt-code M | PHP | 97.2 | #1 |
| Code Search | CodeSearchNet | cpt-code S | Overall | 93.4 | #2 |
| Code Search | CodeSearchNet | cpt-code S | Go | 97.7 | #1 |
| Code Search | CodeSearchNet | cpt-code S | Ruby | 86.3 | #1 |
| Code Search | CodeSearchNet | cpt-code S | Python | 99.8 | #2 |
| Code Search | CodeSearchNet | cpt-code S | Java | 94.0 | #2 |
| Code Search | CodeSearchNet | cpt-code S | JS | 86.0 | #2 |
| Code Search | CodeSearchNet | cpt-code S | PHP | 96.7 | #2 |
| Passage Ranking | MS MARCO | Fine-tuned SOTA | MRR@10 | 44.3 | #1 |
| Passage Ranking | MS MARCO | cpt-text XL | MRR@10 | 22.7 | #2 |
| Passage Ranking | MS MARCO | cpt-text L | MRR@10 | 21.5 | #3 |
| Passage Ranking | MS MARCO | BM25 | MRR@10 | 18.4 | #4 |
| Linear-Probe Classification | SentEval | cpt-text XL-unsupervised | Accuracy | 91.8 | #2 |
| Linear-Probe Classification | SentEval | cpt-text XL-supervised | Accuracy | 92.2 | #1 |
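
Both evaluation setups in the table reduce to simple operations on frozen embeddings: semantic search and passage ranking score candidates by cosine similarity to the query embedding, while linear-probe classification fits a logistic-regression classifier on top of the embeddings without updating the encoder. A minimal sketch, assuming a hypothetical embed() helper that maps a list of strings to L2-normalized embedding vectors:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# embed() is a hypothetical helper: list[str] -> [n, dim] np.ndarray of
# L2-normalized embeddings from a frozen cpt-style encoder.

def search(query: str, corpus_emb: np.ndarray, k: int = 10) -> np.ndarray:
    """Rank corpus passages by cosine similarity to the query.

    With unit-norm embeddings, cosine similarity is a dot product,
    so retrieval is a single matrix-vector multiply.
    """
    q = embed([query])[0]                  # [dim]
    scores = corpus_emb @ q                # [n_corpus]
    return np.argsort(-scores)[:k]         # indices of the top-k passages

def linear_probe(train_texts, train_labels, test_texts, test_labels) -> float:
    """Linear-probe classification: logistic regression on frozen embeddings."""
    clf = LogisticRegression(max_iter=1000)
    clf.fit(embed(train_texts), train_labels)
    return clf.score(embed(test_texts), test_labels)  # accuracy
```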

Methods


No methods listed for this paper.