Challenging problems such as open-domain question answering, fact checking, slot filling and entity linking require access to large, external knowledge sources. While some models do well on individual tasks, developing general models is difficult as each task might require computationally expensive indexing of custom knowledge sources, in addition to dedicated infrastructure. To catalyze research on models that condition on specific information in large textual resources, we present a benchmark for knowledge-intensive language tasks (KILT). All tasks in KILT are grounded in the same snapshot of Wikipedia, reducing engineering turnaround through the re-use of components, as well as accelerating research into task-agnostic memory architectures. We test both task-specific and general baselines, evaluating downstream performance in addition to the ability of the models to provide provenance. We find that a shared dense vector index coupled with a seq2seq model is a strong baseline, outperforming more tailor-made approaches for fact checking, open-domain question answering and dialogue, and yielding competitive results on entity linking and slot filling, by generating disambiguated text. KILT data and code are available at https://github.com/facebookresearch/KILT.
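To make the "shared dense vector index coupled with a seq2seq model" baseline concrete, here is a minimal retrieve-then-generate sketch using Hugging Face's RAG classes and the public facebook/rag-sequence-nq checkpoint, which pairs a DPR dense index with a BART generator. This is not KILT's official evaluation pipeline (that lives in the repository linked above); it only illustrates the architectural pattern, and `use_dummy_dataset=True` substitutes a small stand-in index so the snippet runs without downloading the full Wikipedia dense index.

```python
# Sketch of the dense-index + seq2seq pattern behind the paper's strongest
# baseline. Uses the public facebook/rag-sequence-nq checkpoint, NOT the
# KILT repo's own evaluation scripts.
from transformers import RagTokenizer, RagRetriever, RagSequenceForGeneration

tokenizer = RagTokenizer.from_pretrained("facebook/rag-sequence-nq")
# use_dummy_dataset=True loads a tiny stand-in index so the example runs
# without fetching the full Wikipedia dense index.
retriever = RagRetriever.from_pretrained(
    "facebook/rag-sequence-nq", index_name="exact", use_dummy_dataset=True
)
model = RagSequenceForGeneration.from_pretrained(
    "facebook/rag-sequence-nq", retriever=retriever
)

# Encode a question, retrieve supporting passages, and generate an answer.
inputs = tokenizer("who wrote the origin of species", return_tensors="pt")
output_ids = model.generate(input_ids=inputs["input_ids"])
print(tokenizer.batch_decode(output_ids, skip_special_tokens=True)[0])
```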

NAACL 2021

Results from the Paper


| Task | Dataset | Model | Metric | Value | Global Rank |
|---|---|---|---|---|---|
| Entity Linking | KILT: AIDA-YAGO2 | T5-base | KILT-AC | 74.05 | 5 |
| | | | R-Prec | 74.05 | 6 |
| | | | Recall@5 | 74.05 | 7 |
| | | | Accuracy | 74.05 | 6 |
| Question Answering | KILT: ELI5 | RAG | ROUGE-L | 14.05 | 7 |
| | | | F1 | 14.51 | 6 |
| Question Answering | KILT: ELI5 | BART+DPR | ROUGE-L | 17.41 | 6 |
| | | | F1 | 17.88 | 4 |
| Question Answering | KILT: ELI5 | T5-base | ROUGE-L | 19.08 | 5 |
| | | | F1 | 16.1 | 5 |
| Open-Domain Question Answering | KILT: ELI5 | T5-base | KILT-RL | 0.0 | 6 |
| | | | R-Prec | 0.0 | 10 |
| | | | Recall@5 | 0.0 | 10 |
| | | | ROUGE-L | 19.08 | 4 |
| | | | F1 | 16.1 | 8 |
| | | | KILT-F1 | 0.0 | 6 |
| Fact Verification | KILT: FEVER | T5-base | KILT-AC | 0.0 | 10 |
| | | | R-Prec | 0.0 | 14 |
| | | | Recall@5 | 0.0 | 14 |
| | | | Accuracy | 76.3 | 11 |
| Fact Verification | KILT: FEVER | RAG | KILT-AC | 53.45 | 7 |
| | | | R-Prec | 61.94 | 11 |
| | | | Recall@5 | 75.55 | 10 |
| | | | Accuracy | 86.31 | 8 |
| Open-Domain Question Answering | KILT: HotpotQA | T5-base | KILT-EM | 0.0 | 7 |
| | | | R-Prec | 0.0 | 11 |
| | | | Recall@5 | 0.0 | 11 |
| | | | EM | 12.64 | 8 |
| | | | F1 | 19.57 | 8 |
| | | | KILT-F1 | 0.0 | 7 |
| Open-Domain Question Answering | KILT: Natural Questions | T5-base | KILT-EM | 0.0 | 9 |
| | | | R-Prec | 0.0 | 13 |
| | | | Recall@5 | 0.0 | 13 |
| | | | EM | 19.6 | 11 |
| | | | F1 | 27.73 | 11 |
| | | | KILT-F1 | 0.0 | 9 |
| Slot Filling | KILT: T-REx | T5-base | KILT-AC | 0.0 | 13 |
| | | | R-Prec | 0.0 | 16 |
| | | | Recall@5 | 0.0 | 16 |
| | | | Accuracy | 43.56 | 14 |
| | | | F1 | 50.61 | 13 |
| | | | KILT-F1 | 0.0 | 13 |
| Open-Domain Question Answering | KILT: TriviaQA | T5-base | KILT-EM | 0.0 | 9 |
| | | | R-Prec | 0.0 | 13 |
| | | | Recall@5 | 0.0 | 13 |
| | | | EM | 18.11 | 11 |
| | | | F1 | 27.83 | 11 |
| | | | KILT-F1 | 0.0 | 9 |
| Open-Domain Dialog | KILT: Wizard of Wikipedia | T5-base | KILT-RL | 0.0 | 12 |
| | | | R-Prec | 0.0 | 16 |
| | | | Recall@5 | 0.0 | 16 |
| | | | ROUGE-L | 12.4 | 12 |
| | | | F1 | 13.53 | 12 |
| | | | KILT-F1 | 0.0 | 12 |
| Entity Linking | KILT: WNED-CWEB | T5-base | KILT-AC | 49.29 | 3 |
| | | | R-Prec | 49.29 | 4 |
| | | | Recall@5 | 49.29 | 5 |
| | | | Accuracy | 49.29 | 3 |
| Entity Linking | KILT: WNED-WIKI | T5-base | KILT-AC | 47.13 | 4 |
| | | | R-Prec | 47.13 | 6 |
| | | | Recall@5 | 47.13 | 6 |
| | | | Accuracy | 47.13 | 4 |
| Slot Filling | KILT: Zero Shot RE | T5-base | KILT-AC | 0.0 | 13 |
| | | | R-Prec | 0.0 | 17 |
| | | | Recall@5 | 0.0 | 17 |
| | | | Accuracy | 9.02 | 15 |
| | | | F1 | 13.52 | 15 |
| | | | KILT-F1 | 0.0 | 14 |
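A note on the metrics above: R-Prec and Recall@5 score retrieved provenance against the gold Wikipedia pages, while the KILT-prefixed variants (KILT-AC, KILT-EM, KILT-F1, KILT-RL) award the downstream score only on instances whose provenance is perfect, i.e. R-Precision of 1. This gating is consistent with the closed-book T5-base rows scoring 0.0 on every retrieval and KILT-* metric: a model that returns no provenance can never pass the gate, however good its answers. The sketch below illustrates that logic; the function names and normalization are simplified for exposition, not KILT's official scoring code (which is in the repository linked above).

```python
# Illustrative sketch of KILT's gated scoring: the downstream metric counts
# only when provenance is perfect (R-Precision == 1). Simplified; the
# official scorer lives in the KILT repository.

def r_precision(retrieved_pages: list[str], gold_pages: set[str]) -> float:
    """Fraction of the top-R retrieved pages that are gold, with R = |gold_pages|."""
    r = len(gold_pages)
    if r == 0:
        return 0.0
    return sum(page in gold_pages for page in retrieved_pages[:r]) / r

def exact_match(prediction: str, answer: str) -> float:
    # KILT's official EM applies fuller normalization (articles, punctuation, etc.).
    return float(prediction.strip().lower() == answer.strip().lower())

def kilt_em(prediction: str, answer: str,
            retrieved_pages: list[str], gold_pages: set[str]) -> float:
    # KILT-EM: award the EM score only if provenance is perfect.
    if r_precision(retrieved_pages, gold_pages) == 1.0:
        return exact_match(prediction, answer)
    return 0.0

# A correct answer with no provenance still scores 0.0 on KILT-EM,
# mirroring the closed-book T5-base rows in the table:
print(kilt_em("charles darwin", "Charles Darwin",
              [], {"On the Origin of Species"}))  # -> 0.0
```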
