DART is a large dataset for open-domain structured data record to text generation. DART consists of 82,191 examples across different domains with each input being a semantic RDF triple set derived from data records in tables and the tree ontology of the schema, annotated with sentence descriptions that cover all facts in the triple set.
40 PAPERS • 3 BENCHMARKS
WikiMovies is a dataset for question answering for movies content. It contains ~100k questions in the movie domain, and was designed to be answerable by using either a perfect KB (based on OMDb),
39 PAPERS • NO BENCHMARKS YET
FigureQA is a visual reasoning corpus of over one million question-answer pairs grounded in over 100,000 images. The images are synthetic, scientific-style figures from five classes: line plots, dot-line plots, vertical and horizontal bar graphs, and pie charts.
38 PAPERS • 1 BENCHMARK
InsuranceQA is a question answering dataset for the insurance domain, the data stemming from the website Insurance Library. There are 12,889 questions and 21,325 answers in the training set. There are 2,000 questions and 3,354 answers in the validation set. There are 2,000 questions and 3,308 answers in the test set.
38 PAPERS • NO BENCHMARKS YET
Multilingual Knowledge Questions and Answers (MKQA) is an open-domain question answering evaluation set comprising 10k question-answer pairs aligned across 26 typologically diverse languages (260k question-answer pairs in total). The goal of this dataset is to provide a challenging benchmark for question answering quality across a wide set of languages. Answers are based on a language-independent data representation, making results comparable across languages and independent of language-specific passages. With 26 languages, this dataset supplies the widest range of languages to-date for evaluating question answering.
37 PAPERS • NO BENCHMARKS YET
TyDiQA is the gold passage version of the Typologically Diverse Question Answering (TyDiWA) dataset, a benchmark for information-seeking question answering, which covers nine languages. The gold passage version is a simplified version of the primary task, which uses only the gold passage as context and excludes unanswerable questions. It is thus similar to XQuAD and MLQA, while being more challenging as questions have been written without seeing the answers, leading to 3× and 2× less lexical overlap compared to XQuAD and MLQA respectively.
37 PAPERS • 1 BENCHMARK
Break is a question understanding dataset, aimed at training models to reason over complex questions. It features 83,978 natural language questions, annotated with a new meaning representation, Question Decomposition Meaning Representation (QDMR). Each example has the natural question along with its QDMR representation. Break contains human composed questions, sampled from 10 leading question-answering benchmarks over text, images and databases. This dataset was created by a team of NLP researchers at Tel Aviv University and Allen Institute for AI.
36 PAPERS • NO BENCHMARKS YET
Contains around 200K dialogs with a total of 1.6M turns. Further, unlike existing large scale QA datasets which contain simple questions that can be answered from a single tuple, the questions in the dialogs require a larger subgraph of the KG.
35 PAPERS • NO BENCHMARKS YET
The MSLR-WEB10K dataset consists of 10,000 search queries over the documents from search results. The data also contains the values of 136 features and a corresponding user-labeled relevance factor on a scale of one to five with respect to each query-document pair. It is a subset of the MSLR-WEB30K dataset.
For goal-oriented document-grounded dialogs, it often involves complex contexts for identifying the most relevant information, which requires better understanding of the inter-relations between conversations and documents. Meanwhile, many online user-oriented documents use both semi-structured and unstructured contents for guiding users to access information of different contexts. Thus, we create a new goal-oriented document-grounded dialogue dataset that captures more diverse scenarios derived from various document contents from multiple domains such ssa.gov and studentaid.gov. For data collection, we propose a novel pipeline approach for dialogue data construction, which has been adapted and evaluated for several domains.
34 PAPERS • NO BENCHMARKS YET
BLURB is a collection of resources for biomedical natural language processing. In general domains such as newswire and the Web, comprehensive benchmarks and leaderboards such as GLUE have greatly accelerated progress in open-domain NLP. In biomedicine, however, such resources are ostensibly scarce. In the past, there have been a plethora of shared tasks in biomedical NLP, such as BioCreative, BioNLP Shared Tasks, SemEval, and BioASQ, to name just a few. These efforts have played a significant role in fueling interest and progress by the research community, but they typically focus on individual tasks. The advent of neural language models such as BERTs provides a unifying foundation to leverage transfer learning from unlabeled text to support a wide range of NLP applications. To accelerate progress in biomedical pretraining strategies and task-specific methods, it is thus imperative to create a broad-coverage benchmark encompassing diverse biomedical tasks.
32 PAPERS • 2 BENCHMARKS
SCIREX is a document level IE dataset that encompasses multiple IE tasks, including salient entity identification and document level N-ary relation identification from scientific articles. The dataset is annotated by integrating automatic and human annotations, leveraging existing scientific knowledge resources.
The TextbookQuestionAnswering (TQA) dataset is drawn from middle school science curricula. It consists of 1,076 lessons from Life Science, Earth Science and Physical Science textbooks. This includes 26,260 questions, including 12,567 that have an accompanying diagram.
32 PAPERS • 1 BENCHMARK
Worldtree is a corpus of explanation graphs, explanatory role ratings, and associated tablestore. It contains explanation graphs for 1,680 questions, and 4,950 tablestore rows across 62 semi-structured tables are provided. This data is intended to be paired with the AI2 Mercury Licensed questions.
32 PAPERS • NO BENCHMARKS YET
Uses structured and unstructured data. The dataset introduces the evidence information containing a reasoning path for multi-hop questions.
31 PAPERS • NO BENCHMARKS YET
The Open Table-and-Text Question Answering (OTT-QA) dataset contains open questions which require retrieving tables and text from the web to answer. This dataset is re-annotated from the previous HybridQA dataset. The dataset is collected by UCSB NLP group and issued under MIT license.
31 PAPERS • 1 BENCHMARK
SCROLLS (Standardized CompaRison Over Long Language Sequences) is an NLP benchmark consisting of a suite of tasks that require reasoning over long texts. SCROLLS contains summarization, question answering, and natural language inference tasks, covering multiple domains, including literature, science, business, and entertainment. The dataset is made available in a unified text-to-text format and host a live leaderboard to facilitate research on model architecture and pretraining methods.
The Image Paragraph Captioning dataset allows researchers to benchmark their progress in generating paragraphs that tell a story about an image. The dataset contains 19,561 images from the Visual Genome dataset. Each image contains one paragraph. The training/val/test sets contains 14,575/2,487/2,489 images.
30 PAPERS • 2 BENCHMARKS
DVQA is a synthetic question-answering dataset on images of bar-charts.
29 PAPERS • 1 BENCHMARK
PathVQA consists of 32,799 open-ended questions from 4,998 pathology images where each question is manually checked to ensure correctness.
27 PAPERS • 1 BENCHMARK
Natural Language Decathlon Benchmark (decaNLP) is a challenge that spans ten tasks: question answering, machine translation, summarization, natural language inference, sentiment analysis, semantic role labeling, zero-shot relation extraction, goal-oriented dialogue, semantic parsing, and commonsense pronoun resolution. The tasks as cast as question answering over a context.
27 PAPERS • NO BENCHMARKS YET
The COmmonsense Dataset Adversarially-authored by Humans (CODAH) is an evaluation set for commonsense question-answering in the sentence completion style of SWAG. As opposed to other automatically generated NLI datasets, CODAH is adversarially constructed by humans who can view feedback from a pre-trained model and use this information to design challenging commonsense questions. It contains 2801 questions in total, and uses 5-fold cross validation for evaluation.
26 PAPERS • 2 BENCHMARKS
QuaRTz is a crowdsourced dataset of 3864 multiple-choice questions about open domain qualitative relationships. Each question is paired with one of 405 different background sentences (sometimes short paragraphs).
26 PAPERS • NO BENCHMARKS YET
VQA-HAT (Human ATtention) is a dataset to evaluate the informative regions of an image depending on the question being asked about it. The dataset consists of human visual attention maps over the images in the original VQA dataset. It contains more than 60k attention maps.
WikiReading is a large-scale natural language understanding task and publicly-available dataset with 18 million instances. The task is to predict textual values from the structured knowledge base Wikidata by reading the text of the corresponding Wikipedia articles. The task contains a rich variety of challenging classification and extraction sub-tasks, making it well-suited for end-to-end models such as deep neural networks (DNNs).
A large scale dataset to enable the transfer step, exploiting the Natural Questions dataset.
25 PAPERS • 1 BENCHMARK
ConvFinQA is a dataset designed to study the chain of numerical reasoning in conversational question answering. The dataset contains 3892 conversations containing 14115 questions where 2715 of the conversations are simple conversations, and the rest 1,177 are hybrid conversations.
25 PAPERS • 2 BENCHMARKS
MeQSum is a dataset for medical question summarization. It contains 1,000 summarized consumer health questions.
A machine reading comprehension (MRC) dataset with discourse structure built over multiparty dialog. Molweni's source samples from the Ubuntu Chat Corpus, including 10,000 dialogs comprising 88,303 utterances.
PlotQA is a VQA dataset with 28.9 million question-answer pairs grounded over 224,377 plots on data from real-world sources and questions based on crowd-sourced question templates. Existing synthetic datasets (FigureQA, DVQA) for reasoning over plots do not contain variability in data labels, real-valued data, or complex reasoning questions. Consequently, proposed models for these datasets do not fully address the challenge of reasoning over plots. In particular, they assume that the answer comes either from a small fixed size vocabulary or from a bounding box within the image. However, in practice this is an unrealistic assumption because many questions require reasoning and thus have real valued answers which appear neither in a small fixed size vocabulary nor in the image. In this work, we aim to bridge this gap between existing datasets and real world plots by introducing PlotQA. Further, 80.76% of the out-of-vocabulary (OOV) questions in PlotQA have answers that are not in a fixed
25 PAPERS • 5 BENCHMARKS
We have created three new Reading Comprehension datasets constructed using an adversarial model-in-the-loop.
24 PAPERS • 2 BENCHMARKS
The GenericsKB contains 3.4M+ generic sentences about the world, i.e., sentences expressing general truths such as "Dogs bark," and "Trees remove carbon dioxide from the atmosphere." Generics are potentially useful as a knowledge source for AI systems requiring general world knowledge. The GenericsKB is the first large-scale resource containing naturally occurring generic sentences (as opposed to extracted or crowdsourced triples), and is rich in high-quality, general, semantically complete statements. Generics were primarily extracted from three large text sources, namely the Waterloo Corpus, selected parts of Simple Wikipedia, and the ARC Corpus. A filtered, high-quality subset is also available in GenericsKB-Best, containing 1,020,868 sentences.
24 PAPERS • NO BENCHMARKS YET
CaseHOLD (Case Holdings On Legal Decisions) is a law dataset comprised of over 53,000+ multiple choice questions to identify the relevant holding of a cited case. This dataset presents a fundamental task to lawyers and is both legally meaningful and difficult from an NLP perspective (F1 of 0.4 with a BiLSTM baseline). The citing context from the judicial decision serves as the prompt for the question. The answer choices are holding statements derived from citations following text in a legal decision. There are five answer choices for each citing text. The correct answer is the holding statement that corresponds to the citing text. The four incorrect answers are other holding statements.
23 PAPERS • 2 BENCHMARKS
MathVista is a consolidated Mathematical reasoning benchmark within Visual contexts. It consists of three newly created datasets, IQTest, FunctionQA, and PaperQA, which address the missing visual domains and are tailored to evaluate logical reasoning on puzzle test figures, algebraic reasoning over functional plots, and scientific reasoning with academic paper figures, respectively. It also incorporates 9 MathQA datasets and 19 VQA datasets from the literature, which significantly enrich the diversity and complexity of visual perception and mathematical reasoning challenges within our benchmark. In total, MathVista includes 6,141 examples collected from 31 different datasets.
23 PAPERS • NO BENCHMARKS YET
RecipeQA is a dataset for multimodal comprehension of cooking recipes. It consists of over 36K question-answer pairs automatically generated from approximately 20K unique recipes with step-by-step instructions and images. Each question in RecipeQA involves multiple modalities such as titles, descriptions or images, and working towards an answer requires (i) joint understanding of images and text, (ii) capturing the temporal flow of events, and (iii) making sense of procedural knowledge.
23 PAPERS • 1 BENCHMARK
BookTest is a new dataset similar to the popular Children’s Book Test (CBT), however more than 60 times larger.
22 PAPERS • NO BENCHMARKS YET
The large-scale MUSIC-AVQA dataset of musical performance contains 45,867 question-answer pairs, distributed in 9,288 videos for over 150 hours. All QA pairs types are divided into 3 modal scenarios, which contain 9 question types and 33 question templates. Finally, as an open-ended problem of our AVQA tasks, all 42 kinds of answers constitute a set for selection.
22 PAPERS • 1 BENCHMARK
Composed of 1,395 questions posed by crowdworkers on Wikipedia articles, and a machine translation of the Stanford Question Answering Dataset (Arabic-SQuAD).
20 PAPERS • NO BENCHMARKS YET
FairytaleQA is a dataset focusing on narrative comprehension of kindergarten to eighth-grade students. Annotated by educational experts based on an evidence-based theoretical framework, FairytaleQA consists of 10,580 explicit and implicit questions derived from 278 children-friendly story narratives, covering seven types of narrative elements or relations. It can support narrative Question Generation (QG) and Narrative Question Answering (QA) tasks.
20 PAPERS • 2 BENCHMARKS
MultiDoc2Dial is a new task and dataset on modeling goal-oriented dialogues grounded in multiple documents. Most previous works treat document-grounded dialogue modeling as a machine reading comprehension task based on a single given document or passage. We aim to address more realistic scenarios where a goal-oriented information-seeking conversation involves multiple topics, and hence is grounded on different documents.
The WIQA dataset V1 has 39705 questions containing a perturbation and a possible effect in the context of a paragraph. The dataset is split into 29808 train questions, 6894 dev questions and 3003 test questions.
One of the largest commonsense knowledge bases available, describing over 2 million disambiguated concepts and activities, connected by over 18 million assertions.
A French Native Reading Comprehension dataset of questions and answers on a set of Wikipedia articles that consists of 25,000+ samples for the 1.0 version and 60,000+ samples for the 1.1 version.
19 PAPERS • 1 BENCHMARK
PrOntoQA is a question-answering dataset which generates examples with chains-of-thought that describe the reasoning required to answer the questions correctly. The sentences in the examples are syntactically simple and amenable to semantic parsing. It can be used to formally analyze the predicted chain-of-thought from large language models such as GPT-3.
19 PAPERS • NO BENCHMARKS YET
QuaRel is a crowdsourced dataset of 2771 multiple-choice story questions, including their logical forms.
The Cloze Test by Teachers (CLOTH) benchmark is a collection of nearly 100,000 4-way multiple-choice cloze-style questions from middle- and high school-level English language exams, where the answer fills a blank in a given text. Each question is labeled with a type of deep reasoning it involves, where the four possible types are grammar, short-term reasoning, matching/paraphrasing, and long-term reasoning, i.e., reasoning over multiple sentences
18 PAPERS • NO BENCHMARKS YET
CliCR is a new dataset for domain specific reading comprehension used to construct around 100,000 cloze queries from clinical case reports.
18 PAPERS • 1 BENCHMARK
HeadQA is a multi-choice question answering testbed to encourage research on complex reasoning. The questions come from exams to access a specialized position in the Spanish healthcare system, and are challenging even for highly specialized humans.