360 dataset results for Question Answering

DART

DART is a large dataset for open-domain structured data record to text generation. DART consists of 82,191 examples across different domains with each input being a semantic RDF triple set derived from data records in tables and the tree ontology of the schema, annotated with sentence descriptions that cover all facts in the triple set.

40 PAPERS • 3 BENCHMARKS

WikiMovies

WikiMovies is a question answering dataset for movie content. It contains ~100k questions in the movie domain, and was designed to be answerable by using either a perfect KB (based on OMDb),

39 PAPERS • NO BENCHMARKS YET

FigureQA

FigureQA is a visual reasoning corpus of over one million question-answer pairs grounded in over 100,000 images. The images are synthetic, scientific-style figures from five classes: line plots, dot-line plots, vertical and horizontal bar graphs, and pie charts.

38 PAPERS • 1 BENCHMARK

InsuranceQA

InsuranceQA is a question answering dataset for the insurance domain, the data stemming from the website Insurance Library. There are 12,889 questions and 21,325 answers in the training set. There are 2,000 questions and 3,354 answers in the validation set. There are 2,000 questions and 3,308 answers in the test set.

38 PAPERS • NO BENCHMARKS YET

MKQA (Multilingual Knowledge Questions and Answers)

Multilingual Knowledge Questions and Answers (MKQA) is an open-domain question answering evaluation set comprising 10k question-answer pairs aligned across 26 typologically diverse languages (260k question-answer pairs in total). The goal of this dataset is to provide a challenging benchmark for question answering quality across a wide set of languages. Answers are based on a language-independent data representation, making results comparable across languages and independent of language-specific passages. With 26 languages, this dataset supplies the widest range of languages to-date for evaluating question answering.

37 PAPERS • NO BENCHMARKS YET

TyDiQA-GoldP

TyDiQA-GoldP is the gold passage version of the Typologically Diverse Question Answering (TyDiQA) dataset, a benchmark for information-seeking question answering, which covers nine languages. The gold passage version is a simplified version of the primary task, which uses only the gold passage as context and excludes unanswerable questions. It is thus similar to XQuAD and MLQA, while being more challenging as questions have been written without seeing the answers, leading to 3× and 2× less lexical overlap compared to XQuAD and MLQA respectively.

37 PAPERS • 1 BENCHMARK

BREAK

Break is a question understanding dataset, aimed at training models to reason over complex questions. It features 83,978 natural language questions, annotated with a new meaning representation, Question Decomposition Meaning Representation (QDMR). Each example has the natural question along with its QDMR representation. Break contains human composed questions, sampled from 10 leading question-answering benchmarks over text, images and databases. This dataset was created by a team of NLP researchers at Tel Aviv University and Allen Institute for AI.

36 PAPERS • NO BENCHMARKS YET

CSQA

Contains around 200K dialogs with a total of 1.6M turns. Further, unlike existing large scale QA datasets which contain simple questions that can be answered from a single tuple, the questions in the dialogs require a larger subgraph of the KG.

35 PAPERS • NO BENCHMARKS YET

MSLR-WEB10K

The MSLR-WEB10K dataset consists of 10,000 search queries over the documents from search results. The data also contains the values of 136 features and a corresponding user-labeled relevance factor on a scale of one to five with respect to each query-document pair. It is a subset of the MSLR-WEB30K dataset.

35 PAPERS • NO BENCHMARKS YET

Doc2Dial (Doc2Dial: Document-grounded Dialogue)

Goal-oriented document-grounded dialogs often involve complex contexts for identifying the most relevant information, which requires better understanding of the inter-relations between conversations and documents. Meanwhile, many online user-oriented documents use both semi-structured and unstructured contents for guiding users to access information of different contexts. Thus, we create a new goal-oriented document-grounded dialogue dataset that captures more diverse scenarios derived from various document contents from multiple domains such as ssa.gov and studentaid.gov. For data collection, we propose a novel pipeline approach for dialogue data construction, which has been adapted and evaluated for several domains.

34 PAPERS • NO BENCHMARKS YET

BLURB (Biomedical Language Understanding and Reasoning Benchmark)

BLURB is a collection of resources for biomedical natural language processing. In general domains such as newswire and the Web, comprehensive benchmarks and leaderboards such as GLUE have greatly accelerated progress in open-domain NLP. In biomedicine, however, such resources are ostensibly scarce. In the past, there have been a plethora of shared tasks in biomedical NLP, such as BioCreative, BioNLP Shared Tasks, SemEval, and BioASQ, to name just a few. These efforts have played a significant role in fueling interest and progress by the research community, but they typically focus on individual tasks. The advent of neural language models such as BERTs provides a unifying foundation to leverage transfer learning from unlabeled text to support a wide range of NLP applications. To accelerate progress in biomedical pretraining strategies and task-specific methods, it is thus imperative to create a broad-coverage benchmark encompassing diverse biomedical tasks.

32 PAPERS • 2 BENCHMARKS

SciREX

SCIREX is a document level IE dataset that encompasses multiple IE tasks, including salient entity identification and document level N-ary relation identification from scientific articles. The dataset is annotated by integrating automatic and human annotations, leveraging existing scientific knowledge resources.

32 PAPERS • 2 BENCHMARKS

TQA (Textbook Question Answering)

The TextbookQuestionAnswering (TQA) dataset is drawn from middle school science curricula. It consists of 1,076 lessons from Life Science, Earth Science and Physical Science textbooks. This includes 26,260 questions, including 12,567 that have an accompanying diagram.

32 PAPERS • 1 BENCHMARK

Worldtree

Worldtree is a corpus of explanation graphs, explanatory role ratings, and associated tablestore. It contains explanation graphs for 1,680 questions, and 4,950 tablestore rows across 62 semi-structured tables are provided. This data is intended to be paired with the AI2 Mercury Licensed questions.

32 PAPERS • NO BENCHMARKS YET

2WikiMultiHopQA

Uses structured and unstructured data. The dataset introduces the evidence information containing a reasoning path for multi-hop questions.

31 PAPERS • NO BENCHMARKS YET

OTT-QA

The Open Table-and-Text Question Answering (OTT-QA) dataset contains open questions which require retrieving tables and text from the web to answer. This dataset is re-annotated from the previous HybridQA dataset. The dataset is collected by UCSB NLP group and issued under MIT license.

31 PAPERS • 1 BENCHMARK

SCROLLS (Standardized CompaRison Over Long Language Sequences)

SCROLLS (Standardized CompaRison Over Long Language Sequences) is an NLP benchmark consisting of a suite of tasks that require reasoning over long texts. SCROLLS contains summarization, question answering, and natural language inference tasks, covering multiple domains, including literature, science, business, and entertainment. The dataset is made available in a unified text-to-text format and hosts a live leaderboard to facilitate research on model architecture and pretraining methods.

31 PAPERS • 1 BENCHMARK

Image Paragraph Captioning

The Image Paragraph Captioning dataset allows researchers to benchmark their progress in generating paragraphs that tell a story about an image. The dataset contains 19,551 images from the Visual Genome dataset. Each image is paired with one paragraph. The training/val/test sets contain 14,575/2,487/2,489 images.

30 PAPERS • 2 BENCHMARKS

DVQA (Data Visualizations via Question Answering)

DVQA is a synthetic question-answering dataset on images of bar-charts.

29 PAPERS • 1 BENCHMARK

PathVQA

PathVQA consists of 32,799 open-ended questions from 4,998 pathology images where each question is manually checked to ensure correctness.

27 PAPERS • 1 BENCHMARK

decaNLP (Natural Language Decathlon Benchmark)

Natural Language Decathlon Benchmark (decaNLP) is a challenge that spans ten tasks: question answering, machine translation, summarization, natural language inference, sentiment analysis, semantic role labeling, zero-shot relation extraction, goal-oriented dialogue, semantic parsing, and commonsense pronoun resolution. The tasks are cast as question answering over a context.

27 PAPERS • NO BENCHMARKS YET

CODAH (COmmonsense Dataset Adversarially-authored by Humans)

The COmmonsense Dataset Adversarially-authored by Humans (CODAH) is an evaluation set for commonsense question-answering in the sentence completion style of SWAG. As opposed to other automatically generated NLI datasets, CODAH is adversarially constructed by humans who can view feedback from a pre-trained model and use this information to design challenging commonsense questions. It contains 2801 questions in total, and uses 5-fold cross validation for evaluation.

26 PAPERS • 2 BENCHMARKS

QuaRTz (QuaRTz Dataset)

QuaRTz is a crowdsourced dataset of 3864 multiple-choice questions about open domain qualitative relationships. Each question is paired with one of 405 different background sentences (sometimes short paragraphs).

26 PAPERS • NO BENCHMARKS YET

VQA-HAT (VQA Human Attention)

VQA-HAT (Human ATtention) is a dataset to evaluate the informative regions of an image depending on the question being asked about it. The dataset consists of human visual attention maps over the images in the original VQA dataset. It contains more than 60k attention maps.

26 PAPERS • NO BENCHMARKS YET

WikiReading

WikiReading is a large-scale natural language understanding task and publicly-available dataset with 18 million instances. The task is to predict textual values from the structured knowledge base Wikidata by reading the text of the corresponding Wikipedia articles. The task contains a rich variety of challenging classification and extraction sub-tasks, making it well-suited for end-to-end models such as deep neural networks (DNNs).

26 PAPERS • NO BENCHMARKS YET

ASNQ (Answer Sentence Natural Questions)

A large scale dataset to enable the transfer step, exploiting the Natural Questions dataset.

25 PAPERS • 1 BENCHMARK

ConvFinQA (Conversational Finance Question Answering)

ConvFinQA is a dataset designed to study the chain of numerical reasoning in conversational question answering. The dataset contains 3,892 conversations with 14,115 questions; 2,715 of the conversations are simple conversations, and the remaining 1,177 are hybrid conversations.

25 PAPERS • 2 BENCHMARKS

MeQSum

MeQSum is a dataset for medical question summarization. It contains 1,000 summarized consumer health questions.

25 PAPERS • 1 BENCHMARK

Molweni

A machine reading comprehension (MRC) dataset with discourse structure built over multiparty dialog. Molweni's source data is sampled from the Ubuntu Chat Corpus and includes 10,000 dialogs comprising 88,303 utterances.

25 PAPERS • 2 BENCHMARKS

PlotQA

PlotQA is a VQA dataset with 28.9 million question-answer pairs grounded over 224,377 plots on data from real-world sources and questions based on crowd-sourced question templates. Existing synthetic datasets (FigureQA, DVQA) for reasoning over plots do not contain variability in data labels, real-valued data, or complex reasoning questions. Consequently, proposed models for these datasets do not fully address the challenge of reasoning over plots. In particular, they assume that the answer comes either from a small fixed size vocabulary or from a bounding box within the image. However, in practice this is an unrealistic assumption because many questions require reasoning and thus have real valued answers which appear neither in a small fixed size vocabulary nor in the image. In this work, we aim to bridge this gap between existing datasets and real world plots by introducing PlotQA. Further, 80.76% of the out-of-vocabulary (OOV) questions in PlotQA have answers that are not in a fixed

25 PAPERS • 5 BENCHMARKS

AdversarialQA

We have created three new Reading Comprehension datasets constructed using an adversarial model-in-the-loop.

24 PAPERS • 2 BENCHMARKS

GenericsKB

The GenericsKB contains 3.4M+ generic sentences about the world, i.e., sentences expressing general truths such as "Dogs bark," and "Trees remove carbon dioxide from the atmosphere." Generics are potentially useful as a knowledge source for AI systems requiring general world knowledge. The GenericsKB is the first large-scale resource containing naturally occurring generic sentences (as opposed to extracted or crowdsourced triples), and is rich in high-quality, general, semantically complete statements. Generics were primarily extracted from three large text sources, namely the Waterloo Corpus, selected parts of Simple Wikipedia, and the ARC Corpus. A filtered, high-quality subset is also available in GenericsKB-Best, containing 1,020,868 sentences.

24 PAPERS • NO BENCHMARKS YET

CaseHOLD (Case Holdings On Legal Decisions)

CaseHOLD (Case Holdings On Legal Decisions) is a law dataset comprising over 53,000 multiple choice questions to identify the relevant holding of a cited case. This dataset presents a fundamental task to lawyers and is both legally meaningful and difficult from an NLP perspective (F1 of 0.4 with a BiLSTM baseline). The citing context from the judicial decision serves as the prompt for the question. The answer choices are holding statements derived from citations following text in a legal decision. There are five answer choices for each citing text. The correct answer is the holding statement that corresponds to the citing text. The four incorrect answers are other holding statements.

23 PAPERS • 2 BENCHMARKS

MathVista (Mathematical Reasoning in Visual Contexts)

MathVista is a consolidated Mathematical reasoning benchmark within Visual contexts. It consists of three newly created datasets, IQTest, FunctionQA, and PaperQA, which address the missing visual domains and are tailored to evaluate logical reasoning on puzzle test figures, algebraic reasoning over functional plots, and scientific reasoning with academic paper figures, respectively. It also incorporates 9 MathQA datasets and 19 VQA datasets from the literature, which significantly enrich the diversity and complexity of visual perception and mathematical reasoning challenges within our benchmark. In total, MathVista includes 6,141 examples collected from 31 different datasets.

23 PAPERS • NO BENCHMARKS YET

RecipeQA

RecipeQA is a dataset for multimodal comprehension of cooking recipes. It consists of over 36K question-answer pairs automatically generated from approximately 20K unique recipes with step-by-step instructions and images. Each question in RecipeQA involves multiple modalities such as titles, descriptions or images, and working towards an answer requires (i) joint understanding of images and text, (ii) capturing the temporal flow of events, and (iii) making sense of procedural knowledge.

23 PAPERS • 1 BENCHMARK

BookTest

BookTest is a new dataset similar to the popular Children’s Book Test (CBT), but more than 60 times larger.

22 PAPERS • NO BENCHMARKS YET

MUSIC-AVQA

The large-scale MUSIC-AVQA dataset of musical performance contains 45,867 question-answer pairs spread across 9,288 videos totaling over 150 hours. All QA pair types are divided into 3 modal scenarios, which contain 9 question types and 33 question templates. Finally, as an open-ended problem of our AVQA tasks, all 42 kinds of answers constitute a set for selection.

22 PAPERS • 1 BENCHMARK

ARCD

Composed of 1,395 questions posed by crowdworkers on Wikipedia articles, and a machine translation of the Stanford Question Answering Dataset (Arabic-SQuAD).

20 PAPERS • NO BENCHMARKS YET

FairytaleQA

FairytaleQA is a dataset focusing on narrative comprehension of kindergarten to eighth-grade students. Annotated by educational experts based on an evidence-based theoretical framework, FairytaleQA consists of 10,580 explicit and implicit questions derived from 278 children-friendly story narratives, covering seven types of narrative elements or relations. It can support narrative Question Generation (QG) and Narrative Question Answering (QA) tasks.

20 PAPERS • 2 BENCHMARKS

MultiDoc2Dial (MultiDoc2Dial: Modeling Dialogues Grounded in Multiple Documents)

MultiDoc2Dial is a new task and dataset on modeling goal-oriented dialogues grounded in multiple documents. Most previous works treat document-grounded dialogue modeling as a machine reading comprehension task based on a single given document or passage. We aim to address more realistic scenarios where a goal-oriented information-seeking conversation involves multiple topics, and hence is grounded on different documents.

20 PAPERS • NO BENCHMARKS YET

WIQA (What-If Question Answering)

The WIQA dataset V1 has 39705 questions containing a perturbation and a possible effect in the context of a paragraph. The dataset is split into 29808 train questions, 6894 dev questions and 3003 test questions.

20 PAPERS • NO BENCHMARKS YET

WebChild

One of the largest commonsense knowledge bases available, describing over 2 million disambiguated concepts and activities, connected by over 18 million assertions.

20 PAPERS • NO BENCHMARKS YET

FQuAD (French Question Answering Dataset)

A French Native Reading Comprehension dataset of questions and answers on a set of Wikipedia articles that consists of 25,000+ samples for the 1.0 version and 60,000+ samples for the 1.1 version.

19 PAPERS • 1 BENCHMARK

PrOntoQA (Proof and Ontology-Generated Question-Answering)

PrOntoQA is a question-answering dataset which generates examples with chains-of-thought that describe the reasoning required to answer the questions correctly. The sentences in the examples are syntactically simple and amenable to semantic parsing. It can be used to formally analyze the predicted chain-of-thought from large language models such as GPT-3.

19 PAPERS • NO BENCHMARKS YET

QuaRel

QuaRel is a crowdsourced dataset of 2771 multiple-choice story questions, including their logical forms.

19 PAPERS • NO BENCHMARKS YET

CLOTH (CLOze test by TeacHers)

The Cloze Test by Teachers (CLOTH) benchmark is a collection of nearly 100,000 4-way multiple-choice cloze-style questions from middle- and high school-level English language exams, where the answer fills a blank in a given text. Each question is labeled with the type of deep reasoning it involves, where the four possible types are grammar, short-term reasoning, matching/paraphrasing, and long-term reasoning, i.e., reasoning over multiple sentences.

18 PAPERS • NO BENCHMARKS YET

CliCR

CliCR is a new dataset for domain-specific reading comprehension, consisting of around 100,000 cloze queries constructed from clinical case reports.

18 PAPERS • 1 BENCHMARK

HeadQA

HeadQA is a multi-choice question answering testbed to encourage research on complex reasoning. The questions come from exams to access a specialized position in the Spanish healthcare system, and are challenging even for highly specialized humans.

18 PAPERS • 1 BENCHMARK
