The DramaQA focuses on two perspectives: 1) Hierarchical QAs as an evaluation metric based on the cognitive developmental stages of human intelligence. 2) Character-centered video annotations to model local coherence of the story. The dataset is built upon the TV drama "Another Miss Oh" and it contains 17,983 QA pairs from 23,928 various length video clips, with each QA pair belonging to one of four difficulty levels.
10 PAPERS • 3 BENCHMARKS
FM-IQA is a question-answering dataset containing over 150,000 images and 310,000 freestyle Chinese question-answer pairs and their English translations.
10 PAPERS • NO BENCHMARKS YET
GooAQ is a large-scale dataset with a variety of answer types. This dataset contains over 5 million questions and 3 million answers collected from Google. GooAQ questions are collected semi-automatically from the Google search engine using its autocomplete feature. This results in naturalistic questions of practical interest that are nonetheless short and expressed using simple language. GooAQ answers are mined from Google's responses to the collected questions, specifically from the answer boxes in the search results. This yields a rich space of answer types, containing both textual answers (short and long) as well as more structured ones such as collections.
A challenging new benchmark for language-agnostic answer retrieval from a multilingual candidate pool.
MINTAKA is a complex, natural, and multilingual dataset designed for experimenting with end-to-end question-answering models. It is composed of 20,000 question-answer pairs collected in English, annotated with Wikidata entities, and translated into Arabic, French, German, Hindi, Italian, Japanese, Portuguese, and Spanish for a total of 180,000 samples. Mintaka includes 8 types of complex questions, including superlative, intersection, and multi-hop questions, which were naturally elicited from crowd workers.
Retrieval Question-Answering (ReQA) benchmark tests a model’s ability to retrieve relevant answers efficiently from a large set of documents.
We propose the first question-answering dataset driven by STEM theorems. We annotated 800 QA pairs covering 350+ theorems spanning across Math, EE&CS, Physics and Finance. The dataset is collected by human experts with very high quality. We provide the dataset as a new benchmark to test the limit of large language models to apply theorems to solve challenging university-level questions. We provide a pipeline in the following to prompt LLMs and evaluate their outputs with WolframAlpha.
10 PAPERS • 1 BENCHMARK
BeaverTails is a dataset aimed at fostering research on safety alignment in large language models (LLMs). This dataset uniquely separates annotations of helpfulness and harmlessness for question-answering pairs, thus offering distinct perspectives on these crucial attributes. In total, the authors have compiled safety meta-labels for 30,207 question-answer (QA) pairs and gathered 30,144 pairs of expert comparison data for both
9 PAPERS • NO BENCHMARKS YET
CQASUMM is a dataset for CQA (Community Question Answering) summarization, constructed from the 4.4 million Yahoo! Answers L6 dataset. The dataset contains ~300k annotated samples.
ComQA is a large dataset of real user questions that exhibit different challenging aspects such as compositionality, temporal reasoning, and comparisons. ComQA questions come from the WikiAnswers community QA platform, which typically contains questions that are not satisfactorily answerable by existing search engine technology.
Amazon Mechanical Turk (AMT) is used to collect annotations on HowTo100M videos. 30k 60-second clips are randomly sampled from 9,421 videos and present each clip to the turkers, who are asked to select a video segment containing a single, self-contained scene. After this segment selection step, another group of workers are asked to write descriptions for each displayed segment. Narrations are not provided to the workers to ensure that their written queries are based on visual content only. These final video segments are 10-20 seconds long on average, and the length of queries ranges from 8 to 20 words. From this process, 51,390 queries are collected for 24k 60-second clips from 9,371 videos in HowTo100M, on average 2-3 queries per clip. The video clips and its associated queries are split into 80% train, 10% val and 10% test.
JEC-QA is a LQA (Legal Question Answering) dataset collected from the National Judicial Examination of China. It contains 26,365 multiple-choice and multiple-answer questions in total. The task of the dataset is to predict the answer using the questions and relevant articles. To do well on JEC-QA, both retrieving and answering are important.
The PROST (Physical Reasoning about Objects Through Space and Time) dataset contains 18,736 multiple-choice questions made from 14 manually curated templates, covering 10 physical reasoning concepts. All questions are designed to probe both causal and masked language models in a zero-shot setting.
9 PAPERS • 1 BENCHMARK
ProtoQA is a question answering dataset for training and evaluating common sense reasoning capabilities of artificial intelligence systems in such prototypical situations. The training set is gathered from an existing set of questions played in a long-running international game show FAMILY- FEUD. The hidden evaluation set is created by gathering answers for each question from 100 crowd-workers.
SelQA is a dataset that consists of questions generated through crowdsourcing and sentence length answers that are drawn from the ten most prevalent topics in the English Wikipedia.
A corpus that encompasses the complete history of conversations between contributors to Wikipedia, one of the largest online collaborative communities. By recording the intermediate states of conversations---including not only comments and replies, but also their modifications, deletions and restorations---this data offers an unprecedented view of online conversation.
The ability to recognize analogies is fundamental to human cognition. Existing benchmarks to test word analogy do not reveal the underneath process of analogical reasoning of neural models.
8 PAPERS • NO BENCHMARKS YET
KnowIT VQA is a video dataset with 24,282 human-generated question-answer pairs about The Big Bang Theory. The dataset combines visual, textual and temporal coherence reasoning together with knowledge-based questions, which need of the experience obtained from the viewing of the series to be answered.
8 PAPERS • 1 BENCHMARK
OPIEC is an Open Information Extraction (OIE) corpus, constructed from the entire English Wikipedia. It containing more than 341M triples. Each triple from the corpus is composed of rich meta-data: each token from the subj / obj / rel along with NLP annotations (POS tag, NER tag, ...), provenance sentence (along with its dependency parse, sentence order relative to the article), original (golden) links contained in the Wikipedia articles, space / time.
Peyma is a Persian NER dataset to train and test NER systems. It is constructed by collecting documents from ten news websites.
QUASAR-S is a large-scale dataset aimed at evaluating systems designed to comprehend a natural language query and extract its answer from a large corpus of text. It consists of 37,362 cloze-style (fill-in-the-gap) queries constructed from definitions of software entity tags on the popular website Stack Overflow. The posts and comments on the website serve as the background corpus for answering the cloze questions. The answer to each question is restricted to be another software entity, from an output vocabulary of 4874 entities.
Consists of multiple sentences whose clues are arranged by difficulty (from obscure to obvious) and uniquely identify a well-known entity such as those found on Wikipedia.
SpartQA is a textual question answering benchmark for spatial reasoning on natural language text which contains more realistic spatial phenomena not covered by prior datasets and that is challenging for state-of-the-art language models (LM).
We take advantage of the ground truth of NLVR images, design CFGs to generate stories, and use spatial reasoning rules to ask and answer spatial reasoning questions. This automatically generated data is called SpaRTQA. https://aclanthology.org/2021.naacl-main.364/
Provides four new test sets for the Stanford Question Answering Dataset (SQuAD) and evaluate the ability of question-answering systems to generalize to new data.
8 PAPERS • 4 BENCHMARKS
A Benchmark for Robust Multi-Hop Spatial Reasoning in Texts
A configurable visual question and answer dataset (COG) to parallel experiments in humans and animals. COG is much simpler than the general problem of video analysis, yet it addresses many of the problems relating to visual and logical reasoning and memory -- problems that remain challenging for modern deep learning architectures.
7 PAPERS • NO BENCHMARKS YET
DIOR-RSVG is a large-scale benchmark dataset of remote sensing data (RSVG). It aims to localize the referred objects in remote sensing (RS) images with the guidance of natural language. This new dataset includes image/expression/box triplets for training and evaluating visual grounding models.
DUDE is formulated as an instance of Document Question Answering (DocQA) to evaluate how well current solutions deal with multi-page documents, if they can navigate and reason over the layout, and if they can generalize these skills to different document types and domains. Since we cannot provide question-answer pairs about, e.g., ticked checkboxes, on each document instance or document type, the challenge presented by DUDE is characterized equally as a Multi-Domain Long-Tailed Recognition problem
DaNetQA is a question answering dataset for yes/no questions. These questions are naturally occurring ---they are generated in unprompted and unconstrained settings.
7 PAPERS • 1 BENCHMARK
EgoTask QA benchmark contains 40K balanced question-answer pairs selected from 368K programmatically generated questions generated over 2K egocentric videos. It provides a single home for the crucial dimensions of task understanding through question-answering on real-world egocentric videos.
JGLUE, Japanese General Language Understanding Evaluation, is built to measure the general NLU ability in Japanese.
RadQA is a radiology question answering dataset with 3074 questions posed against radiology reports and annotated with their corresponding answer spans (resulting in a total of 6148 question-answer evidence pairs) by physicians. The questions are manually created using the clinical referral section of the reports that take into account the actual information needs of ordering physicians and eliminate bias from seeing the answer context (and, further, organically create unanswerable questions). The answer spans are marked within the Findings and Impressions sections of a report. The dataset aims to satisfy the complex clinical requirements by including complete (yet concise) answer phrases (which are not just entities) that can span multiple lines.
A large scale analogue of Stanford SQuAD in the Russian language - is a valuable resource that has not been properly presented to the scientific community.
SubjQA is a question answering dataset that focuses on subjective (as opposed to factual) questions and answers. The dataset consists of roughly 10,000 questions over reviews from 6 different domains: books, movies, grocery, electronics, TripAdvisor (i.e. hotels), and restaurants. Each question is paired with a review and a span is highlighted as the answer to the question (with some questions having no answer). Moreover, both questions and answer spans are assigned a subjectivity label by annotators. Questions such as "How much does this product weigh?" is a factual question (i.e., low subjectivity), while "Is this easy to use?" is a subjective question (i.e., high subjectivity).
ToolQA is a question answering benchmark for Large Language Models (LLMs) which is designed to faithfully evaluate LLMs' ability to use external tools for question answering. The development of ToolQA involved a scalable, automated process for dataset curation, along with 13 specialized tools designed for interaction with external knowledge in order to answer questions.
The VNHSGE (VietNamese High School Graduation Examination) dataset, developed exclusively for evaluating large language models (LLMs), is introduced in this article. The dataset, which covers nine subjects, was generated from the Vietnamese National High School Graduation Examination and comparable tests. 300 literary essays have been included, and there are over 19,000 multiple-choice questions on a range of topics. The dataset assesses LLMs in multitasking situations such as question answering, text generation, reading comprehension, visual question answering, and more by including both textual data and accompanying images. Using ChatGPT and BingChat, we evaluated LLMs on the VNHSGE dataset and contrasted their performance with that of Vietnamese students to see how well they performed. The results show that ChatGPT and BingChat both perform at a human level in a number of areas, including literature, English, history, geography, and civics education. They still have space to grow, t
7 PAPERS • 9 BENCHMARKS
The question-answer (QA) pairs are automatically generated using state-of-the-art question generation methods based on paintings and comments provided in an existing art understanding dataset. The QA pairs are cleansed by crowdsourcing workers with respect to their grammatical correctness, answerability, and answers' correctness. The dataset inherently consists of visual (painting-based) and knowledge (comment-based) questions.
6 PAPERS • NO BENCHMARKS YET
In this work we create a question answering dataset over the DBLP scholarly knowledge graph (KG). DBLP is an on-line reference for bibliographic information on major computer science publications that indexes over 4.4 million publications, published by more than 2.2 million authors. Our dataset consists of 10,000 question answer pairs with the corresponding SPARQL queries which can be executed over the DBLP KG to fetch the correct answer. To the best of our knowledge, this is the first QA dataset for scholarly KGs.
This paper introduces FrenchMedMCQA, the first publicly available Multiple-Choice Question Answering (MCQA) dataset in French for medical domain. It is composed of 3,105 questions taken from real exams of the French medical specialization diploma in pharmacy, mixing single and multiple answers. Each instance of the dataset contains an identifier, a question, five possible answers and their manual correction(s). We also propose first baseline models to automatically process this MCQA task in order to report on the current performances and to highlight the difficulty of the task. A detailed analysis of the results showed that it is necessary to have representations adapted to the medical domain or to the MCQA task: in our case, English specialized models yielded better results than generic French ones, even though FrenchMedMCQA is in French. Corpus, models and tools are available online.
6 PAPERS • 2 BENCHMARKS
IQUAD is a dataset for Visual Question Answering in interactive environments. It is built upon AI2-THOR, a simulated photo-realistic environment of configurable indoor scenes with interactive object. IQUAD V1 has 75,000 questions, each paired with a unique scene configuration.
A quantitative benchmark for developing and understanding video of fill-in-the-blank question-answering dataset with over 300,000 examples, based on descriptive video annotations for the visually impaired.
We present a reading comprehension challenge in which questions can only be answered by taking into account information from multiple sentences. The dataset is the first to study multi-sentence inference at scale, with an open-ended set of question types that requires reasoning skills.
6 PAPERS • 1 BENCHMARK
RELX is a benchmark dataset for cross-lingual relation classification in English, French, German, Spanish and Turkish.
XQA is a data which consists of a total amount of 90k question-answer pairs in nine languages for cross-lingual open-domain question answering.
AI2D-RST is a multimodal corpus of 1000 English-language diagrams that represent topics in primary school natural sciences, such as food webs, life cycles, moon phases and human physiology. The corpus is based on the Allen Institute for Artificial Intelligence Diagrams (AI2D) dataset, a collection of diagrams with crowd-sourced descriptions, which was originally developed to support research on automatic diagram understanding and visual question answering.
5 PAPERS • NO BENCHMARKS YET
GermanQuAD is a Question Answering (QA) dataset of 13,722 extractive question/answer pairs in German.
KAMEL comprises knowledge about 234 relations from Wikidata with a large training, validation, and test dataset. We make sure that all facts are also present in Wikipedia so that they have been seen during the pre-training procedure of the LMs we are probing. Most importantly we overcome the limitations of existing probing datasets by (1) having a larger variety of knowledge graph relations, (2) it contains single- and multi-token entities, (3) we use relations with literals, and (4) have alternative labels for entities. (5) Furthermore, we created an evaluation procedure for higher cardinality relations, which was missing in previous works, and (6) make sure that the dataset can be used for causal LMs.
5 PAPERS • 1 BENCHMARK