The KLEJ benchmark (Kompleksowa Lista Ewaluacji Językowych) is a set of nine evaluation tasks for the Polish language understanding task.
12 PAPERS • NO BENCHMARKS YET
GooAQ is a large-scale dataset with a variety of answer types. This dataset contains over 5 million questions and 3 million answers collected from Google. GooAQ questions are collected semi-automatically from the Google search engine using its autocomplete feature. This results in naturalistic questions of practical interest that are nonetheless short and expressed using simple language. GooAQ answers are mined from Google's responses to the collected questions, specifically from the answer boxes in the search results. This yields a rich space of answer types, containing both textual answers (short and long) as well as more structured ones such as collections.
11 PAPERS • NO BENCHMARKS YET
The first summarization collection containing question-driven summaries of answers to consumer health questions. This dataset can be used to evaluate single or multi-document summaries generated by algorithms using extractive or abstractive approaches.
QAMPARI is an ODQA benchmark, where question answers are lists of entities, spread across many paragraphs. It was created by (a) generating questions with multiple answers from Wikipedia's knowledge graph and tables, (b) automatically pairing answers with supporting evidence in Wikipedia paragraphs, and (c) manually paraphrasing questions and validating each answer.
CLEVR-Dialog is a large diagnostic dataset for studying multi-round reasoning in visual dialog. Specifically, that authors construct a dialog grammar that is grounded in the scene graphs of the images from the CLEVR dataset. This combination results in a dataset where all aspects of the visual dialog are fully annotated. In total, CLEVR-Dialog contains 5 instances of 10-round dialogs for about 85k CLEVR images, totaling to 4.25M question-answer pairs.
10 PAPERS • NO BENCHMARKS YET
ConditionalQA is a Question Answering (QA) dataset that contains complex questions with conditional answers, i.e. the answers are only applicable when certain conditions apply.
10 PAPERS • 1 BENCHMARK
FM-IQA is a question-answering dataset containing over 150,000 images and 310,000 freestyle Chinese question-answer pairs and their English translations.
GeoQA is a dataset for automatic geometric problem solving containing 5,010 geometric problems with corresponding annotated programs, which illustrate the solving process of the given problems
MINTAKA is a complex, natural, and multilingual dataset designed for experimenting with end-to-end question-answering models. It is composed of 20,000 question-answer pairs collected in English, annotated with Wikidata entities, and translated into Arabic, French, German, Hindi, Italian, Japanese, Portuguese, and Spanish for a total of 180,000 samples. Mintaka includes 8 types of complex questions, including superlative, intersection, and multi-hop questions, which were naturally elicited from crowd workers.
Retrieval Question-Answering (ReQA) benchmark tests a model’s ability to retrieve relevant answers efficiently from a large set of documents.
We propose the first question-answering dataset driven by STEM theorems. We annotated 800 QA pairs covering 350+ theorems spanning across Math, EE&CS, Physics and Finance. The dataset is collected by human experts with very high quality. We provide the dataset as a new benchmark to test the limit of large language models to apply theorems to solve challenging university-level questions. We provide a pipeline in the following to prompt LLMs and evaluate their outputs with WolframAlpha.
BeaverTails is a dataset aimed at fostering research on safety alignment in large language models (LLMs). This dataset uniquely separates annotations of helpfulness and harmlessness for question-answering pairs, thus offering distinct perspectives on these crucial attributes. In total, the authors have compiled safety meta-labels for 30,207 question-answer (QA) pairs and gathered 30,144 pairs of expert comparison data for both
9 PAPERS • NO BENCHMARKS YET
CQASUMM is a dataset for CQA (Community Question Answering) summarization, constructed from the 4.4 million Yahoo! Answers L6 dataset. The dataset contains ~300k annotated samples.
ComQA is a large dataset of real user questions that exhibit different challenging aspects such as compositionality, temporal reasoning, and comparisons. ComQA questions come from the WikiAnswers community QA platform, which typically contains questions that are not satisfactorily answerable by existing search engine technology.
Amazon Mechanical Turk (AMT) is used to collect annotations on HowTo100M videos. 30k 60-second clips are randomly sampled from 9,421 videos and present each clip to the turkers, who are asked to select a video segment containing a single, self-contained scene. After this segment selection step, another group of workers are asked to write descriptions for each displayed segment. Narrations are not provided to the workers to ensure that their written queries are based on visual content only. These final video segments are 10-20 seconds long on average, and the length of queries ranges from 8 to 20 words. From this process, 51,390 queries are collected for 24k 60-second clips from 9,371 videos in HowTo100M, on average 2-3 queries per clip. The video clips and its associated queries are split into 80% train, 10% val and 10% test.
JEC-QA is a LQA (Legal Question Answering) dataset collected from the National Judicial Examination of China. It contains 26,365 multiple-choice and multiple-answer questions in total. The task of the dataset is to predict the answer using the questions and relevant articles. To do well on JEC-QA, both retrieving and answering are important.
The PROST (Physical Reasoning about Objects Through Space and Time) dataset contains 18,736 multiple-choice questions made from 14 manually curated templates, covering 10 physical reasoning concepts. All questions are designed to probe both causal and masked language models in a zero-shot setting.
ProtoQA is a question answering dataset for training and evaluating common sense reasoning capabilities of artificial intelligence systems in such prototypical situations. The training set is gathered from an existing set of questions played in a long-running international game show FAMILY- FEUD. The hidden evaluation set is created by gathering answers for each question from 100 crowd-workers.
SelQA is a dataset that consists of questions generated through crowdsourcing and sentence length answers that are drawn from the ten most prevalent topics in the English Wikipedia.
A corpus that encompasses the complete history of conversations between contributors to Wikipedia, one of the largest online collaborative communities. By recording the intermediate states of conversations---including not only comments and replies, but also their modifications, deletions and restorations---this data offers an unprecedented view of online conversation.
The ability to recognize analogies is fundamental to human cognition. Existing benchmarks to test word analogy do not reveal the underneath process of analogical reasoning of neural models.
8 PAPERS • NO BENCHMARKS YET
KnowIT VQA is a video dataset with 24,282 human-generated question-answer pairs about The Big Bang Theory. The dataset combines visual, textual and temporal coherence reasoning together with knowledge-based questions, which need of the experience obtained from the viewing of the series to be answered.
8 PAPERS • 1 BENCHMARK
OPIEC is an Open Information Extraction (OIE) corpus, constructed from the entire English Wikipedia. It containing more than 341M triples. Each triple from the corpus is composed of rich meta-data: each token from the subj / obj / rel along with NLP annotations (POS tag, NER tag, ...), provenance sentence (along with its dependency parse, sentence order relative to the article), original (golden) links contained in the Wikipedia articles, space / time.
Peyma is a Persian NER dataset to train and test NER systems. It is constructed by collecting documents from ten news websites.
QUASAR-S is a large-scale dataset aimed at evaluating systems designed to comprehend a natural language query and extract its answer from a large corpus of text. It consists of 37,362 cloze-style (fill-in-the-gap) queries constructed from definitions of software entity tags on the popular website Stack Overflow. The posts and comments on the website serve as the background corpus for answering the cloze questions. The answer to each question is restricted to be another software entity, from an output vocabulary of 4874 entities.
Consists of multiple sentences whose clues are arranged by difficulty (from obscure to obvious) and uniquely identify a well-known entity such as those found on Wikipedia.
SpartQA is a textual question answering benchmark for spatial reasoning on natural language text which contains more realistic spatial phenomena not covered by prior datasets and that is challenging for state-of-the-art language models (LM).
We take advantage of the ground truth of NLVR images, design CFGs to generate stories, and use spatial reasoning rules to ask and answer spatial reasoning questions. This automatically generated data is called SpaRTQA. https://aclanthology.org/2021.naacl-main.364/
A Benchmark for Robust Multi-Hop Spatial Reasoning in Texts
ToolQA is a question answering benchmark for Large Language Models (LLMs) which is designed to faithfully evaluate LLMs' ability to use external tools for question answering. The development of ToolQA involved a scalable, automated process for dataset curation, along with 13 specialized tools designed for interaction with external knowledge in order to answer questions.
DUDE is formulated as an instance of Document Question Answering (DocQA) to evaluate how well current solutions deal with multi-page documents, if they can navigate and reason over the layout, and if they can generalize these skills to different document types and domains. Since we cannot provide question-answer pairs about, e.g., ticked checkboxes, on each document instance or document type, the challenge presented by DUDE is characterized equally as a Multi-Domain Long-Tailed Recognition problem
7 PAPERS • NO BENCHMARKS YET
DaNetQA is a question answering dataset for yes/no questions. These questions are naturally occurring ---they are generated in unprompted and unconstrained settings.
7 PAPERS • 1 BENCHMARK
RadQA is a radiology question answering dataset with 3074 questions posed against radiology reports and annotated with their corresponding answer spans (resulting in a total of 6148 question-answer evidence pairs) by physicians. The questions are manually created using the clinical referral section of the reports that take into account the actual information needs of ordering physicians and eliminate bias from seeing the answer context (and, further, organically create unanswerable questions). The answer spans are marked within the Findings and Impressions sections of a report. The dataset aims to satisfy the complex clinical requirements by including complete (yet concise) answer phrases (which are not just entities) that can span multiple lines.
SubjQA is a question answering dataset that focuses on subjective (as opposed to factual) questions and answers. The dataset consists of roughly 10,000 questions over reviews from 6 different domains: books, movies, grocery, electronics, TripAdvisor (i.e. hotels), and restaurants. Each question is paired with a review and a span is highlighted as the answer to the question (with some questions having no answer). Moreover, both questions and answer spans are assigned a subjectivity label by annotators. Questions such as "How much does this product weigh?" is a factual question (i.e., low subjectivity), while "Is this easy to use?" is a subjective question (i.e., high subjectivity).
The VNHSGE (VietNamese High School Graduation Examination) dataset, developed exclusively for evaluating large language models (LLMs), is introduced in this article. The dataset, which covers nine subjects, was generated from the Vietnamese National High School Graduation Examination and comparable tests. 300 literary essays have been included, and there are over 19,000 multiple-choice questions on a range of topics. The dataset assesses LLMs in multitasking situations such as question answering, text generation, reading comprehension, visual question answering, and more by including both textual data and accompanying images. Using ChatGPT and BingChat, we evaluated LLMs on the VNHSGE dataset and contrasted their performance with that of Vietnamese students to see how well they performed. The results show that ChatGPT and BingChat both perform at a human level in a number of areas, including literature, English, history, geography, and civics education. They still have space to grow, t
7 PAPERS • 9 BENCHMARKS
In this work we create a question answering dataset over the DBLP scholarly knowledge graph (KG). DBLP is an on-line reference for bibliographic information on major computer science publications that indexes over 4.4 million publications, published by more than 2.2 million authors. Our dataset consists of 10,000 question answer pairs with the corresponding SPARQL queries which can be executed over the DBLP KG to fetch the correct answer. To the best of our knowledge, this is the first QA dataset for scholarly KGs.
6 PAPERS • NO BENCHMARKS YET
This paper introduces FrenchMedMCQA, the first publicly available Multiple-Choice Question Answering (MCQA) dataset in French for medical domain. It is composed of 3,105 questions taken from real exams of the French medical specialization diploma in pharmacy, mixing single and multiple answers. Each instance of the dataset contains an identifier, a question, five possible answers and their manual correction(s). We also propose first baseline models to automatically process this MCQA task in order to report on the current performances and to highlight the difficulty of the task. A detailed analysis of the results showed that it is necessary to have representations adapted to the medical domain or to the MCQA task: in our case, English specialized models yielded better results than generic French ones, even though FrenchMedMCQA is in French. Corpus, models and tools are available online.
6 PAPERS • 1 BENCHMARK
IQUAD is a dataset for Visual Question Answering in interactive environments. It is built upon AI2-THOR, a simulated photo-realistic environment of configurable indoor scenes with interactive object. IQUAD V1 has 75,000 questions, each paired with a unique scene configuration.
A quantitative benchmark for developing and understanding video of fill-in-the-blank question-answering dataset with over 300,000 examples, based on descriptive video annotations for the visually impaired.
We present a reading comprehension challenge in which questions can only be answered by taking into account information from multiple sentences. The dataset is the first to study multi-sentence inference at scale, with an open-ended set of question types that requires reasoning skills.
RELX is a benchmark dataset for cross-lingual relation classification in English, French, German, Spanish and Turkish.
XQA is a data which consists of a total amount of 90k question-answer pairs in nine languages for cross-lingual open-domain question answering.
GermanQuAD is a Question Answering (QA) dataset of 13,722 extractive question/answer pairs in German.
5 PAPERS • NO BENCHMARKS YET
KAMEL comprises knowledge about 234 relations from Wikidata with a large training, validation, and test dataset. We make sure that all facts are also present in Wikipedia so that they have been seen during the pre-training procedure of the LMs we are probing. Most importantly we overcome the limitations of existing probing datasets by (1) having a larger variety of knowledge graph relations, (2) it contains single- and multi-token entities, (3) we use relations with literals, and (4) have alternative labels for entities. (5) Furthermore, we created an evaluation procedure for higher cardinality relations, which was missing in previous works, and (6) make sure that the dataset can be used for causal LMs.
5 PAPERS • 1 BENCHMARK
LEAF-QA, a comprehensive dataset of 250,000 densely annotated figures/charts, constructed from real-world open data sources, along with ~2 million question-answer (QA) pairs querying the structure and semantics of these charts. LEAF-QA highlights the problem of multimodal QA, which is notably different from conventional visual QA (VQA), and has recently gained interest in the community. Furthermore, LEAF-QA is significantly more complex than previous attempts at chart QA, viz. FigureQA and DVQA, which present only limited variations in chart data. LEAF-QA being constructed from real-world sources, requires a novel architecture to enable question answering.
Maternal and Infant (MATINF) Dataset is a large-scale dataset jointly labeled for classification, question answering and summarization in the domain of maternity and baby caring in Chinese. An entry in the dataset includes four fields: question (Q), description (D), class (C) and answer (A).
NExT-QA is a VideoQA benchmark targeting the explanation of video contents. It challenges QA models to reason about the causal and temporal actions and understand the rich object interactions in daily activities. This page records LLMs for answer evaluation.
QAConv is a new question answering (QA) dataset that uses conversations as a knowledge source. We focus on informative conversations including business emails, panel discussions, and work channels. Unlike opendomain and task-oriented dialogues, these conversations are usually long, complex, asynchronous, and involve strong domain knowledge. In total, we collect 34,204 QA pairs, including span-based, free-form, and unanswerable questions, from 10,259 selected conversations with both human-written and machine-generated questions. We segment long conversations into chunks, and use a question generator and dialogue summarizer as auxiliary tools to collect multi-hop questions. The dataset has two testing scenarios, chunk mode and full mode, depending on whether the grounded chunk is provided or retrieved from a large conversational pool.