CMRC is a machine reading comprehension dataset annotated by human experts, containing nearly 20,000 questions as well as a challenging set composed of questions that require reasoning over multiple clues.
63 PAPERS • 11 BENCHMARKS
Delta Reading Comprehension Dataset (DRCD) is an open-domain traditional Chinese machine reading comprehension (MRC) dataset. It aims to serve as a standard Chinese MRC dataset and as a source dataset for transfer learning. The dataset contains 10,014 paragraphs from 2,108 Wikipedia articles and more than 30,000 questions generated by annotators.
52 PAPERS • 5 BENCHMARKS
CMRC 2018 is a dataset for Chinese Machine Reading Comprehension. Specifically, it is a span-extraction reading comprehension dataset that is similar to SQuAD.
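In a SQuAD-style span-extraction dataset like CMRC 2018, each answer is a substring of the context, typically located by a character offset. A minimal sketch of how such a record can be consumed (field names follow the SQuAD convention and are an assumption, not the exact CMRC 2018 schema):

```python
# Illustrative span-extraction record in SQuAD-like form; the field names
# ("context", "answers", "answer_start") are assumptions for this sketch.
example = {
    "context": "CMRC 2018 is a span-extraction reading comprehension dataset.",
    "question": "What kind of dataset is CMRC 2018?",
    "answers": [
        {"text": "span-extraction reading comprehension dataset",
         "answer_start": 15},
    ],
}

def extract_answer(record):
    """Recover the answer text from the context via its character offset."""
    ans = record["answers"][0]
    start = ans["answer_start"]
    return record["context"][start:start + len(ans["text"])]

span = extract_answer(example)
assert span == example["answers"][0]["text"]
```

Because answers are spans of the context, models are usually trained to predict start and end positions rather than generate free text, and checking that the offset reproduces the answer string is a common sanity check when preprocessing.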
46 PAPERS • 7 BENCHMARKS
Belebele is a multiple-choice machine reading comprehension (MRC) dataset spanning 122 language variants. This dataset enables the evaluation of mono- and multi-lingual models in high-, medium-, and low-resource languages. Each question has four multiple-choice answers and is linked to a short passage from the FLORES-200 dataset. The human annotation procedure was carefully curated to create questions that discriminate between different levels of generalizable language comprehension and was reinforced by extensive quality checks. While all questions directly relate to the passage, the English dataset on its own proves difficult enough to challenge state-of-the-art language models. Being fully parallel, this dataset enables direct comparison of model performance across all languages. Belebele opens up new avenues for evaluating and analyzing the multilingual abilities of language models and NLP systems.
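The evaluation setup described above reduces to scoring four-way multiple choice per language. A minimal sketch, assuming an illustrative record layout (the field names below are not the released Belebele schema):

```python
# Hypothetical Belebele-style records: each question pairs a short passage
# with four answer options, exactly one of which is correct.
records = [
    {
        "passage": "The river flooded after three days of heavy rain.",
        "question": "Why did the river flood?",
        "options": ["A drought", "Heavy rain", "A dam opening", "Snow melt"],
        "correct_idx": 1,  # index of the correct option
    },
]

def accuracy(records, predicted_indices):
    """Fraction of questions where the predicted option index is correct."""
    hits = sum(p == r["correct_idx"]
               for p, r in zip(predicted_indices, records))
    return hits / len(records)

acc = accuracy(records, [1])  # a model that picks option 1 here scores 1.0
```

Since the dataset is fully parallel, computing this accuracy separately per language variant gives directly comparable numbers across all 122 variants.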
19 PAPERS • NO BENCHMARKS YET
The Chinese judicial reading comprehension (CJRC) dataset contains approximately 10K documents and almost 50K questions with answers. The documents come from judgment documents and the questions are annotated by law experts.
9 PAPERS • 2 BENCHMARKS
The DeepMind Q&A Dataset consists of two question-answering datasets, CNN and DailyMail. They contain roughly 90k and 197k documents respectively, and each document is accompanied by approximately four questions on average. Each question is a sentence with one missing word or phrase that can be found in the accompanying document/context.
3 PAPERS • NO BENCHMARKS YET