CHIP Clinical Diagnosis Normalization (CHIP-CDN) is a dataset that aims to standardize terms from the final diagnoses of Chinese electronic medical records. Given an original phrase, the task requires normalizing it to standard terminology based on the International Classification of Diseases (ICD-10) standard, Beijing Clinical Edition v601.
5 PAPERS • 1 BENCHMARK
CLUECorpus2020 is a large-scale corpus that can be used directly for self-supervised learning, such as pre-training a language model, or for language generation. It contains 100 GB of raw text with 35 billion Chinese characters, retrieved from Common Crawl.
5 PAPERS • NO BENCHMARKS YET
Cant (also known as doublespeak, cryptolect, argot, anti-language or secret language) is important for understanding advertising, comedies and dog-whistle politics. DogWhistle is a large and diverse Chinese dataset for creating and understanding cant from a computational linguistics perspective.
Large language models (LLMs), after being aligned with vision models and integrated into vision-language models (VLMs), can bring impressive improvements in image reasoning tasks. This was shown by the recently released GPT-4V(ision), LLaVA-1.5, etc. However, the strong language prior in these SOTA LVLMs can be a double-edged sword: they may ignore the image context and rely solely on the (even contradictory) language prior for reasoning. Meanwhile, the vision modules in VLMs are weaker than LLMs and may produce misleading visual representations, which are then translated into confident mistakes by the LLMs.
Lyra is a dataset for code generation that consists of Python code with embedded SQL. This dataset contains 2,000 carefully annotated database manipulation programs from real usage projects. Each program is paired with both a Chinese comment and an English comment.
Maternal and Infant (MATINF) Dataset is a large-scale dataset jointly labeled for classification, question answering and summarization in the domain of maternity and baby caring in Chinese. An entry in the dataset includes four fields: question (Q), description (D), class (C) and answer (A).
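The four-field entry and the way the three MATINF tasks reuse it can be sketched as follows. This is a minimal illustration; the field names and pairings are assumptions based on the description above, not the released file format.

```python
from dataclasses import dataclass

@dataclass
class MatinfEntry:
    """One MATINF record: question, description, class label, answer.
    Field names are illustrative; the released files may differ."""
    question: str     # Q: the user's question
    description: str  # D: detailed description of the question
    category: str     # C: class label for the classification task
    answer: str       # A: reference answer

# The three tasks each reuse a different pair of fields:
def classification_pair(e: MatinfEntry):
    # classify (Q + D) into C
    return (e.question + " " + e.description, e.category)

def qa_pair(e: MatinfEntry):
    # answer Q with A
    return (e.question, e.answer)

def summarization_pair(e: MatinfEntry):
    # summarize D, with Q serving as the reference summary
    return (e.description, e.question)
```

This joint labeling is what lets a single corpus serve classification, question answering, and summarization at once.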
|           | Train | Validation | Test    | Ranking Test |
| --------- | ----- | ---------- | ------- | ------------ |
| size      | 0.4M  | 50K        | 5K      | 800          |
| pos:neg   | 1:1   | 1:9        | 1.2:8.8 | -            |
| avg turns | 5.0   | 5.0        | 5.0     | 5.0          |
RiSAWOZ is a large-scale multi-domain Chinese Wizard-of-Oz dataset with Rich Semantic Annotations. RiSAWOZ contains 11.2K human-to-human (H2H) multi-turn semantically annotated dialogues, with more than 150K utterances spanning over 12 domains, which is larger than all previous annotated H2H conversational datasets. Both single- and multi-domain dialogues are constructed, accounting for 65% and 35%, respectively. Each dialogue is labelled with comprehensive dialogue annotations, including dialogue goal in the form of natural language description, domain, dialogue states and acts at both the user and system side. In addition to traditional dialogue annotations, it also includes linguistic annotations on discourse phenomena, e.g., ellipsis and coreference, in dialogues, which are useful for dialogue coreference and ellipsis resolution tasks.
We present a further analysis of visual modality incompleteness, benchmarking the latest MMEA models on our proposed dataset MMEA-UMVM.
5 PAPERS • 7 BENCHMARKS
WebCPM is a Chinese LFQA dataset. It contains 5,500 high-quality question-answer pairs, together with 14,315 supporting facts and 121,330 web search actions.
BiPaR is a manually annotated bilingual parallel novel-style machine reading comprehension (MRC) dataset, developed to support monolingual, multilingual and cross-lingual reading comprehension on novels. The biggest difference between BiPaR and existing reading comprehension datasets is that each triple (Passage, Question, Answer) in BiPaR is written in parallel in two languages. BiPaR is diverse in prefixes of questions, answer types and relationships between questions and passages. Answering the questions requires reading comprehension skills of coreference resolution, multi-sentence reasoning, and understanding of implicit causality.
4 PAPERS • NO BENCHMARKS YET
CNewSum is a large-scale Chinese news summarization dataset which consists of 304,307 documents and human-written summaries for the news feed. It has long documents with highly abstractive summaries, which can encourage document-level understanding and generation in current summarization models. An additional distinguishing feature of CNewSum is that its test set contains adequacy and deducibility annotations for the summaries.
CUGE is a Chinese Language Understanding and Generation Evaluation benchmark with the following features: (1) Hierarchical benchmark framework, where datasets are principally selected and organized with a language capability-task-dataset hierarchy. (2) Multi-level scoring strategy, where different levels of model performance are provided based on the hierarchical framework.
FRMT is a dataset and evaluation benchmark for Few-shot Region-aware Machine Translation, a type of style-targeted translation. The dataset consists of human translations of a few thousand English Wikipedia sentences into regional variants of Portuguese and Mandarin. Source documents are selected to enable detailed analysis of phenomena of interest, including lexically distinct terms and distractor terms.
4 PAPERS • 4 BENCHMARKS
MuMiN is a misinformation graph dataset containing rich social media data (tweets, replies, users, images, articles, hashtags), spanning 21 million tweets belonging to 26 thousand Twitter threads, each of which have been semantically linked to 13 thousand fact-checked claims across dozens of topics, events and domains, in 41 different languages, spanning more than a decade.
4 PAPERS • 3 BENCHMARKS
The WikiSem500 dataset contains around 500 per-language cluster groups for English, Spanish, German, Chinese, and Japanese (a total of 13,314 test cases).
YACLC is a large-scale, multidimensional annotated Chinese learner corpus. To construct the corpus, the authors first obtained a large number of topic-rich texts generated by Chinese as a Foreign Language (CFL) learners. They collected and annotated 32,124 sentences written by CFL learners on the lang-8 platform, with each sentence annotated by 10 annotators. After post-processing, a total of 469,000 revised sentences were obtained.
BenchIE is a benchmark and evaluation framework for comprehensive evaluation of OIE systems for English, Chinese, and German. In contrast to existing OIE benchmarks, BenchIE takes into account the informational equivalence of extractions: the gold standard consists of fact synsets, clusters that exhaustively list all surface forms of the same fact.
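The fact-synset idea amounts to a scoring rule: an extracted triple counts as correct if it matches any surface form in some gold synset. A hypothetical sketch of that rule (not the official BenchIE scorer, and with made-up example facts):

```python
# A fact synset: the set of all acceptable surface forms of one fact.
gold_synsets = [
    {("Berlin", "is capital of", "Germany"),
     ("Berlin", "is the capital of", "Germany")},
]

def is_correct(extraction, synsets):
    """True if the extraction matches any surface form of any gold fact."""
    return any(extraction in synset for synset in synsets)

def precision(extractions, synsets):
    """Fraction of system extractions that hit some gold synset."""
    if not extractions:
        return 0.0
    return sum(is_correct(e, synsets) for e in extractions) / len(extractions)
```

Because every surface form of a fact lives in one synset, a system is not penalized for choosing one paraphrase of a fact over another, which is the point of scoring at the synset level rather than the triple level.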
3 PAPERS • 1 BENCHMARK
3 PAPERS • NO BENCHMARKS YET
This dataset is constructed from free-access online fiction tagged with sci-fi, urban novel, love story, youth, etc. It is used for Writing Polishment with Smile (WPS), a task that aims to polish plain text with similes. All similes are extracted with rich regular expressions, and the extraction precision is estimated at 92% by labelling 500 randomly extracted samples. The dataset contains 5M samples for training and 2.5K each for validation and test.
ChatHaruhi is a dataset covering 32 Chinese / English TV / anime characters with over 54k simulated dialogues.
The DISRPT 2021 shared task, co-located with CODI 2021 at EMNLP, introduces the second iteration of a cross-formalism shared task on discourse unit segmentation and connective detection, as well as the first iteration of a cross-formalism discourse relation classification task.
DiaASQ is a fine-grained Aspect-based Sentiment Analysis (ABSA) benchmark for the conversation scenario. It challenges existing ABSA methods by 1) extracting target-aspect-opinion-sentiment quadruples in a dialogue, and 2) modeling the dialogue discourse structure. The dataset is constructed by systematically crawling tweets from digital bloggers, followed by a series of preprocessing steps including filtering, normalizing, pruning, and annotating the collected dialogues, resulting in a final corpus of 1,000 dialogues. To enhance multilingual usability, DiaASQ is provided in both English and Chinese versions.
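The quadruple structure described above can be sketched as a simple record type. The field names and example values here are illustrative assumptions, not the dataset's actual schema:

```python
from typing import NamedTuple

class SentimentQuad(NamedTuple):
    """One DiaASQ-style annotation: what is discussed, which aspect,
    the opinion expression, and its polarity."""
    target: str     # entity under discussion, e.g. a phone model
    aspect: str     # attribute of the target, e.g. "battery"
    opinion: str    # opinion span, e.g. "drains fast"
    sentiment: str  # polarity label: "pos" / "neg" / "neu"

# Elements of one quadruple may be scattered across different utterances
# of the dialogue, which is why discourse structure matters for extraction.
quad = SentimentQuad("Phone X", "battery", "drains fast", "neg")
```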
3 PAPERS • 2 BENCHMARKS
Diamante is a novel and efficient framework consisting of a data collection strategy and a learning method to boost the performance of pre-trained dialogue models. Two kinds of human feedback are collected and leveraged in Diamante, including explicit demonstration and implicit preference. The Diamante dataset is publicly available at the LUGE platform.
GeoCoV19 is a large-scale Twitter dataset containing more than 524 million multilingual tweets. The dataset contains around 378K geotagged tweets and 5.4 million tweets with Place information. The annotations extract toponyms from the user location field and tweet content and resolve them to geolocations at the country, state, or city level. In total, 297 million tweets are annotated with geolocation using the user location field and 452 million tweets using tweet content.
LEVEN is the largest Legal Event Detection dataset as well as the largest Chinese Event Detection dataset.
MMChat is a large-scale Chinese multi-modal dialogue corpus (120.84K dialogues and 198.82K images). MMChat contains image-grounded dialogues collected from real conversations on social media. We manually annotate 100K dialogues from MMChat with dialogue quality and with whether the dialogues are related to the given image. We also provide the rule-filtered raw dialogues used to create MMChat (Rule Filtered Raw MMChat), containing 4.257M dialogue sessions and 4.874M images, as well as a version of MMChat filtered based on LCCC (LCCC Filtered MMChat), which contains much cleaner dialogues (492.6K dialogue sessions and 1.066M images).
The ODSQA dataset is a spoken dataset for question answering in Chinese. It contains more than three thousand questions from 20 different speakers.
The Sina Weibo Sexism Review (SWSR) dataset is a dataset to research online sexism in Chinese. The SWSR dataset provides labels at different levels of granularity including (i) sexism or non-sexism, (ii) sexism category and (iii) target type, which can be exploited, among others, for building computational methods to identify and investigate finer-grained gender-related abusive language.
Title2Event is a large-scale sentence-level dataset for benchmarking Open Event Extraction without restricting event types. Title2Event contains more than 42,000 news titles in 34 topics collected from Chinese web pages.
WDC-Dialogue is a dataset built from the Chinese social media to train EVA. Specifically, conversations from various sources are gathered and a rigorous data cleaning pipeline is designed to enforce the quality of WDC-Dialogue.
This dataset consists of social media polls collected from Weibo, a popular Chinese microblogging platform. It aims to support empirical study of social media polls and analysis of user engagement patterns.
3 PAPERS • 3 BENCHMARKS
Wikipedia Title is a dataset for learning character-level compositionality from characters' visual characteristics. It consists of a collection of Wikipedia titles in Chinese, Japanese, or Korean, labelled with the category to which the article belongs.
XWINO is a multilingual collection of Winograd Schemas in six languages that can be used for evaluation of cross-lingual commonsense reasoning capabilities.
CA4P-483 is a dataset designed to facilitate the sequence labeling tasks and regulation compliance identification between privacy policies and software. It contains 483 Chinese Android application privacy policies, over 11K sentences, and 52K fine-grained annotations.
2 PAPERS • NO BENCHMARKS YET
CSCD-IME (Chinese Spelling Correction Dataset for errors generated by pinyin IME) is a dataset containing 40,000 annotated sentences from real posts of official media on Sina Weibo. It is designed for detecting and correcting spelling mistakes in Chinese texts.
Classifiers are function words that are used to express quantities in Chinese and are especially difficult for language learners. This dataset of Chinese Classifiers can be used to predict Chinese classifiers from context. The dataset contains a large collection of example sentences for Chinese classifier usage derived from three language corpora (Lancaster Corpus of Mandarin Chinese, UCLA Corpus of Written Chinese and Leiden Weibo Corpus). The data was cleaned and processed for a context-based classifier prediction task.
The Chinese Gigaword corpus consists of 2.2M headline-document pairs of news stories covering over 284 months from two Chinese newspapers, namely the Xinhua News Agency of China (XIN) and the Central News Agency of Taiwan (CNA).
The DialogUSR dataset covers 23 domains and was built with a multi-step crowd-sourcing procedure. On average, each query comprises 36.7 Chinese characters and assembles 3.6 single-intent queries (including initial and follow-up queries). It is designed for the dialogue utterance splitting and reformulation task.
ExpMRC is a benchmark for the Explainability evaluation of Machine Reading Comprehension. ExpMRC contains four subsets of popular MRC datasets with additionally annotated evidences, including SQuAD, CMRC 2018, RACE+ (similar to RACE), and C3, covering span-extraction and multiple-choice questions MRC tasks in both English and Chinese.
2 PAPERS • 4 BENCHMARKS
Pretrain: 200K; Instruction: 100K.
Hansel is a human-annotated Chinese entity linking (EL) dataset, focusing on tail entities and emerging entities.
K-SportsSum is a sports game summarization dataset with two characteristics: (1) K-SportsSum collects a large amount of data from massive games. It has 7,854 commentary-news pairs. To improve the quality, K-SportsSum employs a manual cleaning process; (2) Different from existing datasets, to narrow the knowledge gap, K-SportsSum further provides a large-scale knowledge corpus that contains the information of 523 sports teams and 14,724 sports players.
MCSCSet is a large-scale specialist-annotated dataset, designed for the task of Medical-domain Chinese Spelling Correction that contains about 200k samples. MCSCSet involves: i) extensive real-world medical queries collected from Tencent Yidian, ii) corresponding misspelled sentences manually annotated by medical specialists.
The dataset covers five domains (synthetic, document, street view, handwritten, and car license) with over five million images.
2 PAPERS • 2 BENCHMARKS
MultiSpider is a large multilingual text-to-SQL dataset which covers seven languages (English, German, French, Spanish, Japanese, Chinese, and Vietnamese).
OIR is a financial-domain dataset for the outbound intent recognition task, which aims to identify the intent of customer responses in the outbound call scenario.
PETCI is a Parallel English Translation dataset of Chinese Idioms, collected from an idiom dictionary and Google and DeepL translation. PETCI contains 4,310 Chinese idioms with 29,936 English translations. These translations capture diverse translation errors and paraphrase strategies.