BMELD is a bilingual English-Chinese dialogue corpus for neural chat translation.
6 PAPERS • NO BENCHMARKS YET
We collect utterances from the Chinese Artificial Intelligence Speakers (CAIS) and annotate them with slot tags and intent labels. The training, validation, and test sets are split according to the distribution of intents; detailed statistics are provided in the supplementary material. Since the utterances are collected from real-world speaker systems, the intent labels are skewed toward the PlayMusic intent. We adopt the BIOES tagging scheme for slots instead of the BIO2 scheme used in ATIS, since previous studies have reported meaningful improvements from this scheme in sequence labeling (Ratinov and Roth, 2009).
6 PAPERS • 2 BENCHMARKS
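The BIOES scheme adopted for CAIS slots above extends the usual BIO inventory with explicit End and Single tags, so entity boundaries are marked on both sides. A minimal sketch of the conversion (the function name and example slot labels are illustrative, not taken from the CAIS release):

```python
def bio_to_bioes(tags):
    """Convert a BIO tag sequence to BIOES.

    B-X stays B-X if followed by I-X, else becomes S-X (single token);
    I-X stays I-X if followed by I-X, else becomes E-X (entity end).
    """
    bioes = []
    for i, tag in enumerate(tags):
        if tag == "O":
            bioes.append(tag)
            continue
        prefix, label = tag.split("-", 1)
        next_tag = tags[i + 1] if i + 1 < len(tags) else "O"
        continues = next_tag == "I-" + label
        if prefix == "B":
            bioes.append(("B-" if continues else "S-") + label)
        else:  # prefix == "I"
            bioes.append(("I-" if continues else "E-") + label)
    return bioes

print(bio_to_bioes(["B-song", "I-song", "I-song", "O", "B-artist"]))
# ['B-song', 'I-song', 'E-song', 'O', 'S-artist']
```

The extra E/S tags give a sequence labeler more boundary signal, which is the improvement Ratinov and Roth (2009) report.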
ChineseFoodNet aims at automatically recognizing pictured Chinese dishes. Most existing food image datasets collect images either from recipe pictures or from selfies. In ChineseFoodNet, the images for each food category include not only web recipe and menu pictures but also photos taken of real dishes, recipes, and menus. The dataset contains over 180,000 food photos across 208 categories, with each category covering large variations in the presentation of the same Chinese dish.
DEMETR is a diagnostic dataset with 31K English examples (translated from 10 source languages) for evaluating the sensitivity of MT evaluation metrics to 35 different linguistic perturbations spanning semantic, syntactic, and morphological error categories.
KUAKE Query-Query Relevance, a dataset used to evaluate the relevance of the content expressed in two queries, is used for the KUAKE-QQR task. Similar to KUAKE-QTR, the task aims to estimate query-query relevance, which is an essential and challenging task in real-world search engines.
6 PAPERS • 1 BENCHMARK
KUAKE Query-Title Relevance, a dataset used to estimate the relevance between a query and a document title, is used for the KUAKE-QTR task. Given a query (e.g., “Symptoms of vitamin B deficiency”), the task aims to find the relevant title (e.g., “The main manifestations of vitamin B deficiency”).
KaMed is a knowledge-aware medical dialogue dataset, which contains over 60,000 medical dialogue sessions with 5,682 entities (such as Asthma and Atropine).
LCQMC is a large-scale Chinese question matching corpus. LCQMC is more general than a paraphrase corpus, as it focuses on intent matching rather than paraphrasing. The corpus contains 260,068 question pairs with manual annotations.
The MedDialog dataset (Chinese) contains conversations (in Chinese) between doctors and patients. It has 1.1 million dialogues and 4 million utterances. The data is continuously growing and more dialogues will be added. The raw dialogues are from haodf.com. All copyrights of the data belong to haodf.com.
XL-BEL is a benchmark for cross-lingual biomedical entity linking, spanning 10 typologically diverse languages.
XQA is a dataset consisting of 90k question-answer pairs in nine languages for cross-lingual open-domain question answering.
ASCEND (A Spontaneous Chinese-English Dataset) is a high-quality corpus of spontaneous, multi-turn, conversational Chinese-English code-switching speech collected in Hong Kong. ASCEND includes 23 bilingual speakers fluent in both Chinese and English and comprises 10.62 hours of clean speech.
5 PAPERS • NO BENCHMARKS YET
The Chinese Academy of Sciences Micro-Expression dataset (CASME II) consists of 255 videos, elicited from 26 participants. The videos are recorded using a Point Grey GRAS-03K2C camera with a frame rate of 200 fps. The average video length is 0.34 s, equivalent to 68 frames. Each video’s emotion label is annotated by two coders, with a reliability of 0.846.
5 PAPERS • 1 BENCHMARK
CHECKED is a Chinese dataset on COVID-19 misinformation. It provides ground-truth credibility labels, carefully obtained by ensuring that specific verified sources are used. CHECKED includes microblogs related to COVID-19, identified using a specific list of keywords, covering a total of 2,120 microblogs published from December 2019 to August 2020. The dataset contains a rich set of multimedia information for each microblog, including the ground-truth label as well as textual, visual, response, and social network information.
CHIP Clinical Diagnosis Normalization, a dataset that aims to standardize terms from the final diagnoses of Chinese electronic medical records, is used for the CHIP-CDN task. Given an original phrase, the task is to normalize it to standard terminology based on the International Classification of Diseases (ICD-10) standard for the Beijing Clinical Edition v601.
CLUECorpus2020 is a large-scale corpus that can be used directly for self-supervised learning, such as pre-training a language model, or for language generation. It comprises 100 GB of raw text with 35 billion Chinese characters, retrieved from Common Crawl.
Chinese Text in the Wild is a dataset of Chinese text containing about 1 million Chinese characters (from 3,850 unique characters) annotated by experts in over 30,000 street view images. This is a challenging dataset with good diversity, containing planar text, raised text, text under poor illumination, distant text, partially occluded text, etc.
This resource, the Concepticon, links concept labels from different concept lists to concept sets. Each concept set is given a unique identifier, a unique label, and a human-readable definition. Concept sets are further structured by defining relations between concepts (for example, among the concept sets linked to the concept set SIBLING). The resource can be used for various purposes. Serving as a rich reference for new and existing databases in diachronic and synchronic linguistics, it gives researchers quick access to studies on semantic change, cross-linguistic polysemies, and semantic associations.
Cant (also known as doublespeak, cryptolect, argot, anti-language or secret language) is important for understanding advertising, comedies and dog-whistle politics. DogWhistle is a large and diverse Chinese dataset for creating and understanding cant from a computational linguistics perspective.
DurLAR is a high-fidelity 128-channel 3D LiDAR dataset with panoramic ambient (near infrared) and reflectivity imagery for multi-modal autonomous driving applications. Compared to existing autonomous driving task datasets, DurLAR has the following novel features:
Large language models (LLMs), after being aligned with vision models and integrated into vision-language models (VLMs), can bring impressive improvements in image reasoning tasks. This was shown by the recently released GPT-4V(ision), LLaVA-1.5, etc. However, the strong language prior in these SOTA LVLMs can be a double-edged sword: they may ignore the image context and rely solely on the (even contradictory) language prior for reasoning. In contrast, the vision modules in VLMs are weaker than LLMs and may produce misleading visual representations, which are then translated into confident mistakes by the LLMs.
Lyra is a dataset for code generation that consists of Python code with embedded SQL. The dataset contains 2,000 carefully annotated database manipulation programs from real-world projects. Each program is paired with both a Chinese comment and an English comment.
Maternal and Infant (MATINF) Dataset is a large-scale dataset jointly labeled for classification, question answering and summarization in the domain of maternity and baby caring in Chinese. An entry in the dataset includes four fields: question (Q), description (D), class (C) and answer (A).
|           | Train | Validation | Test    | Ranking Test |
| --------- | ----- | ---------- | ------- | ------------ |
| size      | 0.4M  | 50K        | 5K      | 800          |
| pos:neg   | 1:1   | 1:9        | 1.2:8.8 | -            |
| avg turns | 5.0   | 5.0        | 5.0     | 5.0          |
RiSAWOZ is a large-scale multi-domain Chinese Wizard-of-Oz dataset with Rich Semantic Annotations. RiSAWOZ contains 11.2K human-to-human (H2H) multi-turn semantically annotated dialogues, with more than 150K utterances spanning over 12 domains, which is larger than all previous annotated H2H conversational datasets. Both single- and multi-domain dialogues are constructed, accounting for 65% and 35%, respectively. Each dialogue is labelled with comprehensive dialogue annotations, including dialogue goal in the form of natural language description, domain, dialogue states and acts at both the user and system side. In addition to traditional dialogue annotations, it also includes linguistic annotations on discourse phenomena, e.g., ellipsis and coreference, in dialogues, which are useful for dialogue coreference and ellipsis resolution tasks.
We present a further analysis of visual modality incompleteness, benchmarking the latest multi-modal entity alignment (MMEA) models on our proposed dataset, MMEA-UMVM.
5 PAPERS • 7 BENCHMARKS
WebCPM is a Chinese long-form question answering (LFQA) dataset. It contains 5,500 high-quality question-answer pairs, together with 14,315 supporting facts and 121,330 web search actions.
This dataset extension studies how image backgrounds affect computer vision models. It covers the following topics: blurred backgrounds, segmented backgrounds, AI-generated backgrounds, annotation-tool bias, background color, dependent factors in the background, latent-space distance of the foreground, and random backgrounds from real environments.
BiPaR is a manually annotated bilingual parallel novel-style machine reading comprehension (MRC) dataset, developed to support monolingual, multilingual and cross-lingual reading comprehension on novels. The biggest difference between BiPaR and existing reading comprehension datasets is that each triple (Passage, Question, Answer) in BiPaR is written in parallel in two languages. BiPaR is diverse in prefixes of questions, answer types and relationships between questions and passages. Answering the questions requires reading comprehension skills of coreference resolution, multi-sentence reasoning, and understanding of implicit causality.
4 PAPERS • NO BENCHMARKS YET
CNewSum is a large-scale Chinese news summarization dataset consisting of 304,307 documents and human-written summaries for the news feed. It has long documents with highly abstractive summaries, which can encourage document-level understanding and generation in current summarization models. An additional distinguishing feature of CNewSum is that its test set contains adequacy and deducibility annotations for the summaries.
CUGE is a Chinese Language Understanding and Generation Evaluation benchmark with the following features: (1) Hierarchical benchmark framework, where datasets are principally selected and organized with a language capability-task-dataset hierarchy. (2) Multi-level scoring strategy, where different levels of model performance are provided based on the hierarchical framework.
ChatHaruhi is a dataset covering 32 Chinese / English TV / anime characters with over 54k simulated dialogues.
The ChineseLP dataset contains 411 vehicle images (mostly of passenger cars) with Chinese license plates (LPs). It consists of 252 images captured by the authors and 159 images downloaded from the internet. The images present great variations in resolution (from 143 × 107 to 2048 × 1536 pixels), illumination and background.
4 PAPERS • 1 BENCHMARK
FRMT is a dataset and evaluation benchmark for Few-shot Region-aware Machine Translation, a type of style-targeted translation. The dataset consists of human translations of a few thousand English Wikipedia sentences into regional variants of Portuguese and Mandarin. Source documents are selected to enable detailed analysis of phenomena of interest, including lexically distinct terms and distractor terms.
4 PAPERS • 4 BENCHMARKS
MuMiN is a misinformation graph dataset containing rich social media data (tweets, replies, users, images, articles, hashtags). It spans 21 million tweets belonging to 26 thousand Twitter threads, each of which has been semantically linked to 13 thousand fact-checked claims across dozens of topics, events, and domains, in 41 different languages, spanning more than a decade.
4 PAPERS • 3 BENCHMARKS
The WikiSem500 dataset contains around 500 per-language cluster groups for English, Spanish, German, Chinese, and Japanese (a total of 13,314 test cases).
YACLC is a large-scale, multidimensional annotated Chinese learner corpus. To construct the corpus, the authors first obtained a large number of topic-rich texts written by Chinese as a Foreign Language (CFL) learners. They collected and annotated 32,124 sentences written by CFL learners on the lang-8 platform, with each sentence annotated by 10 annotators. After post-processing, a total of 469,000 revised sentences were obtained.
BenchIE is a benchmark and evaluation framework for comprehensive evaluation of open information extraction (OIE) systems for English, Chinese, and German. In contrast to existing OIE benchmarks, BenchIE takes into account the informational equivalence of extractions: its gold standard consists of fact synsets, clusters in which all surface forms of the same fact are exhaustively listed.
3 PAPERS • 1 BENCHMARK
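BenchIE's fact-synset idea can be illustrated with a deliberately simplified scorer: a predicted triple counts as correct if it matches any surface form in some synset, and a gold fact counts as recalled if any of its surface forms is predicted. This is only a sketch of the matching idea under those assumptions; the function name and example triples are hypothetical, and the official BenchIE scorer is more elaborate.

```python
def benchie_scores(predicted, gold_synsets):
    """Toy synset-based scoring for OIE extractions.

    predicted: list of (subject, relation, object) triples.
    gold_synsets: list of synsets, each a list of equivalent gold triples.
    """
    pred = {tuple(p) for p in predicted}
    gold_forms = {tuple(f) for synset in gold_synsets for f in synset}

    # Precision: share of predictions matching some surface form of a fact.
    correct = sum(1 for p in pred if p in gold_forms)
    precision = correct / len(pred) if pred else 0.0

    # Recall: share of gold facts with at least one surface form predicted.
    covered = sum(1 for synset in gold_synsets
                  if any(tuple(f) in pred for f in synset))
    recall = covered / len(gold_synsets) if gold_synsets else 0.0

    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return precision, recall, f1
```

Listing every surface form of a fact in one synset is what lets a system get full credit for "Michael born in Chicago" even when the gold annotation was originally written as "Michael was born in Chicago".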
3 PAPERS • NO BENCHMARKS YET
A benchmark dataset with 960 pairs of Chinese words for word similarity, covering three part-of-speech (POS) tags, where each word has two morphemes; the pairs are annotated with human similarity judgments rather than relatedness.
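Word-similarity benchmarks of this kind are conventionally scored with the Spearman rank correlation between model similarity scores and the human annotations. A stdlib-only sketch, assuming plain lists of human ratings and model scores (not this dataset's official evaluation code):

```python
def rankdata(xs):
    """Average 1-based ranks, with ties sharing their mean rank."""
    order = sorted(range(len(xs)), key=lambda i: xs[i])
    ranks = [0.0] * len(xs)
    i = 0
    while i < len(xs):
        j = i
        while j + 1 < len(xs) and xs[order[j + 1]] == xs[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks

def spearman(human, model):
    """Spearman correlation = Pearson correlation of the rank vectors."""
    rx, ry = rankdata(human), rankdata(model)
    n = len(rx)
    mx, my = sum(rx) / n, sum(ry) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    sx = sum((a - mx) ** 2 for a in rx) ** 0.5
    sy = sum((b - my) ** 2 for b in ry) ** 0.5
    return cov / (sx * sy)
```

In practice `scipy.stats.spearmanr` does the same computation; the point here is only that the metric compares rank orders, which is why such datasets emphasize similarity ratings over raw relatedness scores.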
A benchmark dataset that consists of 99,000+ sentences for Chinese polyphone disambiguation.
This dataset is constructed from free online fiction tagged as sci-fi, urban novel, love story, youth, etc. It is used for Writing Polishment with Simile (WPS), a task that aims to polish plain text with similes. All similes are extracted with rich regular expressions, and the extraction precision is estimated at 92% by labelling 500 randomly extracted samples. It contains 5M samples for training and 2.5k each for validation and test.
The DISRPT 2021 shared task, co-located with CODI 2021 at EMNLP, introduces the second iteration of a cross-formalism shared task on discourse unit segmentation and connective detection, as well as the first iteration of a cross-formalism discourse relation classification task.
DiaASQ is a fine-grained Aspect-based Sentiment Analysis (ABSA) benchmark under the conversation scenario. It challenges existing ABSA methods by 1) extracting target-aspect-opinion-sentiment quadruples in a dialogue, and 2) modeling dialogue discourse structures. The dataset is constructed by systematically crawling tweets from digital bloggers, followed by a series of preprocessing steps including filtering, normalizing, pruning, and annotating the collected dialogues, resulting in a final corpus of 1,000 dialogues. To enhance multilingual usability, DiaASQ is provided in both English and Chinese versions.
3 PAPERS • 2 BENCHMARKS
Diamante is a novel and efficient framework consisting of a data collection strategy and a learning method to boost the performance of pre-trained dialogue models. Two kinds of human feedback are collected and leveraged in Diamante, including explicit demonstration and implicit preference. The Diamante dataset is publicly available at the LUGE platform.
This is a medical multiple-choice dataset with explanations that can be used to interpret the answers. The data comes from the Chinese Pharmacist Examination. Each instance has a question, five options, a gold_answer, and a gold_explanation.
GeoCoV19 is a large-scale Twitter dataset containing more than 524 million multilingual tweets. The dataset contains around 378K geotagged tweets and 5.4 million tweets with Place information. The annotations include toponyms from the user location field and tweet content, resolved to geolocations at the country, state, or city level. Overall, 297 million tweets are annotated with a geolocation using the user location field and 452 million using tweet content.
The Hume Vocal Burst Database (H-VB) includes all train, validation, and test recordings and corresponding emotion ratings for the train and validation recordings.
3 PAPERS • 7 BENCHMARKS