The dataset is recorded with an on-vehicle ZED stereo camera in both urban and rural environments
NaSGEC is a new dataset to facilitate research on Chinese grammatical error correction (CGEC) for native speaker texts from multiple domains. Previous CGEC research primarily focuses on correcting texts from a single domain, especially learner essays.
Datasets OQRanD and OQGenD for the paper "Asking the Crowd: Question Analysis, Evaluation and Generation for Open Discussion on Online Forums" by Zi Chai, Xinyu Xing, Xiaojun Wan and Bo Huang, accepted at ACL 2019.
The dataset consists of 20 videos, each 5.5 minutes long. The videos are captured at a resolution of 1024x1024 and 30 frames per second. Each video contains a single pig performing the Novel Object Recognition task.
PSM is a financial-domain dataset of the pairwise search matching task. It aims to identify the semantic similarity of a sentence pair in the search scenario.
PTVD is a plot-oriented multimodal dataset in the TV domain. It is also the first non-English dataset of its kind. Additionally, PTVD contains more than 26 million bullet screen comments (BSCs), powering large-scale pre-training.
The Pan+ChiPhoto dataset is a Chinese character dataset built by combining two datasets: ChiPhoto and Pan_Chinese_Character. The images are mainly captured outdoors in Beijing and Shanghai, China, and cover various scenes such as signs, boards, advertisements, banners, and objects with text printed on their surfaces.
Perseus is a dataset for Cross-Lingual Summarization (CLS) which collects about 94K Chinese scientific documents paired with English summaries. The average length of documents in Perseus is more than two thousand tokens.
A high-quality large-scale dataset consisting of 49,000+ data samples for the task of Chinese query-based document summarization.
A multilingual explainable fact-checking dataset on the 2022 Russia-Ukraine conflict.
SLING consists of 38K minimal sentence pairs in Mandarin Chinese grouped into 9 high-level linguistic phenomena. Each pair demonstrates the acceptability contrast of a specific syntactic or semantic phenomenon (e.g., The keys are lost vs. The keys is lost), and an LM should assign lower perplexity to the acceptable sentence.
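The pairwise evaluation SLING describes can be sketched as a perplexity comparison between the two sentences of a minimal pair. In this minimal sketch, the per-token log-probabilities are invented numbers for illustration, not outputs of any real LM:

```python
import math

def perplexity(token_logprobs):
    """Perplexity from a sentence's per-token natural-log probabilities."""
    avg_nll = -sum(token_logprobs) / len(token_logprobs)
    return math.exp(avg_nll)

# Hypothetical per-token log-probs an LM might assign to a minimal pair.
acceptable   = [-2.1, -0.9, -1.4, -0.7]   # "The keys are lost"
unacceptable = [-2.1, -0.9, -3.8, -0.7]   # "The keys is lost"

# The LM "passes" the pair if the acceptable sentence gets lower perplexity.
passes = perplexity(acceptable) < perplexity(unacceptable)
```

Equivalently, since both sentences in a minimal pair have the same length, comparing summed log-probabilities would give the same verdict.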
SSD (Sub-slot Dialog) dataset: This is the dataset for the ACL 2022 paper "A Slot Is Not Built in One Utterance: Spoken Language Dialogs with Sub-Slots".
Photometric stereo test datasets captured under six lights using laboratory equipment. Note that this dataset has no ground truth (GT).
This dataset consists of 2,192 high-quality traditional Chinese landscape paintings (中国山水画). All paintings are sized 512x512 and come from the following sources:

* Princeton University Art Museum, 362 paintings
* Harvard University Art Museum, 101 paintings
* Metropolitan Museum of Art, 428 paintings
* Smithsonian's Freer Gallery of Art, 1,301 paintings
https://github.com/zzr-idam/Under-Display-Camera-UAV
UNER v1 adds an NER annotation layer to 18 datasets (primarily treebanks from UD) and covers 12 genealogically and typologically diverse languages: Cebuano, Danish, German, English, Croatian, Portuguese, Russian, Slovak, Serbian, Swedish, Tagalog, and Chinese. Overall, UNER v1 contains nine full datasets with training, development, and test splits over eight languages, three evaluation sets for lower-resource languages (TL and CEB), and a parallel evaluation benchmark spanning six languages.
A high-resolution version of VGGFace2 for academic face editing purposes. This project uses GFPGAN for image restoration and insightface for data preprocessing (crop and align).
VTQA is a dataset containing open-ended questions about image-text pairs. It requires a model to align multimedia representations of the same entity, perform multi-hop reasoning between image and text, and finally answer the question in natural language. The aim of the dataset is to develop and benchmark models capable of multimedia entity alignment, multi-step reasoning, and open-ended answer generation. VTQA consists of 10,238 image-text pairs and 27,317 questions. The images are real images from the MSCOCO dataset and contain a variety of entities. Annotators first annotate text relevant to the image, then ask questions based on the image-text pair, and finally provide open-ended answers.
Voice Navigation is a large-scale dataset of Chinese speech for slot filling, containing more than 830,000 samples.
WEATHub is a dataset covering 24 languages. It contains words organized into groups of (target1, target2, attribute1, attribute2) to measure the association target1:target2 :: attribute1:attribute2. For example, target1 could be insects and target2 flowers, and we might measure whether insects or flowers are perceived as pleasant or unpleasant. Word associations are quantified using the WEAT metric, which calculates an effect size (Cohen's d) and a p-value to measure the statistical significance of the results. In our paper, we use word embeddings from language models to perform these tests and understand biased associations in language models across different languages.
This dataset is used for user identity linkage across two Chinese online social networks. It covers two popular Chinese social platforms: Sina Weibo (https://weibo.com) and Douban (https://www.douban.com).
Wiki-zh is an annotated Chinese dataset for domain detection extracted from Wikipedia. It includes texts from 7 different domains: “Business and Commerce” (BUS), “Government and Politics” (GOV), “Physical and Mental Health” (HEA), “Law and Order” (LAW), “Lifestyle” (LIF), “Military” (MIL), and “General Purpose” (GEN). It contains 26,280 documents split into training, validation and test.
The XiaChuFang Recipe Corpus contains recipes from 下厨房 (XiaChuFang), a popular Chinese recipe-sharing website. The full corpus contains 1,520,327 Chinese recipes. Among them, 1,242,206 recipes belong to 30,060 dishes, so a dish has 41.3 recipes on average.
XinhuaHallucinations is part of the UHGEval benchmark and contains over 5,000 news items. It can be used for hallucination evaluation or detection tasks.
IMO-level geometry problems with complete natural language descriptions, geometric shapes, formal language annotations, and theorem-sequence annotations.
A high-quality, balanced dataset of 330,000 images featuring various types of Chinese license plates. The dataset is generated using Generative Adversarial Networks (GANs), ensuring excellent image quality and a balanced distribution of different license plate types. This dataset is perfect for training and evaluating license plate recognition models.
A Chinese and Naxi scene text detection dataset, with labelme annotations exported to JSON.
The Hong Kong Cantonese Corpus was collected from transcribed conversations recorded between March 1997 and August 1998. About 230,000 Chinese words were collected in the annotated corpus. It contains recordings of spontaneous speech (51 texts) and radio programmes (42 texts), which involve 2 to 4 speakers, plus 1 text of monologue. The texts were word-segmented and annotated with part-of-speech tags and Cantonese pronunciation using the romanisation scheme of the Linguistic Society of Hong Kong (LSHK).
In an active e-commerce environment, customers process a large number of reviews when deciding whether to buy a product. Abstractive multi-review summarization aims to help users efficiently consume the reviews most relevant to them. We propose the first large-scale abstractive multi-review summarization dataset, which leverages more than 17.9 billion raw reviews and uses novel aspect-alignment techniques based on aspect annotations. Furthermore, we demonstrate that higher-quality review summaries can be generated with a novel aspect-alignment-based model. Results from both automatic and human evaluation show that the proposed dataset, together with the aspect-alignment model, can generate high-quality and trustworthy review summaries.
SportsSum is a Chinese sports game summarization dataset that contains 5,428 soccer games of live commentaries and the corresponding news articles.
A Chinese Mandarin speech corpus by Beijing DataTang Technology Co., Ltd, containing 200 hours of speech data from 600 speakers. The transcription accuracy of each sentence is above 98%. Aidatatang_200zh is a free Chinese Mandarin speech corpus provided by Beijing DataTang Technology Co., Ltd under the Creative Commons Attribution-NonCommercial-NoDerivatives 4.0 International Public License. The contents and corresponding descriptions of the corpus include: