The dataset is recorded with an on-vehicle ZED stereo camera in both urban and rural environments
NaSGEC is a new dataset to facilitate research on Chinese grammatical error correction (CGEC) for native speaker texts from multiple domains. Previous CGEC research primarily focuses on correcting texts from a single domain, especially learner essays.
Datasets OQRanD and OQGenD for the paper "Asking the Crowd: Question Analysis, Evaluation and Generation for Open Discussion on Online Forums" by Zi Chai, Xinyu Xing, Xiaojun Wan and Bo Huang, accepted at ACL 2019.
The dataset consists of 20 videos, each 5.5 minutes long. The videos are captured at a resolution of 1024x1024 and 30 frames per second. Each video contains a single pig performing the Novel Object Recognition task.
PSM is a financial-domain dataset of the pairwise search matching task. It aims to identify the semantic similarity of a sentence pair in the search scenario.
PTVD is a plot-oriented multimodal dataset in the TV domain. It is also the first non-English dataset of its kind. Additionally, PTVD contains more than 26 million bullet screen comments (BSCs), powering large-scale pre-training.
The Pan+ChiPhoto dataset is a Chinese character dataset built by combining two datasets: ChiPhoto and Pan_Chinese_Character. The images are mainly captured outdoors in Beijing and Shanghai, China, and cover various scenes such as signs, boards, advertisements, banners, and objects with text printed on their surfaces.
Perseus is a dataset for Cross-Lingual Summarization (CLS) which collects about 94K Chinese scientific documents paired with English summaries. The average length of documents in Perseus is more than two thousand tokens.
A high-quality large-scale dataset consisting of 49,000+ data samples for the task of Chinese query-based document summarization.
A multilingual explainable fact-checking dataset on the 2022 Russia-Ukraine conflict.
SLING consists of 38K minimal sentence pairs in Mandarin Chinese grouped into 9 high-level linguistic phenomena. Each pair demonstrates the acceptability contrast of a specific syntactic or semantic phenomenon (e.g., The keys are lost vs. The keys is lost), and an LM should assign lower perplexity to the acceptable sentence.
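The pairwise evaluation SLING describes can be sketched as a perplexity comparison between the two sentences of a minimal pair. In this minimal sketch, the per-token log-probabilities are invented numbers for illustration, not outputs of any real LM:

```python
import math

def perplexity(token_logprobs):
    """Perplexity from a sentence's per-token natural-log probabilities."""
    avg_nll = -sum(token_logprobs) / len(token_logprobs)
    return math.exp(avg_nll)

# Hypothetical per-token log-probs an LM might assign to a minimal pair.
acceptable   = [-2.1, -0.9, -1.4, -0.7]   # "The keys are lost"
unacceptable = [-2.1, -0.9, -3.8, -0.7]   # "The keys is lost"

# The LM "passes" the pair if the acceptable sentence gets lower perplexity.
passes = perplexity(acceptable) < perplexity(unacceptable)
```

Equivalently, since both sentences in a minimal pair have the same length, comparing summed log-probabilities would give the same verdict.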
SSD (Sub-slot Dialog) dataset: This is the dataset for the ACL 2022 paper "A Slot Is Not Built in One Utterance: Spoken Language Dialogs with Sub-Slots".
Photometric stereo test datasets captured under six lights using laboratory equipment. Note that this dataset has no ground truth (GT).
This dataset consists of 2,192 high-quality traditional Chinese landscape paintings (中国山水画). All paintings are sized 512x512 and come from the following sources:

* Princeton University Art Museum, 362 paintings
* Harvard University Art Museum, 101 paintings
* Metropolitan Museum of Art, 428 paintings
* Smithsonian's Freer Gallery of Art, 1,301 paintings
https://github.com/zzr-idam/Under-Display-Camera-UAV
UNER v1 adds an NER annotation layer to 18 datasets (primarily treebanks from UD) and covers 12 genealogically and typologically diverse languages: Cebuano, Danish, German, English, Croatian, Portuguese, Russian, Slovak, Serbian, Swedish, Tagalog, and Chinese. Overall, UNER v1 contains nine full datasets with training, development, and test splits over eight languages, three evaluation sets for lower-resource languages (TL and CEB), and a parallel evaluation benchmark spanning six languages.
A high-resolution version of VGGFace2 for academic face editing purposes. This project uses GFPGAN for image restoration and insightface for data preprocessing (crop and align).
VTQA is a dataset containing open-ended questions about image-text pairs. It requires a model to align multimedia representations of the same entity, perform multi-hop reasoning between image and text, and finally answer the question in natural language. The aim of the dataset is to develop and benchmark models capable of multimedia entity alignment, multi-step reasoning, and open-ended answer generation. VTQA consists of 10,238 image-text pairs and 27,317 questions. The images are real images from the MSCOCO dataset and contain a variety of entities. Annotators first annotate text relevant to the image, then ask questions based on the image-text pair, and finally provide open-ended answers.
Voice Navigation is a large-scale dataset of Chinese speech for slot filling, containing more than 830,000 samples.
WEATHub is a dataset covering 24 languages. It contains words organized into groups of (target1, target2, attribute1, attribute2) to measure the association target1:target2 :: attribute1:attribute2. For example, target1 could be insects and target2 flowers, and we might measure whether insects or flowers are perceived as pleasant or unpleasant. Word associations are quantified using the WEAT metric, which calculates an effect size (Cohen's d) and a p-value to measure the statistical significance of the results. In our paper, we use word embeddings from language models to perform these tests and understand biased associations in language models across different languages.
This dataset is used for user identity linkage across two Chinese online social networks. It covers two popular Chinese social platforms: Sina Weibo (https://weibo.com) and Douban (https://www.douban.com).
Wiki-zh is an annotated Chinese dataset for domain detection extracted from Wikipedia. It includes texts from 7 different domains: “Business and Commerce” (BUS), “Government and Politics” (GOV), “Physical and Mental Health” (HEA), “Law and Order” (LAW), “Lifestyle” (LIF), “Military” (MIL), and “General Purpose” (GEN). It contains 26,280 documents split into training, validation and test.
The XiaChuFang Recipe Corpus contains recipes from 下厨房 (XiaChuFang), a popular Chinese recipe-sharing website. The full corpus contains 1,520,327 Chinese recipes. Among them, 1,242,206 recipes belong to 30,060 dishes, so a dish has 41.3 recipes on average.
XinhuaHallucinations is part of the UHGEval benchmark and contains over 5,000 news items. It can be used for hallucination evaluation or detection tasks.
IMO-level geometry problems with complete natural language descriptions, geometric shapes, formal language annotations, and theorem-sequence annotations.
A high-quality, balanced dataset of 330,000 images featuring various types of Chinese license plates. The dataset is generated using Generative Adversarial Networks (GANs), ensuring excellent image quality and a balanced distribution of different license plate types. This dataset is perfect for training and evaluating license plate recognition models.
A Chinese and Naxi scene text detection dataset, with labelme annotations exported to JSON.
The Hong Kong Cantonese Corpus was collected from transcribed conversations recorded between March 1997 and August 1998. About 230,000 Chinese words were collected in the annotated corpus. It contains recordings of spontaneous speech (51 texts) and radio programmes (42 texts), which involve 2 to 4 speakers, plus 1 text of monologue. The texts were word-segmented and annotated with part-of-speech tags and Cantonese pronunciation using the romanisation scheme of the Linguistic Society of Hong Kong (LSHK).
In an active e-commerce environment, customers process a large number of reviews when deciding whether to buy a product. Abstractive multi-review summarization aims to help users efficiently consume the reviews most relevant to them. We propose the first large-scale abstractive multi-review summarization dataset, which leverages more than 17.9 billion raw reviews and uses novel aspect-alignment techniques based on aspect annotations. Furthermore, we demonstrate that higher-quality review summaries can be generated with a novel aspect-alignment-based model. Results from both automatic and human evaluation show that the proposed dataset, together with the aspect-alignment model, can generate high-quality and trustworthy review summaries.
SportsSum is a Chinese sports game summarization dataset that contains 5,428 soccer games of live commentaries and the corresponding news articles.
A Chinese Mandarin speech corpus by Beijing DataTang Technology Co., Ltd, containing 200 hours of speech data from 600 speakers. The transcription accuracy of each sentence is above 98%. Aidatatang_200zh is a free Chinese Mandarin speech corpus provided by Beijing DataTang Technology Co., Ltd under the Creative Commons Attribution-NonCommercial-NoDerivatives 4.0 International Public License. The contents and corresponding descriptions of the corpus include: