In the paper, to bridge the research gap, we propose a new and important task, Profile-based Spoken Language Understanding (ProSLU), which requires a model to rely not only on the text but also on the given supporting profile information. We further introduce a Chinese human-annotated dataset with over 5K utterances annotated with intents and slots, along with the corresponding supporting profile information. In total, we provide three types of supporting profile information: (1) a Knowledge Graph (KG) consisting of entities with rich attributes, (2) a User Profile (UP) composed of user settings and information, and (3) Context Awareness (CA), i.e., user state and environmental information.
2 PAPERS • 3 BENCHMARKS
SSD (Sub-slot Dialog) is the dataset for the ACL 2022 paper "A Slot Is Not Built in One Utterance: Spoken Language Dialogs with Sub-Slots".
2 PAPERS • NO BENCHMARKS YET
TextBox 2.0 is a comprehensive and unified library for text generation, focusing on the use of pre-trained language models (PLMs). The library covers 13 common text generation tasks and their corresponding 83 datasets and further incorporates 45 PLMs covering general, translation, Chinese, dialogue, controllable, distilled, prompting, and lightweight PLMs.
VGaokao is a verification-style reading comprehension dataset designed for native speakers' evaluation.
Weibo-COV is a large-scale COVID-19 social media dataset from Weibo, covering more than 30 million posts from 1 November 2019 to 30 April 2020. The dataset also provides rich field information, including basic post information, interaction information, location information, and the retweet network.
The XL-R2R dataset is built upon the R2R dataset and extends it with Chinese instructions. XL-R2R preserves the same splits as in R2R and thus consists of train, val-seen, and val-unseen splits with both English and Chinese instructions, and test split with English instructions only.
The Archive Query Log (AQL) is a previously unused, comprehensive query log collected at the Internet Archive over the last 25 years. Its first version includes 356 million queries, 166 million search result pages, and 1.7 billion search results across 550 search providers. Although many query logs have been studied in the literature, the search providers that own them generally do not publish their logs to protect user privacy and vital business data. The AQL is the first publicly available query log that combines size, scope, and diversity, enabling research on new retrieval models and search engine analyses. Provided in a privacy-preserving manner, it promotes open research as well as more transparency and accountability in the search industry.
1 PAPER • NO BENCHMARKS YET
The Rich Annotated Mandarin Conversational (RAMC) speech dataset includes 180 hours of Mandarin Chinese dialogue, split into 150, 10, and 20 hours for the training, development, and test sets respectively. It contains 351 multi-turn dialogues, each of which is a coherent and compact conversation centered around one theme.
This paper analyses two hitherto unstudied sites sharing state-backed disinformation, Reliable Recent News (rrn.world) and WarOnFakes (waronfakes.com), which publish content in Arabic, Chinese, English, French, German, and Spanish.
The BWB corpus consists of Chinese novels translated by experts into English, and the annotated test set is designed to probe the ability of machine translation systems to model various discourse phenomena.
CC-Riddle is a Chinese character riddle dataset covering the majority of common simplified Chinese characters, built by crawling riddles from the Web and generating brand new ones. In the generation stage, the authors provide the Chinese phonetic alphabet, decomposition, and explanation of the solution character to the generation model and obtain multiple riddle descriptions for each tested character. The generated riddles are then manually filtered, so the final dataset, CC-Riddle, is composed of both human-written riddles and filtered generated riddles.
Chinese Character Stroke Extraction (CCSE) is a benchmark containing two large-scale datasets: Kaiti CCSE (CCSE-Kai) and Handwritten CCSE (CCSE-HW). It is designed for stroke extraction problems.
CHIP Clinical Trial Classification, a dataset aimed at classifying clinical trial eligibility criteria, the guidelines that determine whether a subject qualifies for a clinical trial, is used for the CHIP-CTC task. All text data are collected from the website of the Chinese Clinical Trial Registry (ChiCTR), and a total of 44 categories are defined. The task resembles text classification; although this is not a new task, studies and corpora for Chinese clinical trial criteria are still limited, and the dataset aims to promote future research for social benefit.
1 PAPER • 1 BENCHMARK
CHORD is the first chorus recognition dataset containing 627 songs for public use.
Chinese Medical Information Extraction, a dataset also released in CHIP 2020, is used for the CMeIE task. The task aims to identify both entities and relations in a sentence following the schema constraints. There are 53 relations defined in the dataset, including 10 synonymous sub-relationships and 43 other sub-relationships.
We present CSL, a large-scale Chinese Scientific Literature dataset, which contains the titles, abstracts, keywords and academic fields of 396,209 papers. To our knowledge, CSL is the first scientific document dataset in Chinese.
The Chinese Stock Policy Retrieval Dataset (CSPRD) contains a Chinese policy corpus of 10,002 articles and 709 prospectus examples from 545 companies listed on China's Science and Technology Innovation Board (STAR Market). CSPRD is bilingual in Chinese and English (translated by ChatGPT) and is annotated by experienced experts from the Shanghai Stock Exchange.
ChinaOpen is a new video dataset targeted at open-world multimodal learning, with raw data gathered from Bilibili, a popular Chinese video-sharing website. The dataset has a large webly annotated training set of videos (associated with user-generated titles and tags) and a smaller manually annotated test set of videos (with manually checked user titles / tags, manually written captions, and manual labels describing which visual objects / actions / scenes are shown in the visual content).
A large-scale Chinese legal dataset for judgment prediction. The dataset contains more than 2.6 million criminal cases published by the Supreme People's Court of China, several times more than other datasets used in existing work on judgment prediction.
Chinese Literature NER RE is a Discourse-Level Named Entity Recognition and Relation Extraction Dataset for Chinese Literature Text. It is constructed from hundreds of Chinese literature articles.
Conic10K is an open-ended math problem dataset on conic sections from Chinese senior high school education. It contains 10,861 carefully annotated problems, each with a formal representation, the corresponding text spans, the answer, and natural language rationales. The questions require long reasoning steps, while the topic is limited to conic sections. The dataset can be used to evaluate models on two tasks: semantic parsing and mathematical question answering (mathQA).
ConvSumX is a cross-lingual conversation summarization benchmark built with a new annotation schema that explicitly considers the source input context. It consists of 2 sub-tasks under different real-world scenarios, each covering 3 language directions.
This dataset accompanies the Mis2-KDD 2021 paper (under review): Dataset of Propaganda Techniques of the State-Sponsored Information Operation of the People's Republic of China.
DiaKG is a high-quality Chinese dataset for building a diabetes knowledge graph.
A fine-grained corpus for detecting, identifying, and correcting Chinese grammatical errors, collected mainly from multiple-choice questions in public school Chinese examinations, with multiple references. Online evaluation site for the test set: https://codalab.lisn.upsaclay.fr/competitions/8020
Fashion-MMT is a large-scale bilingual product description dataset containing over 114k noisy and 40k manually cleaned description translations with multiple product images.
We introduce HumanEval-XL, a massively multilingual code generation benchmark specifically crafted to address this deficiency. HumanEval-XL establishes connections between 23 NLs and 12 programming languages (PLs), and comprises a collection of 22,080 prompts with an average of 8.33 test cases. By ensuring parallel data across multiple NLs and PLs, HumanEval-XL offers a comprehensive evaluation platform for multilingual LLMs, allowing the assessment of the understanding of different NLs. Our work serves as a pioneering step towards filling the void in evaluating NL generalization in the area of multilingual code generation. We make our evaluation code and data publicly available at https://github.com/FloatAI/HumanEval-XL.
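Benchmarks in the HumanEval family are typically scored by executing a model's completion against the prompt's hidden test cases. The sketch below illustrates that execution-based protocol with a made-up prompt and tests, not actual HumanEval-XL items; real harnesses additionally sandbox the execution and enforce timeouts.

```python
# Minimal sketch of execution-based scoring for a HumanEval-style benchmark.
# The prompt, completions, and tests are illustrative stand-ins.

def passes_tests(prompt: str, completion: str, tests: list[str]) -> bool:
    """Concatenate prompt and completion, run the test assertions, report pass/fail."""
    program = prompt + completion + "\n" + "\n".join(tests)
    scope: dict = {}
    try:
        exec(program, scope)  # real harnesses run this sandboxed, with a timeout
        return True
    except Exception:
        return False

prompt = "def add(a, b):\n"
good = "    return a + b\n"
bad = "    return a - b\n"
tests = ["assert add(2, 3) == 5", "assert add(-1, 1) == 0"]

print(passes_tests(prompt, good, tests))  # True
print(passes_tests(prompt, bad, tests))   # False
```

Aggregating this pass/fail signal over many sampled completions per prompt yields the usual pass@k metric.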
IEE is a financial-domain dataset of the Insurance-entity extraction task. Its goal is to locate named entities mentioned in the input sentence.
JDDC 2.0 is a large-scale multimodal multi-turn dialogue dataset collected from a mainstream Chinese E-commerce platform JD.com, containing about 246 thousand dialogue sessions, 3 million utterances, and 507 thousand images, along with product knowledge bases and image category annotations. The dataset is divided into the training set, the validation set, and the test set according to the ratio of 80%, 10%, and 10%.
Large Scale Informal Chinese Corpus (LSICC) is a large-scale corpus of informal Chinese. The corpus contains around 37 million book reviews and 50 thousand netizens' comments on the news.
An open-source online generative dictionary that takes a word and context containing the word as input and automatically generates a definition as output. Incorporating state-of-the-art definition generation models, it supports not only Chinese and English, but also Chinese-English cross-lingual queries. Moreover, it has a user-friendly front-end design that can help users understand the query words quickly and easily.
The Live Comment Dataset is a large-scale dataset with 2,361 videos and 895,929 live comments that were written while the videos were streamed.
MTC is a financial-domain dataset of the multi-label topic classification task. It aims to identify the topics of the spoken dialogue.
MUSIED is a large-scale Chinese event detection dataset based on user reviews, text conversations, and phone conversations from a leading e-commerce platform for food service, designed for event detection tasks.
The benchmark assesses the real-time interactive consultation capabilities of LLMs across three critical dimensions, using Chinese medical records collected online from diverse departments.
This dataset is built around the MBTI personality model: there are four dimensions in MBTI, and each dimension has two opposite attributes.
MiniWob++ is a suite of web-browser based tasks introduced in Liu et al. (2018) (an extension of the earlier MiniWob task suite (Shi et al., 2017)). Tasks range from simple button clicking to complex form-filling, for example, to book a flight when given particular instructions (Fig. 1a). Programmatic rewards are available for each task, permitting standard reinforcement learning techniques.
Mint is a new Multilingual intimacy analysis dataset covering 13,384 tweets in 10 languages including English, French, Spanish, Italian, Portuguese, Korean, Dutch, Chinese, Hindi, and Arabic. The dataset is released along with the SemEval 2023 Task 9: Multilingual Tweet Intimacy Analysis.
MultiTACRED is a multilingual version of the large-scale TAC Relation Extraction Dataset. It covers 12 typologically diverse languages from 9 language families, and was created by the Speech & Language Technology group of DFKI by machine-translating the instances of the original TACRED dataset and automatically projecting their entity annotations. For details of the original TACRED's data collection and annotation process, see the Stanford paper. Translations are syntactically validated by checking the correctness of the XML tag markup. Any translations with an invalid tag structure, e.g. missing or invalid head or tail tag pairs, are discarded (on average, 2.3% of the instances).
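The tag-markup validation described above can be illustrated with a small check: a translated instance is kept only if its entity markup still contains exactly one well-formed head span and one tail span, with no stray tags. The `<head>`/`<tail>` tag names here are assumptions for illustration; the actual MultiTACRED markup may use different tags.

```python
import re

def has_valid_entity_markup(text: str) -> bool:
    """Illustrative validity check: exactly one matched head span and one
    matched tail span, and no leftover unmatched entity tags."""
    if len(re.findall(r"<head>.*?</head>", text)) != 1:
        return False
    if len(re.findall(r"<tail>.*?</tail>", text)) != 1:
        return False
    # removing the matched spans must leave no stray entity tags behind
    rest = re.sub(r"<head>.*?</head>|<tail>.*?</tail>", "", text)
    return not re.search(r"</?(?:head|tail)>", rest)

ok = "<head>Bill Gates</head> founded <tail>Microsoft</tail>."
broken = "<head>Bill Gates founded <tail>Microsoft</tail>."  # missing </head>
print(has_valid_entity_markup(ok), has_valid_entity_markup(broken))  # True False
```

Instances failing such a check would be the ones discarded during post-translation filtering.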
NaSGEC is a new dataset to facilitate research on Chinese grammatical error correction (CGEC) for native speaker texts from multiple domains. Previous CGEC research primarily focuses on correcting texts from a single domain, especially learner essays.
OQRanD and OQGenD are datasets for the paper "Asking the Crowd: Question Analysis, Evaluation and Generation for Open Discussion on Online Forums" by Zi Chai, Xinyu Xing, Xiaojun Wan, and Bo Huang, accepted at ACL 2019.
PSM is a financial-domain dataset of the pairwise search matching task. It aims to identify the semantic similarity of a sentence pair in the search scenario.
PTVD is a plot-oriented multimodal dataset in the TV domain. It is also the first non-English dataset of its kind. Additionally, PTVD contains more than 26 million bullet screen comments (BSCs), powering large-scale pre-training.
The Pan+ChiPhoto dataset is a Chinese character dataset built by combining two datasets: ChiPhoto and the Pan_Chinese_Character dataset. The images are mainly captured outdoors in Beijing and Shanghai, China, and involve various scenes such as signs, boards, advertisements, banners, and objects with text printed on their surfaces.
A high-quality large-scale dataset consisting of 49,000+ data samples for the task of Chinese query-based document summarization.
A multilingual explainable fact-checking dataset on the 2022 Russia-Ukraine conflict.
SLING consists of 38K minimal sentence pairs in Mandarin Chinese grouped into 9 high-level linguistic phenomena. Each pair demonstrates the acceptability contrast of a specific syntactic or semantic phenomenon (e.g., The keys are lost vs. The keys is lost), and an LM should assign lower perplexity to the acceptable sentence.
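The minimal-pair protocol above can be sketched with a deliberately tiny add-one-smoothed bigram LM: given an acceptable/unacceptable pair, the model should assign the acceptable sentence a lower perplexity. Real SLING evaluation uses pretrained Chinese LMs; the toy English corpus and model below are only to make the comparison mechanics concrete.

```python
import math
from collections import Counter

# Toy corpus; the acceptable pattern "keys are" occurs, "keys is" does not.
corpus = [
    "the keys are lost",
    "the keys are here",
    "the key is lost",
    "the door is open",
]

unigrams = Counter()
bigrams = Counter()
for sent in corpus:
    toks = ["<s>"] + sent.split() + ["</s>"]
    unigrams.update(toks[:-1])
    bigrams.update(zip(toks[:-1], toks[1:]))
vocab = len(set(unigrams) | {w for _, w in bigrams})

def perplexity(sentence: str) -> float:
    """Per-token perplexity under an add-one-smoothed bigram model."""
    toks = ["<s>"] + sentence.split() + ["</s>"]
    log_prob = 0.0
    for prev, cur in zip(toks[:-1], toks[1:]):
        p = (bigrams[(prev, cur)] + 1) / (unigrams[prev] + vocab)
        log_prob += math.log(p)
    return math.exp(-log_prob / (len(toks) - 1))

# The acceptable member of the minimal pair should score lower perplexity.
print(perplexity("the keys are lost") < perplexity("the keys is lost"))  # True
```

SLING's accuracy metric is simply the fraction of minimal pairs for which this comparison comes out in favor of the acceptable sentence.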