9 dataset results for Text Summarization AND Texts AND Chinese

LCSTS is a large corpus of Chinese short text summarization dataset constructed from the Chinese microblogging website Sina Weibo, which is released to the public. This corpus consists of over 2 million real Chinese short texts with short summaries given by the author of each text. The authors also manually tagged the relevance of 10,666 short summaries with their corresponding short texts 10,666 short summaries with their corresponding short texts.

57 PAPERS • 2 BENCHMARKS

WikiLingua

WikiLingua includes ~770k article and summary pairs in 18 languages from WikiHow. Gold-standard article-summary alignments across languages are extracted by aligning the images that are used to describe each how-to step in an article.

50 PAPERS • 5 BENCHMARKS

XL-Sum

XL-Sum is a comprehensive and diverse dataset for abstractive summarization comprising 1 million professionally annotated article-summary pairs from BBC, extracted using a set of carefully designed heuristics. The dataset covers 44 languages ranging from low to high-resource, for many of which no public dataset is currently available. XL-Sum is highly abstractive, concise, and of high quality, as indicated by human and intrinsic evaluation.

44 PAPERS • NO BENCHMARKS YET

OASST1

OASST1 (OpenAssistant Conversations Dataset)

license: apache-2.0 tags: human-feedback size_categories: 100K<n<1M pretty_name: OpenAssistant Conversations

14 PAPERS • NO BENCHMARKS YET

CNewSum

CNewSum is a large-scale Chinese news summarization dataset which consists of 304,307 documents and human-written summaries for the news feed. It has long documents with high-abstractive summaries, which can encourage document-level understanding and generation for current summarization models. An additional distinguishing feature of CNewSum is that its test set contains adequacy and deducibility annotations for the summaries.

4 PAPERS • NO BENCHMARKS YET

TextBox 2.0

TextBox 2.0 is a comprehensive and unified library for text generation, focusing on the use of pre-trained language models (PLMs). The library covers 13 common text generation tasks and their corresponding 83 datasets and further incorporates 45 PLMs covering general, translation, Chinese, dialogue, controllable, distilled, prompting, and lightweight PLMs.

2 PAPERS • NO BENCHMARKS YET

CSL (Chinese Scientific Literature)

We present CSL, a large-scale Chinese Scientific Literature dataset, which contains the titles, abstracts, keywords and academic fields of 396,209 papers. To our knowledge, CSL is the first scientific document dataset in Chinese.

1 PAPER • NO BENCHMARKS YET

Wiki-zh

Wiki-zh is an annotated Chinese dataset for domain detection extracted from Wikipedia. It includes texts from 7 different domains: “Business and Commerce” (BUS), “Government and Politics” (GOV), “Physical and Mental Health” (HEA), “Law and Order” (LAW), “Lifestyle” (LIF), “Military” (MIL), and “General Purpose” (GEN). It contains 26,280 documents split into training, validation and test.

1 PAPER • NO BENCHMARKS YET

LSARS

LSARS (Large Scale Abstractive multi-Review Summarization)

In an active e-commerce environment, customers process a large number of reviews when deciding on whether to buy a product or not. Abstractive Multi-Review Summarization aims to assist users to efficiently consume the reviews that are the most relevant to them. We propose the first large-scale abstractive multi-review summarization dataset that leverages more than 17.9 billion raw reviews and uses novel aspect-alignment techniques based on aspect annotations. Furthermore, we demonstrate that one can generate higher-quality review summaries by using a novel aspect-alignment-based model. Results from both automatic and human evaluation show that the proposed dataset plus the innovative aspect-alignment model can generate high-quality and trustful review summaries.

0 PAPER • NO BENCHMARKS YET

Datasets

9 dataset results for Text Summarization AND Texts AND Chinese