WikiLingua includes ~770k article and summary pairs in 18 languages from WikiHow. Gold-standard article-summary alignments across languages are extracted by aligning the images that are used to describe each how-to step in an article.
50 PAPERS • 5 BENCHMARKS
XL-Sum is a comprehensive and diverse dataset for abstractive summarization comprising 1 million professionally annotated article-summary pairs from BBC, extracted using a set of carefully designed heuristics. The dataset covers 44 languages ranging from low to high-resource, for many of which no public dataset is currently available. XL-Sum is highly abstractive, concise, and of high quality, as indicated by human and intrinsic evaluation.
43 PAPERS • NO BENCHMARKS YET
Global Voices is a multilingual dataset for evaluating cross-lingual summarization methods. It is extracted from social-network descriptions of Global Voices news articles to cheaply collect evaluation data for into-English and from-English summarization in 15 languages.
8 PAPERS • NO BENCHMARKS YET
This is a dataset for multi-document summarization in Portuguese, what means that it has examples of multiple documents (input) related to human-written summaries (output). In particular, it has entries of multiple related texts from Brazilian websites about a subject, and the summary is the Portuguese Wikipedia lead section on the same subject (lead: the first section, i.e., summary, of any Wipedia article). Input texts were extracted from BrWac corpus, and the output from Brazilian Wikipedia dumps page.
1 PAPER • NO BENCHMARKS YET