🔔 Share your dataset with the ML community!

Filter by Modality

Filter by Task (clear)

Filter by Language

88 dataset results for Text Summarization

Gazeta is a dataset for automatic summarization of Russian news. The dataset consists of 63,435 text-summary pairs. To form training, validation, and test datasets, these pairs were sorted by time and the first 52,400 pairs are used as the training dataset, the proceeding 5,265 pairs as the validation dataset, and the remaining 5,770 pairs as the test dataset.

4 PAPERS • 1 BENCHMARK

IndoNLG

IndoNLG is a benchmark to measure natural language generation (NLG) progress in three low-resource—yet widely spoken—languages of Indonesia: Indonesian, Javanese, and Sundanese. Altogether, these languages are spoken by more than 100 million native speakers, and hence constitute an important use case of NLG systems today. Concretely, IndoNLG covers six tasks: summarization, question answering, chit-chat, and three different pairs of machine translation (MT) tasks.

4 PAPERS • NO BENCHMARKS YET

Klexikon

Klexikon (Klexikon: A German Dataset for Joint Summarization and Simplification)

The dataset introduces document alignments between German Wikipedia and the children's lexicon Klexikon. The source texts in Wikipedia are both written in a more complex language than Klexikon, and also significantly longer, which makes this a suitable application for both summarization and simplification. In fact, previous research has so far only focused on either of the two, but not comprehensively been studied as a joint task.

4 PAPERS • 1 BENCHMARK

CHQ-Summ (Consumer Healthcare Question Summarization)

Contains 1507 domain-expert annotated consumer health questions and corresponding summaries. The dataset is derived from the community question answering forum and therefore provides a valuable resource for understanding consumer health-related posts on social media.

3 PAPERS • NO BENCHMARKS YET

EurekaAlert

EurekaAlert (Eureka Alert)

This dataset contains around 5000 scholarly articles and their corresponding easy summary from eureka alert blog, the dataset can be used for the combined task of summarization and simplification.

3 PAPERS • 1 BENCHMARK

PerKey

A corpus of 553k news articles from six Persian news websites and agencies with relatively high quality author extracted keyphrases, which is then filtered and cleaned to achieve higher quality keyphrases.

3 PAPERS • NO BENCHMARKS YET

PubMedCite

PubMedCite is a domain-specific dataset with about 192K biomedical scientific papers and a large citation graph preserving 917K citation relationships between them. It is characterized by preserving the salient contents extracted from full texts of references, and the weighted correlation between the salient.

3 PAPERS • NO BENCHMARKS YET

CELLS

CELLS is a large (63k pairs) and broadest-ranging (12 journals) parallel corpus for lay language generation. The abstract and the corresponding lay language summary are written by domain experts, assuring the quality of the dataset.

2 PAPERS • NO BENCHMARKS YET

DUC 2007

DUC 2007 (Document Understanding Conferences)

There is currently much interest and activity aimed at building powerful multi-purpose information systems. The agencies involved include DARPA, ARDA and NIST. Their programmes, for example DARPA's TIDES (Translingual Information Detection Extraction and Summarization) programme, ARDA's Advanced Question & Answering Program and NIST's TREC (Text Retrieval Conferences) programme cover a range of subprogrammes. These focus on different tasks requiring their own evaluation designs.

2 PAPERS • NO BENCHMARKS YET

DaNewsroom

DaNewsroom (DaNewsroom: A Large-scale Danish Summarisation Dataset)

The first large-scale non-English language dataset specifically curated for automatic summarisation. The document-summary pairs are news articles and manually written summaries in the Danish language.

2 PAPERS • NO BENCHMARKS YET

Elsevier OA CC-BY

An open corpus of Scientific Research papers which has a representative sample from across scientific disciplines. This corpus not only includes the full text of the article, but also the metadata of the documents, along with the bibliographic information for each reference.

2 PAPERS • NO BENCHMARKS YET

HumSet

Timely and effective response to humanitarian crises requires quick and accurate analysis of large amounts of text data, a process that can highly benefit from expert-assisted NLP systems trained on validated and annotated data in the humanitarian response domain. To enable creation of such NLP systems, we introduce and release HumSet, a novel and rich multilingual dataset of humanitarian response documents annotated by experts in the humanitarian response community. The dataset provides documents in three languages (English, French, Spanish) and covers a variety of humanitarian crises from 2018 to 2021 across the globe. For each document, HumSet provides selected snippets (entries) as well as assigned classes to each entry annotated using common humanitarian information analysis frameworks. HumSet also provides novel and challenging entry extraction and multi-label entry classification tasks. In this paper, we take a first step towards approaching these tasks and conduct a set of expe

2 PAPERS • NO BENCHMARKS YET

OASum

OASum is a large-scale open-domain aspect-based summarization dataset which contains more than 3.7 million instances with around 1 million different aspects on 2 million Wikipedia pages.

2 PAPERS • NO BENCHMARKS YET

OpenAsp

OpenAsp Dataset OpenAsp is an Open Aspect-based Multi-Document Summarization dataset derived from DUC and MultiNews summarization datasets.

2 PAPERS • NO BENCHMARKS YET

SuMe

SuMe (A Dataset Towards Summarizing Biomedical Mechanisms)

Can language models read biomedical texts and explain the biomedical mechanisms discussed? In this work we introduce a biomedical mechanism summarization task. Biomedical studies often investigate the mechanisms behind how one entity (e.g., a protein or a chemical) affects another in a biological context. The abstracts of these publications often include a focused set of sentences that present relevant supporting statements regarding such relationships, associated experimental evidence, and a concluding sentence that summarizes the mechanism underlying the relationship. We leverage this structure and create a summarization task, where the input is a collection of sentences and the main entities in an abstract, and the output includes the relationship and a sentence that summarizes the mechanism. Using a small amount of manually labeled mechanism sentences, we train a mechanism sentence classifier to filter a large biomedical abstract collection and create a summarization dataset with 2

2 PAPERS • NO BENCHMARKS YET

TextBox 2.0

TextBox 2.0 is a comprehensive and unified library for text generation, focusing on the use of pre-trained language models (PLMs). The library covers 13 common text generation tasks and their corresponding 83 datasets and further incorporates 45 PLMs covering general, translation, Chinese, dialogue, controllable, distilled, prompting, and lightweight PLMs.

2 PAPERS • NO BENCHMARKS YET

pn-summary

Pn-summary is a dataset for Persian abstractive text summarization.

2 PAPERS • NO BENCHMARKS YET

BrWac2Wiki

This is a dataset for multi-document summarization in Portuguese, what means that it has examples of multiple documents (input) related to human-written summaries (output). In particular, it has entries of multiple related texts from Brazilian websites about a subject, and the summary is the Portuguese Wikipedia lead section on the same subject (lead: the first section, i.e., summary, of any Wipedia article). Input texts were extracted from BrWac corpus, and the output from Brazilian Wikipedia dumps page.

1 PAPER • NO BENCHMARKS YET

CL-SciSumm

1 PAPER • 2 BENCHMARKS

CSL (Chinese Scientific Literature)

We present CSL, a large-scale Chinese Scientific Literature dataset, which contains the titles, abstracts, keywords and academic fields of 396,209 papers. To our knowledge, CSL is the first scientific document dataset in Chinese.

1 PAPER • NO BENCHMARKS YET

ComSum

ComSum is a data set of 7 million commit messages for text summarization. When documenting commits, software code changes, both a message and its summary are posted. These messages are gathered and filtered to curate developers' work summarization data set.

1 PAPER • NO BENCHMARKS YET

CrossSum-IN

Given an English article, generate a short summary in the target language.

1 PAPER • NO BENCHMARKS YET

DUC 2006

DUC 2006 (Document Understanding Conferences)

1 PAPER • NO BENCHMARKS YET

FIB (Factual Inconsistency Benchmark)

Factual Inconsistency Benchmark (FIB) is a benchmark that focuses on the task of summarization. Specifically, the benchmark involves comparing the scores an LLM assigns to a factually consistent versus a factual inconsistent summary for an input news article. For factually consistent summaries, human-written reference summaries are used to manually verify as factually consistent.

1 PAPER • NO BENCHMARKS YET

Famous Keyword Twitter Replies

The "Famous Keyword Twitter Replies Dataset" is a comprehensive collection of Twitter data that focuses on popular keywords and their associated replies. This dataset contains five essential columns that provide valuable insights into the Twitter conversation dynamics:

1 PAPER • NO BENCHMARKS YET

Inshorts News (Inshorts English News dataset)

Inshorts News dataset Inshorts provides a news summary in 60 words or less. Inshorts is a news service that offers short summaries of news from around the web. This dataset contains headlines and a summary of news items and their source.

1 PAPER • 1 BENCHMARK

MOS Dataset (Microblog Opinion Summarisation)

This dataset was used in the paper 'Template-based Abstractive Microblog Opinion Summarisation' (to be published at TACL, 2022). The data is structured as follows: each file represents a cluster of tweets which contains the tweet IDs and a summary of the tweets written by journalists. The gold standard summary follows a template structure and depending on its opinion content, it contains a main story, majority opinion (if any) and/or minority opinions (if any).

1 PAPER • NO BENCHMARKS YET

MentSum

MentSum (Mental Health Summarization Dataset)

Mental health remains a significant challenge of public health worldwide. With increasing popularity of online platforms, many use the platforms to share their mental health conditions, express their feelings, and seek help from the community and counselors. While posts are of varying length, it is beneficial to provide a short, but informative summary for fast processing by the counselors. To facilitate research in summarization of mental health online posts, we introduce Mental Health Summarization dataset, MentSum, containing over 24k carefully selected user posts from Reddit, along with their short user-written summary (called TLDR) in English from 43 mental health subreddits.

1 PAPER • 1 BENCHMARK

MultiSum

MultiSum is a dataset for multimodal summarization (MSMO). It consists of 17 categories and 170 subcategories to encapsulate a diverse array of real-world scenarios. The dataset features:

1 PAPER • NO BENCHMARKS YET

PMC-SA

PMC-SA (PMC Structured Abstracts)

PMC-SA (PMC Structured Abstracts) is a dataset of academic publications, used for the task of structured summarization.

1 PAPER • NO BENCHMARKS YET

Proto Summ

This is a large-scale court judgment dataset, where each judgment is a summary of the case description with a patternized style. It contains 2,003,390 court judgment documents. The case description is used as the input, and the court judgment as the summary. The average lengths of the input documents and summaries are 595.15 words and 273.57 words respectively.

1 PAPER • NO BENCHMARKS YET

Robust Summarization Evaluation Benchmark

Robust Summarization Evaluation Benchmark is a large human evaluation dataset consisting of over 22k summary-level annotations over state-of-the-art systems on three datasets.

1 PAPER • NO BENCHMARKS YET

SubSumE

SubSumE Dataset This repository contains the SubSumE dataset for subjective document summarization. See the paper and the talk for details on dataset creation. Also check out our work SuDocu on example-based document summarization.

1 PAPER • NO BENCHMARKS YET

Wiki-en

Wiki-en is an annotated English dataset for domain detection extracted from Wikipedia. It includes texts from 7 different domains: “Business and Commerce” (BUS), “Government and Politics” (GOV), “Physical and Mental Health” (HEA), “Law and Order” (LAW), “Lifestyle” (LIF), “Military” (MIL), and “General Purpose” (GEN).

1 PAPER • NO BENCHMARKS YET

Wiki-zh

Wiki-zh is an annotated Chinese dataset for domain detection extracted from Wikipedia. It includes texts from 7 different domains: “Business and Commerce” (BUS), “Government and Politics” (GOV), “Physical and Mental Health” (HEA), “Law and Order” (LAW), “Lifestyle” (LIF), “Military” (MIL), and “General Purpose” (GEN). It contains 26,280 documents split into training, validation and test.

1 PAPER • NO BENCHMARKS YET

WikiDes

WikiDes is a dataset for generating descriptions of Wikidata from Wikipedia paragraphs.

1 PAPER • NO BENCHMARKS YET

WikiWeb2M (Wikipedia Webpage 2M)

Wikipedia Webpage 2M (WikiWeb2M) is a multimodal open source dataset consisting of over 2 million English Wikipedia articles. It is created by rescraping the ∼2M English articles in WIT. Each webpage sample includes the page URL and title, section titles, text, and indices, images and their captions.

1 PAPER • NO BENCHMARKS YET

prompt-opin-summ

Click to add a brief description of the dataset (Markdown and LaTeX enabled).

1 PAPER • NO BENCHMARKS YET

public_meetings

The public_meetings corpus contains meetings, made of pairs of automatic transcriptions from audio recordings and meeting reports written by a professional. 22 aligned meetings are provided in total.

1 PAPER • NO BENCHMARKS YET

LSARS

LSARS (Large Scale Abstractive multi-Review Summarization)

In an active e-commerce environment, customers process a large number of reviews when deciding on whether to buy a product or not. Abstractive Multi-Review Summarization aims to assist users to efficiently consume the reviews that are the most relevant to them. We propose the first large-scale abstractive multi-review summarization dataset that leverages more than 17.9 billion raw reviews and uses novel aspect-alignment techniques based on aspect annotations. Furthermore, we demonstrate that one can generate higher-quality review summaries by using a novel aspect-alignment-based model. Results from both automatic and human evaluation show that the proposed dataset plus the innovative aspect-alignment model can generate high-quality and trustful review summaries.

0 PAPER • NO BENCHMARKS YET

Datasets

88 dataset results for Text Summarization