🔔 Share your dataset with the ML community!

Filter by Modality (clear)

Filter by Task (clear)

Filter by Language

71 dataset results for Text Summarization AND Texts

Timely and effective response to humanitarian crises requires quick and accurate analysis of large amounts of text data, a process that can highly benefit from expert-assisted NLP systems trained on validated and annotated data in the humanitarian response domain. To enable creation of such NLP systems, we introduce and release HumSet, a novel and rich multilingual dataset of humanitarian response documents annotated by experts in the humanitarian response community. The dataset provides documents in three languages (English, French, Spanish) and covers a variety of humanitarian crises from 2018 to 2021 across the globe. For each document, HumSet provides selected snippets (entries) as well as assigned classes to each entry annotated using common humanitarian information analysis frameworks. HumSet also provides novel and challenging entry extraction and multi-label entry classification tasks. In this paper, we take a first step towards approaching these tasks and conduct a set of expe

2 PAPERS • NO BENCHMARKS YET

OASum

OASum is a large-scale open-domain aspect-based summarization dataset which contains more than 3.7 million instances with around 1 million different aspects on 2 million Wikipedia pages.

2 PAPERS • NO BENCHMARKS YET

OpenAsp

OpenAsp Dataset OpenAsp is an Open Aspect-based Multi-Document Summarization dataset derived from DUC and MultiNews summarization datasets.

2 PAPERS • NO BENCHMARKS YET

TextBox 2.0

TextBox 2.0 is a comprehensive and unified library for text generation, focusing on the use of pre-trained language models (PLMs). The library covers 13 common text generation tasks and their corresponding 83 datasets and further incorporates 45 PLMs covering general, translation, Chinese, dialogue, controllable, distilled, prompting, and lightweight PLMs.

2 PAPERS • NO BENCHMARKS YET

pn-summary

Pn-summary is a dataset for Persian abstractive text summarization.

2 PAPERS • NO BENCHMARKS YET

CSL (Chinese Scientific Literature)

We present CSL, a large-scale Chinese Scientific Literature dataset, which contains the titles, abstracts, keywords and academic fields of 396,209 papers. To our knowledge, CSL is the first scientific document dataset in Chinese.

1 PAPER • NO BENCHMARKS YET

ComSum

ComSum is a data set of 7 million commit messages for text summarization. When documenting commits, software code changes, both a message and its summary are posted. These messages are gathered and filtered to curate developers' work summarization data set.

1 PAPER • NO BENCHMARKS YET

CrossSum-IN

Given an English article, generate a short summary in the target language.

1 PAPER • NO BENCHMARKS YET

DUC 2006

DUC 2006 (Document Understanding Conferences)

There is currently much interest and activity aimed at building powerful multi-purpose information systems. The agencies involved include DARPA, ARDA and NIST. Their programmes, for example DARPA's TIDES (Translingual Information Detection Extraction and Summarization) programme, ARDA's Advanced Question & Answering Program and NIST's TREC (Text Retrieval Conferences) programme cover a range of subprogrammes. These focus on different tasks requiring their own evaluation designs.

1 PAPER • NO BENCHMARKS YET

FIB (Factual Inconsistency Benchmark)

Factual Inconsistency Benchmark (FIB) is a benchmark that focuses on the task of summarization. Specifically, the benchmark involves comparing the scores an LLM assigns to a factually consistent versus a factual inconsistent summary for an input news article. For factually consistent summaries, human-written reference summaries are used to manually verify as factually consistent.

1 PAPER • NO BENCHMARKS YET

Famous Keyword Twitter Replies

The "Famous Keyword Twitter Replies Dataset" is a comprehensive collection of Twitter data that focuses on popular keywords and their associated replies. This dataset contains five essential columns that provide valuable insights into the Twitter conversation dynamics:

1 PAPER • NO BENCHMARKS YET

MentSum

MentSum (Mental Health Summarization Dataset)

Mental health remains a significant challenge of public health worldwide. With increasing popularity of online platforms, many use the platforms to share their mental health conditions, express their feelings, and seek help from the community and counselors. While posts are of varying length, it is beneficial to provide a short, but informative summary for fast processing by the counselors. To facilitate research in summarization of mental health online posts, we introduce Mental Health Summarization dataset, MentSum, containing over 24k carefully selected user posts from Reddit, along with their short user-written summary (called TLDR) in English from 43 mental health subreddits.

1 PAPER • 1 BENCHMARK

MultiSum

MultiSum is a dataset for multimodal summarization (MSMO). It consists of 17 categories and 170 subcategories to encapsulate a diverse array of real-world scenarios. The dataset features:

1 PAPER • NO BENCHMARKS YET

PMC-SA

PMC-SA (PMC Structured Abstracts)

PMC-SA (PMC Structured Abstracts) is a dataset of academic publications, used for the task of structured summarization.

1 PAPER • NO BENCHMARKS YET

Proto Summ

This is a large-scale court judgment dataset, where each judgment is a summary of the case description with a patternized style. It contains 2,003,390 court judgment documents. The case description is used as the input, and the court judgment as the summary. The average lengths of the input documents and summaries are 595.15 words and 273.57 words respectively.

1 PAPER • NO BENCHMARKS YET

Robust Summarization Evaluation Benchmark

Robust Summarization Evaluation Benchmark is a large human evaluation dataset consisting of over 22k summary-level annotations over state-of-the-art systems on three datasets.

1 PAPER • NO BENCHMARKS YET

SubSumE

SubSumE Dataset This repository contains the SubSumE dataset for subjective document summarization. See the paper and the talk for details on dataset creation. Also check out our work SuDocu on example-based document summarization.

1 PAPER • NO BENCHMARKS YET

Wiki-en

Wiki-en is an annotated English dataset for domain detection extracted from Wikipedia. It includes texts from 7 different domains: “Business and Commerce” (BUS), “Government and Politics” (GOV), “Physical and Mental Health” (HEA), “Law and Order” (LAW), “Lifestyle” (LIF), “Military” (MIL), and “General Purpose” (GEN).

1 PAPER • NO BENCHMARKS YET

Wiki-zh

Wiki-zh is an annotated Chinese dataset for domain detection extracted from Wikipedia. It includes texts from 7 different domains: “Business and Commerce” (BUS), “Government and Politics” (GOV), “Physical and Mental Health” (HEA), “Law and Order” (LAW), “Lifestyle” (LIF), “Military” (MIL), and “General Purpose” (GEN). It contains 26,280 documents split into training, validation and test.

1 PAPER • NO BENCHMARKS YET

WikiWeb2M (Wikipedia Webpage 2M)

Wikipedia Webpage 2M (WikiWeb2M) is a multimodal open source dataset consisting of over 2 million English Wikipedia articles. It is created by rescraping the ∼2M English articles in WIT. Each webpage sample includes the page URL and title, section titles, text, and indices, images and their captions.

1 PAPER • NO BENCHMARKS YET

prompt-opin-summ

Click to add a brief description of the dataset (Markdown and LaTeX enabled).

1 PAPER • NO BENCHMARKS YET

public_meetings

The public_meetings corpus contains meetings, made of pairs of automatic transcriptions from audio recordings and meeting reports written by a professional. 22 aligned meetings are provided in total.

1 PAPER • NO BENCHMARKS YET

LSARS

LSARS (Large Scale Abstractive multi-Review Summarization)

In an active e-commerce environment, customers process a large number of reviews when deciding on whether to buy a product or not. Abstractive Multi-Review Summarization aims to assist users to efficiently consume the reviews that are the most relevant to them. We propose the first large-scale abstractive multi-review summarization dataset that leverages more than 17.9 billion raw reviews and uses novel aspect-alignment techniques based on aspect annotations. Furthermore, we demonstrate that one can generate higher-quality review summaries by using a novel aspect-alignment-based model. Results from both automatic and human evaluation show that the proposed dataset plus the innovative aspect-alignment model can generate high-quality and trustful review summaries.

0 PAPER • NO BENCHMARKS YET

Datasets

71 dataset results for Text Summarization AND Texts