Timely and effective response to humanitarian crises requires quick and accurate analysis of large amounts of text data, a process that can highly benefit from expert-assisted NLP systems trained on validated and annotated data in the humanitarian response domain. To enable creation of such NLP systems, we introduce and release HumSet, a novel and rich multilingual dataset of humanitarian response documents annotated by experts in the humanitarian response community. The dataset provides documents in three languages (English, French, Spanish) and covers a variety of humanitarian crises from 2018 to 2021 across the globe. For each document, HumSet provides selected snippets (entries) as well as assigned classes to each entry annotated using common humanitarian information analysis frameworks. HumSet also provides novel and challenging entry extraction and multi-label entry classification tasks. In this paper, we take a first step towards approaching these tasks and conduct a set of expe
2 PAPERS • NO BENCHMARKS YET
OASum is a large-scale open-domain aspect-based summarization dataset which contains more than 3.7 million instances with around 1 million different aspects on 2 million Wikipedia pages.
OpenAsp Dataset OpenAsp is an Open Aspect-based Multi-Document Summarization dataset derived from DUC and MultiNews summarization datasets.
TextBox 2.0 is a comprehensive and unified library for text generation, focusing on the use of pre-trained language models (PLMs). The library covers 13 common text generation tasks and their corresponding 83 datasets and further incorporates 45 PLMs covering general, translation, Chinese, dialogue, controllable, distilled, prompting, and lightweight PLMs.
Pn-summary is a dataset for Persian abstractive text summarization.
We present CSL, a large-scale Chinese Scientific Literature dataset, which contains the titles, abstracts, keywords and academic fields of 396,209 papers. To our knowledge, CSL is the first scientific document dataset in Chinese.
1 PAPER • NO BENCHMARKS YET
ComSum is a data set of 7 million commit messages for text summarization. When documenting commits, software code changes, both a message and its summary are posted. These messages are gathered and filtered to curate developers' work summarization data set.
Given an English article, generate a short summary in the target language.
There is currently much interest and activity aimed at building powerful multi-purpose information systems. The agencies involved include DARPA, ARDA and NIST. Their programmes, for example DARPA's TIDES (Translingual Information Detection Extraction and Summarization) programme, ARDA's Advanced Question & Answering Program and NIST's TREC (Text Retrieval Conferences) programme cover a range of subprogrammes. These focus on different tasks requiring their own evaluation designs.
Factual Inconsistency Benchmark (FIB) is a benchmark that focuses on the task of summarization. Specifically, the benchmark involves comparing the scores an LLM assigns to a factually consistent versus a factual inconsistent summary for an input news article. For factually consistent summaries, human-written reference summaries are used to manually verify as factually consistent.
The "Famous Keyword Twitter Replies Dataset" is a comprehensive collection of Twitter data that focuses on popular keywords and their associated replies. This dataset contains five essential columns that provide valuable insights into the Twitter conversation dynamics:
Mental health remains a significant challenge of public health worldwide. With increasing popularity of online platforms, many use the platforms to share their mental health conditions, express their feelings, and seek help from the community and counselors. While posts are of varying length, it is beneficial to provide a short, but informative summary for fast processing by the counselors. To facilitate research in summarization of mental health online posts, we introduce Mental Health Summarization dataset, MentSum, containing over 24k carefully selected user posts from Reddit, along with their short user-written summary (called TLDR) in English from 43 mental health subreddits.
1 PAPER • 1 BENCHMARK
MultiSum is a dataset for multimodal summarization (MSMO). It consists of 17 categories and 170 subcategories to encapsulate a diverse array of real-world scenarios. The dataset features:
PMC-SA (PMC Structured Abstracts) is a dataset of academic publications, used for the task of structured summarization.
This is a large-scale court judgment dataset, where each judgment is a summary of the case description with a patternized style. It contains 2,003,390 court judgment documents. The case description is used as the input, and the court judgment as the summary. The average lengths of the input documents and summaries are 595.15 words and 273.57 words respectively.
Robust Summarization Evaluation Benchmark is a large human evaluation dataset consisting of over 22k summary-level annotations over state-of-the-art systems on three datasets.
SubSumE Dataset This repository contains the SubSumE dataset for subjective document summarization. See the paper and the talk for details on dataset creation. Also check out our work SuDocu on example-based document summarization.
Wiki-en is an annotated English dataset for domain detection extracted from Wikipedia. It includes texts from 7 different domains: “Business and Commerce” (BUS), “Government and Politics” (GOV), “Physical and Mental Health” (HEA), “Law and Order” (LAW), “Lifestyle” (LIF), “Military” (MIL), and “General Purpose” (GEN).
Wiki-zh is an annotated Chinese dataset for domain detection extracted from Wikipedia. It includes texts from 7 different domains: “Business and Commerce” (BUS), “Government and Politics” (GOV), “Physical and Mental Health” (HEA), “Law and Order” (LAW), “Lifestyle” (LIF), “Military” (MIL), and “General Purpose” (GEN). It contains 26,280 documents split into training, validation and test.
Wikipedia Webpage 2M (WikiWeb2M) is a multimodal open source dataset consisting of over 2 million English Wikipedia articles. It is created by rescraping the ∼2M English articles in WIT. Each webpage sample includes the page URL and title, section titles, text, and indices, images and their captions.
Click to add a brief description of the dataset (Markdown and LaTeX enabled).
The public_meetings corpus contains meetings, made of pairs of automatic transcriptions from audio recordings and meeting reports written by a professional. 22 aligned meetings are provided in total.
In an active e-commerce environment, customers process a large number of reviews when deciding on whether to buy a product or not. Abstractive Multi-Review Summarization aims to assist users to efficiently consume the reviews that are the most relevant to them. We propose the first large-scale abstractive multi-review summarization dataset that leverages more than 17.9 billion raw reviews and uses novel aspect-alignment techniques based on aspect annotations. Furthermore, we demonstrate that one can generate higher-quality review summaries by using a novel aspect-alignment-based model. Results from both automatic and human evaluation show that the proposed dataset plus the innovative aspect-alignment model can generate high-quality and trustful review summaries.
0 PAPER • NO BENCHMARKS YET