The Microsoft Research Social Media Conversation Corpus consists of 127M context-message-response triples from the Twitter FireHose, covering the 3-month period from June 2012 through August 2012. Only triples in which the context and response were generated by the same user were extracted. To minimize noise, only triples containing at least one frequent bigram (one appearing more than 3 times in the corpus) were selected. This produced a corpus of 29M Twitter triples.
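A minimal sketch of the bigram-frequency filter described above, assuming whitespace tokenization; the function names and toy corpus are hypothetical, not the original extraction code:

```python
from collections import Counter

def bigrams(text):
    # Adjacent word pairs from a whitespace-tokenized string.
    tokens = text.split()
    return zip(tokens, tokens[1:])

def filter_triples(triples, min_count=3):
    # Keep only triples containing at least one bigram that occurs
    # more than `min_count` times across the whole corpus, mirroring
    # the noise filter described above.
    counts = Counter()
    for triple in triples:
        for text in triple:
            counts.update(bigrams(text))
    return [
        triple for triple in triples
        if any(counts[bg] > min_count
               for text in triple for bg in bigrams(text))
    ]

# Toy (context, message, response) triples; the real corpus held 127M.
corpus = [
    ("good morning", "good morning to you", "good morning again"),
    ("zx qy", "qq ww", "ee rr"),
    ("good morning all", "hi good morning", "good morning friend"),
    ("good morning folks", "morning", "yes good morning"),
]
print(len(filter_triples(corpus)))  # 3 -- the noisy second triple is dropped
```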
6 PAPERS • NO BENCHMARKS YET
CLUECorpus2020 is a large-scale corpus that can be used directly for self-supervised learning, such as pre-training a language model, or for language generation. It contains 100 GB of raw corpus text with 35 billion Chinese characters retrieved from Common Crawl.
5 PAPERS • NO BENCHMARKS YET
A new dataset in the baseball domain.
5 PAPERS • 1 BENCHMARK
PoMo consists of more than 231K sentences with post-modifiers and associated facts extracted from Wikidata for around 57K unique entities.
RiSAWOZ is a large-scale multi-domain Chinese Wizard-of-Oz dataset with Rich Semantic Annotations. It contains 11.2K human-to-human (H2H) multi-turn semantically annotated dialogues with more than 150K utterances spanning 12 domains, making it larger than all previous annotated H2H conversational datasets. Both single- and multi-domain dialogues are included, accounting for 65% and 35% of the corpus, respectively. Each dialogue is labelled with comprehensive annotations, including the dialogue goal in the form of a natural language description, the domain, and dialogue states and acts on both the user and system sides. In addition to these traditional dialogue annotations, it includes linguistic annotations of discourse phenomena in dialogue, e.g., ellipsis and coreference, which are useful for dialogue coreference and ellipsis resolution tasks.
Taiga is a corpus in which text sources and their meta-information are collected according to popular ML tasks.
CelebV-Text comprises 70,000 in-the-wild face video clips with diverse visual content, each paired with 20 texts generated using the proposed semi-automatic text generation strategy. The provided texts precisely describe both static and dynamic attributes.
4 PAPERS • NO BENCHMARKS YET
ChatHaruhi is a dataset covering 32 Chinese / English TV / anime characters with over 54k simulated dialogues.
The Russian Corpus of Linguistic Acceptability (RuCoLA) is built from the ground up under the well-established binary linguistic acceptability (LA) approach. RuCoLA consists of 9.8k in-domain sentences from linguistic publications and 3.6k out-of-domain sentences produced by generative models.
4 PAPERS • 1 BENCHMARK
STACKEX expands keyphrase generation tasks beyond the only previously existing genre (i.e., academic writing).
A multilingual image dataset with spatial relation annotations and object features for image-to-text generation, built using 2,026 images from the PASCAL VOC2008 dataset.
Wikipedia Generation is a dataset for generating Wikipedia articles from the references cited at the end of each Wikipedia page and from the top 10 search results for the article's topic.
A benchmark dataset of abstracts and titles from 100,000 arXiv scientific papers. The dataset contains 10 classes and is balanced (exactly 10,000 papers per class). The classes comprise subcategories of computer science, physics, and math.
DR.BENCH is a dataset for developing and evaluating clinical NLP (cNLP) models with clinical diagnostic reasoning ability. The suite includes six tasks from ten publicly available datasets addressing clinical text understanding, medical knowledge reasoning, and diagnosis generation.
3 PAPERS • NO BENCHMARKS YET
Goal is a novel dataset of football (or 'soccer') highlights videos with transcribed live commentaries in English. As the course of a game is unpredictable, so are commentaries, which makes them a unique resource to investigate dynamic language grounding.
A dataset of single-sentence edits crawled from Wikipedia.
CANNOT is a dataset that focuses on negated textual pairs. It currently contains 77,376 samples, roughly half of which are negated sentence pairs; the other half are not negated (they are paraphrased versions of each other).
2 PAPERS • NO BENCHMARKS YET
CELLS is the largest (63k pairs) and broadest-ranging (12 journals) parallel corpus for lay language generation. Each abstract and its corresponding lay language summary are written by domain experts, ensuring the quality of the dataset.
CLSE is an augmented version of the Schema-Guided Dialog Dataset. The corpus includes 34 languages and covers 74 different semantic types to support various applications from airline ticketing to video games.
CodeSyntax is a large-scale dataset of programs annotated with the syntactic relationships in their corresponding abstract syntax trees. It contains 18,701 code samples annotated with 1,342,050 relation edges in 43 relation types for Python, and 13,711 code samples annotated with 864,411 relation edges in 39 relation types for Java. It is designed to evaluate the performance of language models on code syntax understanding.
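To illustrate the kind of syntactic relation edges CodeSyntax annotates, here is a minimal sketch that extracts parent-field-child edges from a Python AST using the standard `ast` module; it is a simplification for illustration, not the dataset's own annotation tooling:

```python
import ast

def relation_edges(source):
    # Collect (parent_node, field, child_node) edges from the AST of
    # a Python snippet -- a simplified stand-in for the relation types
    # annotated in CodeSyntax.
    tree = ast.parse(source)
    edges = []
    for parent in ast.walk(tree):
        for field, value in ast.iter_fields(parent):
            children = value if isinstance(value, list) else [value]
            for child in children:
                if isinstance(child, ast.AST):
                    edges.append((type(parent).__name__, field,
                                  type(child).__name__))
    return edges

for edge in relation_edges("x = a + b"):
    print(edge)
# e.g. ('Assign', 'targets', 'Name'), ('BinOp', 'left', 'Name'), ...
```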
Concise comprises two datasets of 2,000 sentences each, annotated by two and five human annotators, respectively. They are designed for the new task of making sentences concise.
Czech restaurant information is a dataset for NLG in task-oriented spoken dialogue systems with Czech as the target language. It originated as a translation of the English San Francisco Restaurants dataset by Wen et al. (2015).
2 PAPERS • 1 BENCHMARK
DIALOCONAN is a dataset comprising over 3000 fictitious multi-turn dialogues between a hater and an NGO operator, covering 6 targets of hate.
The QTUNA dataset is the result of a series of elicitation experiments in which human speakers were asked to perform a linguistic task that invites the use of quantified expressions. It is intended to inform Natural Language Generation algorithms that mimic humans' use of quantified expressions.
The TaoDescribe dataset contains 2,129,187 product titles and descriptions in Chinese.
TextBox 2.0 is a comprehensive and unified library for text generation, focusing on the use of pre-trained language models (PLMs). The library covers 13 common text generation tasks and their corresponding 83 datasets, and further incorporates 45 PLMs, covering general, translation, Chinese, dialogue, controllable, distilled, prompting, and lightweight models.
WebLINX is a large-scale benchmark of 100K interactions across 2300 expert demonstrations of conversational web navigation. It covers a broad range of patterns on over 150 real-world websites and can be used to train and evaluate agents in diverse scenarios.
YTD-18M is a large-scale corpus of 18M video-based dialogues, constructed from web videos: crucial to the data collection pipeline is a pretrained language model that converts error-prone automatic transcripts to a cleaner dialogue format while maintaining meaning.
AAVE/SAE Paired Dataset contains 2,019 intent-equivalent AAVE/SAE pairs. The AAVE (African-American Vernacular English) samples are drawn from Blodgett et al. (2016)'s TwitterAAE, with their corresponding SAE (Standard American English) samples annotated via Amazon MTurk.
1 PAPER • NO BENCHMARKS YET
AskParents is a dataset for advice classification extracted from Reddit. In this dataset, posts are annotated for whether they contain advice or not. It contains 8,701 samples for training, 802 for validation and 1,091 for testing.
A syllogism is a common form of deductive reasoning that requires precisely two premises and one conclusion (e.g., "All men are mortal; Socrates is a man; therefore, Socrates is mortal"). The Avicenna corpus is a benchmark for syllogistic NLI and syllogistic NLG.
COLLIE-v1 is a dataset of 2,080 instances comprising 13 constraint structures designed for text generation under constraints. The underlying COLLIE framework is grammar-based, allowing the specification of rich, compositional constraints at diverse generation levels (word, sentence, paragraph, passage).
A large dataset of color names and their respective RGB values, stored in CSV format.
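A minimal sketch for loading such a file into a name-to-RGB lookup; the column names (`name`, `red`, `green`, `blue`) and the filename are assumptions about the CSV layout, not a documented schema:

```python
import csv

def load_colors(path):
    # Map each color name to an (R, G, B) tuple of 0-255 integers.
    colors = {}
    with open(path, newline="", encoding="utf-8") as f:
        for row in csv.DictReader(f):
            colors[row["name"]] = (int(row["red"]),
                                   int(row["green"]),
                                   int(row["blue"]))
    return colors

# colors = load_colors("colors.csv")   # hypothetical filename
# print(colors.get("cerulean"))        # e.g. (0, 123, 167)
```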
1 PAPER • 1 BENCHMARK
This dataset consists of 500 sets of captions, tables, and corresponding paper pages, processed from DocBank.
DpgMedia2019 is a Dutch news dataset for partisanship detection. It contains more than 100K articles that are labelled on the publisher level and 776 articles that were crowdsourced using an internal survey platform and labelled on the article level.
ENT-DESC is a dataset for entity description generation. It involves retrieving abundant knowledge of various types about main entities from a large knowledge graph (KG), a setting under which current graph-to-sequence models severely suffer from information loss and parameter explosion while generating descriptions.
ExHVV is a novel dataset that offers natural language explanations of connotative roles for three types of entities -- heroes, villains, and victims, encompassing 4,680 entities present in 3K memes.
Food.com Recipes and Interactions consists of 270K recipes and 1.4M user-recipe interactions (reviews) scraped from Food.com, covering a period of 18 years (January 2000 to December 2018).
Ice Hockey News Dataset is a corpus of Finnish ice hockey news, edited to be suitable for training end-to-end news generation methods; text generated from it was judged by journalists to be relatively close to a viable product.
Image Caption Quality Dataset is a dataset of crowdsourced ratings for machine-generated image captions. It contains more than 600k ratings of image-caption pairs.
The Lenta Short Sentences dataset is a text dataset for language modelling for the Russian language. It consists of 236K sentences sampled from the Lenta News dataset.
This is a dataset of 3 English books that do not contain the letter "e". It includes all of "Gadsby" by Ernest Vincent Wright, all of "A Void" by Georges Perec, and almost all of "Eunoia" by Christian Bök (excluding the single chapter that uses the letter "e").
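Since the defining property of these texts is the absence of a single letter, verifying it is straightforward; a minimal sketch:

```python
def is_lipogram(text, banned="e"):
    # True if the text avoids the banned letter entirely (case-insensitive).
    return banned.lower() not in text.lower()

print(is_lipogram("If youth, throughout all history, had had a champion"))
# True -- the opening words of "Gadsby" avoid the letter "e"
print(is_lipogram("A simple sentence"))  # False
```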
An open-source online generative dictionary that takes a word and a context containing that word as input, and automatically generates a definition as output. Incorporating state-of-the-art definition generation models, it supports not only Chinese and English but also Chinese-English cross-lingual queries. Moreover, its user-friendly front-end design helps users understand query words quickly and easily.
The Live Comment Dataset is a large-scale dataset with 2,361 videos and 895,929 live comments that were written while the videos were streamed.
The LongForm dataset is created by leveraging English corpus examples with augmented instructions. It contains a diverse set of human-written documents from existing corpora such as C4 and Wikipedia, with instructions generated for the given documents via LLMs. The examples generated from raw text corpora via LLMs include structured corpus examples as well as various NLP task examples such as email writing, grammar error correction, story/poem generation, and text summarization.
MTTN is a large-scale derived and synthesized dataset built on real prompts and indexed against popular image-text datasets such as MS-COCO and Flickr. MTTN consists of over 2.4M sentences divided across 5 stages, yielding combinations amounting to over 12M pairs, with a vocabulary of more than 300,000 unique words that creates an abundance of variation.
There are four dimensions in MBTI, and each dimension has two opposite attributes.
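The four dimensions, each a choice between two opposite attributes, combine into the familiar 16 MBTI types; a quick sketch of the combinatorics (the standard MBTI letter codes are assumed here, not taken from the dataset description):

```python
from itertools import product

# The four MBTI dimensions, each with two opposite attributes.
dimensions = [("E", "I"), ("S", "N"), ("T", "F"), ("J", "P")]

# Four binary choices yield 2**4 = 16 personality types.
types = ["".join(combo) for combo in product(*dimensions)]
print(len(types))   # 16
print(types[:4])    # ['ESTJ', 'ESTP', 'ESFJ', 'ESFP']
```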