The Microsoft Research Social Media Conversation Corpus consists of 127M context-message-response triples from the Twitter FireHose, covering the 3-month period from June 2012 through August 2012. Only triples in which the context and response were generated by the same user were extracted. To minimize noise, only triples containing at least one frequent bigram (one appearing more than 3 times in the corpus) were selected. This produced a corpus of 29M Twitter triples.
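A minimal sketch of the bigram-frequency filter described above, assuming whitespace tokenization; the function names and toy corpus are hypothetical, not the original extraction code:

```python
from collections import Counter

def bigrams(text):
    # Adjacent word pairs from a whitespace-tokenized string.
    tokens = text.split()
    return zip(tokens, tokens[1:])

def filter_triples(triples, min_count=3):
    # Keep only triples containing at least one bigram that occurs
    # more than `min_count` times across the whole corpus, mirroring
    # the noise filter described above.
    counts = Counter()
    for triple in triples:
        for text in triple:
            counts.update(bigrams(text))
    return [
        triple for triple in triples
        if any(counts[bg] > min_count
               for text in triple for bg in bigrams(text))
    ]

# Toy (context, message, response) triples; the real corpus held 127M.
corpus = [
    ("good morning", "good morning to you", "good morning again"),
    ("zx qy", "qq ww", "ee rr"),
    ("good morning all", "hi good morning", "good morning friend"),
    ("good morning folks", "morning", "yes good morning"),
]
print(len(filter_triples(corpus)))  # 3 -- the noisy second triple is dropped
```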
6 PAPERS • NO BENCHMARKS YET
CLUECorpus2020 is a large-scale corpus that can be used directly for self-supervised learning, such as pre-training a language model, or for language generation. It contains 100 GB of raw corpus text with 35 billion Chinese characters retrieved from Common Crawl.
5 PAPERS • NO BENCHMARKS YET
A new dataset in the baseball domain.
5 PAPERS • 1 BENCHMARK
PoMo consists of more than 231K sentences with post-modifiers and associated facts extracted from Wikidata for around 57K unique entities.
RiSAWOZ is a large-scale multi-domain Chinese Wizard-of-Oz dataset with Rich Semantic Annotations. It contains 11.2K human-to-human (H2H) multi-turn semantically annotated dialogues with more than 150K utterances spanning 12 domains, making it larger than all previous annotated H2H conversational datasets. Both single- and multi-domain dialogues are included, accounting for 65% and 35% of the corpus, respectively. Each dialogue is labelled with comprehensive annotations, including the dialogue goal in the form of a natural language description, the domain, and dialogue states and acts on both the user and system sides. In addition to these traditional dialogue annotations, it includes linguistic annotations of discourse phenomena in dialogue, e.g., ellipsis and coreference, which are useful for dialogue coreference and ellipsis resolution tasks.
Taiga is a corpus in which text sources and their meta-information are collected according to popular ML tasks.
CelebV-Text comprises 70,000 in-the-wild face video clips with diverse visual content, each paired with 20 texts generated using the proposed semi-automatic text generation strategy. The provided texts precisely describe both static and dynamic attributes.
4 PAPERS • NO BENCHMARKS YET
ChatHaruhi is a dataset covering 32 Chinese / English TV / anime characters with over 54k simulated dialogues.
The Russian Corpus of Linguistic Acceptability (RuCoLA) is built from the ground up under the well-established binary linguistic acceptability (LA) approach. RuCoLA consists of 9.8k in-domain sentences from linguistic publications and 3.6k out-of-domain sentences produced by generative models.
4 PAPERS • 1 BENCHMARK
STACKEX expands keyphrase generation tasks beyond the only previously existing genre (i.e., academic writing).
A multilingual image dataset with spatial relation annotations and object features for image-to-text generation, built using 2,026 images from the PASCAL VOC2008 dataset.
Wikipedia Generation is a dataset for generating Wikipedia articles from the references cited at the end of each Wikipedia page and from the top 10 search results for the article's topic.
A benchmark dataset of abstracts and titles from 100,000 arXiv scientific papers. The dataset contains 10 classes and is balanced (exactly 10,000 papers per class). The classes comprise subcategories of computer science, physics, and math.
DR.BENCH is a dataset for developing and evaluating clinical NLP (cNLP) models with clinical diagnostic reasoning ability. The suite includes six tasks from ten publicly available datasets addressing clinical text understanding, medical knowledge reasoning, and diagnosis generation.
3 PAPERS • NO BENCHMARKS YET
Goal is a novel dataset of football (or 'soccer') highlights videos with transcribed live commentaries in English. As the course of a game is unpredictable, so are commentaries, which makes them a unique resource to investigate dynamic language grounding.
A dataset of single-sentence edits crawled from Wikipedia.
CANNOT is a dataset that focuses on negated textual pairs. It currently contains 77,376 samples, roughly half of which are negated sentence pairs; the other half are not negated (they are paraphrased versions of each other).
2 PAPERS • NO BENCHMARKS YET
CELLS is the largest (63k pairs) and broadest-ranging (12 journals) parallel corpus for lay language generation. Each abstract and its corresponding lay language summary are written by domain experts, ensuring the quality of the dataset.
CLSE is an augmented version of the Schema-Guided Dialog Dataset. The corpus includes 34 languages and covers 74 different semantic types to support various applications from airline ticketing to video games.
CodeSyntax is a large-scale dataset of programs annotated with the syntactic relationships in their corresponding abstract syntax trees. It contains 18,701 code samples annotated with 1,342,050 relation edges in 43 relation types for Python, and 13,711 code samples annotated with 864,411 relation edges in 39 relation types for Java. It is designed to evaluate the performance of language models on code syntax understanding.
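To illustrate the kind of syntactic relation edges CodeSyntax annotates, here is a minimal sketch that extracts parent-field-child edges from a Python AST using the standard `ast` module; it is a simplification for illustration, not the dataset's own annotation tooling:

```python
import ast

def relation_edges(source):
    # Collect (parent_node, field, child_node) edges from the AST of
    # a Python snippet -- a simplified stand-in for the relation types
    # annotated in CodeSyntax.
    tree = ast.parse(source)
    edges = []
    for parent in ast.walk(tree):
        for field, value in ast.iter_fields(parent):
            children = value if isinstance(value, list) else [value]
            for child in children:
                if isinstance(child, ast.AST):
                    edges.append((type(parent).__name__, field,
                                  type(child).__name__))
    return edges

for edge in relation_edges("x = a + b"):
    print(edge)
# e.g. ('Assign', 'targets', 'Name'), ('BinOp', 'left', 'Name'), ...
```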
Concise comprises two datasets of 2,000 sentences each, annotated by two and five human annotators, respectively. They are designed for the new task of making sentences concise.
Czech restaurant information is a dataset for NLG in task-oriented spoken dialogue systems with Czech as the target language. It originated as a translation of the English San Francisco Restaurants dataset by Wen et al. (2015).
2 PAPERS • 1 BENCHMARK
DIALOCONAN is a dataset comprising over 3000 fictitious multi-turn dialogues between a hater and an NGO operator, covering 6 targets of hate.
The QTUNA dataset is the result of a series of elicitation experiments in which human speakers were asked to perform a linguistic task that invites the use of quantified expressions. It is intended to inform Natural Language Generation algorithms that mimic humans' use of quantified expressions.
The TaoDescribe dataset contains 2,129,187 product titles and descriptions in Chinese.
TextBox 2.0 is a comprehensive and unified library for text generation, focusing on the use of pre-trained language models (PLMs). The library covers 13 common text generation tasks and their corresponding 83 datasets, and further incorporates 45 PLMs, covering general, translation, Chinese, dialogue, controllable, distilled, prompting, and lightweight models.
WebLINX is a large-scale benchmark of 100K interactions across 2300 expert demonstrations of conversational web navigation. It covers a broad range of patterns on over 150 real-world websites and can be used to train and evaluate agents in diverse scenarios.
YTD-18M is a large-scale corpus of 18M video-based dialogues, constructed from web videos: crucial to the data collection pipeline is a pretrained language model that converts error-prone automatic transcripts to a cleaner dialogue format while maintaining meaning.
AAVE/SAE Paired Dataset contains 2,019 intent-equivalent AAVE/SAE pairs. The AAVE (African-American Vernacular English) samples are drawn from Blodgett et al. (2016)'s TwitterAAE, with their corresponding SAE (Standard American English) samples annotated via Amazon MTurk.
1 PAPER • NO BENCHMARKS YET
AskParents is a dataset for advice classification extracted from Reddit. In this dataset, posts are annotated for whether they contain advice or not. It contains 8,701 samples for training, 802 for validation and 1,091 for testing.
A syllogism is a common form of deductive reasoning that requires precisely two premises and one conclusion (e.g., "All men are mortal; Socrates is a man; therefore, Socrates is mortal"). The Avicenna corpus is a benchmark for syllogistic NLI and syllogistic NLG.
COLLIE-v1 is a dataset of 2,080 instances comprising 13 constraint structures designed for text generation under constraints. The underlying COLLIE framework is grammar-based, allowing the specification of rich, compositional constraints at diverse generation levels (word, sentence, paragraph, passage).
A large dataset of color names and their respective RGB values, stored in CSV format.
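A minimal sketch for loading such a file into a name-to-RGB lookup; the column names (`name`, `red`, `green`, `blue`) and the filename are assumptions about the CSV layout, not a documented schema:

```python
import csv

def load_colors(path):
    # Map each color name to an (R, G, B) tuple of 0-255 integers.
    colors = {}
    with open(path, newline="", encoding="utf-8") as f:
        for row in csv.DictReader(f):
            colors[row["name"]] = (int(row["red"]),
                                   int(row["green"]),
                                   int(row["blue"]))
    return colors

# colors = load_colors("colors.csv")   # hypothetical filename
# print(colors.get("cerulean"))        # e.g. (0, 123, 167)
```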
1 PAPER • 1 BENCHMARK
This dataset consists of 500 sets of captions, tables, and corresponding paper pages, processed from DocBank.
DpgMedia2019 is a Dutch news dataset for partisanship detection. It contains more than 100K articles that are labelled on the publisher level and 776 articles that were crowdsourced using an internal survey platform and labelled on the article level.
ENT-DESC is a dataset for entity description generation. It involves retrieving abundant knowledge of various types about main entities from a large knowledge graph (KG), a setting under which current graph-to-sequence models severely suffer from information loss and parameter explosion while generating descriptions.
ExHVV is a novel dataset that offers natural language explanations of connotative roles for three types of entities -- heroes, villains, and victims, encompassing 4,680 entities present in 3K memes.
Food.com Recipes and Interactions consists of 270K recipes and 1.4M user-recipe interactions (reviews) scraped from Food.com, covering a period of 18 years (January 2000 to December 2018).
Ice Hockey News Dataset is a corpus of Finnish ice hockey news, edited to be suitable for training end-to-end news generation methods; text generated from it was judged by journalists to be relatively close to a viable product.
Image Caption Quality Dataset is a dataset of crowdsourced ratings for machine-generated image captions. It contains more than 600k ratings of image-caption pairs.
The Lenta Short Sentences dataset is a text dataset for language modelling for the Russian language. It consists of 236K sentences sampled from the Lenta News dataset.
This is a dataset of 3 English books that do not contain the letter "e". It includes all of "Gadsby" by Ernest Vincent Wright, all of "A Void" by Georges Perec, and almost all of "Eunoia" by Christian Bök (excluding the single chapter that uses the letter "e").
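Since the defining property of these texts is the absence of a single letter, verifying it is straightforward; a minimal sketch:

```python
def is_lipogram(text, banned="e"):
    # True if the text avoids the banned letter entirely (case-insensitive).
    return banned.lower() not in text.lower()

print(is_lipogram("If youth, throughout all history, had had a champion"))
# True -- the opening words of "Gadsby" avoid the letter "e"
print(is_lipogram("A simple sentence"))  # False
```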
An open-source online generative dictionary that takes a word and a context containing that word as input, and automatically generates a definition as output. Incorporating state-of-the-art definition generation models, it supports not only Chinese and English but also Chinese-English cross-lingual queries. Moreover, its user-friendly front-end design helps users understand query words quickly and easily.
The Live Comment Dataset is a large-scale dataset with 2,361 videos and 895,929 live comments that were written while the videos were streamed.
The LongForm dataset is created by leveraging English corpus examples with augmented instructions. It contains a diverse set of human-written documents from existing corpora such as C4 and Wikipedia, with instructions generated for the given documents via LLMs. The examples generated from raw text corpora via LLMs include structured corpus examples as well as various NLP task examples such as email writing, grammar error correction, story/poem generation, and text summarization.
MTTN is a large-scale derived and synthesized dataset built on real prompts and indexed against popular image-text datasets such as MS-COCO and Flickr. MTTN consists of over 2.4M sentences divided across 5 stages, yielding combinations amounting to over 12M pairs, with a vocabulary of more than 300,000 unique words that creates an abundance of variation.
There are four dimensions in MBTI, and each dimension has two opposite attributes.
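The four dimensions, each a choice between two opposite attributes, combine into the familiar 16 MBTI types; a quick sketch of the combinatorics (the standard MBTI letter codes are assumed here, not taken from the dataset description):

```python
from itertools import product

# The four MBTI dimensions, each with two opposite attributes.
dimensions = [("E", "I"), ("S", "N"), ("T", "F"), ("J", "P")]

# Four binary choices yield 2**4 = 16 personality types.
types = ["".join(combo) for combo in product(*dimensions)]
print(len(types))   # 16
print(types[:4])    # ['ESTJ', 'ESTP', 'ESFJ', 'ESFP']
```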