WebBrain-Raw is a large-scale dataset built from English Wikipedia articles and their crawlable Wikipedia references. It comprises 153 zipped data chunks in which each line is a Wikipedia page with its reference articles.
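Since each WebBrain-Raw chunk is line-delimited, it can be streamed one record at a time rather than loaded whole. The sketch below assumes gzip compression and JSON lines with `title` and `references` fields; the actual compression format and field names are assumptions, not the published schema, and the file here is a tiny synthetic stand-in.

```python
import gzip
import json
import os
import tempfile

def iter_chunk(path):
    """Yield one (page title, reference articles) pair per line of a chunk.

    Assumes gzipped JSON-lines records; adjust to the dataset's real schema.
    """
    with gzip.open(path, "rt", encoding="utf-8") as f:
        for line in f:
            record = json.loads(line)
            yield record["title"], record["references"]

# Tiny synthetic chunk so the sketch runs end to end without the real data.
sample = {
    "title": "Example page",
    "content": "...",
    "references": ["reference article text"],
}
path = os.path.join(tempfile.mkdtemp(), "chunk_000.jsonl.gz")
with gzip.open(path, "wt", encoding="utf-8") as f:
    f.write(json.dumps(sample) + "\n")

pages = list(iter_chunk(path))
```

Streaming this way keeps memory flat even for the largest of the 153 chunks, since only one page-plus-references record is decoded at a time.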
Wikipedia Webpage 2M (WikiWeb2M) is a multimodal, open-source dataset consisting of over 2 million English Wikipedia articles. It was created by rescraping the ∼2M English articles in WIT. Each webpage sample includes the page URL and title, section titles, section text and indices, and images with their captions.
We provide a new dataset, XWikiRef, for the task of cross-lingual multi-document summarization. The task is to generate Wikipedia-style text in low-resource languages from reference text given as input. The dataset covers 8 languages: Bengali (bn), English (en), Hindi (hi), Marathi (mr), Malayalam (ml), Odia (or), Punjabi (pa), and Tamil (ta). It also spans 5 domains: books, films, politicians, sportsmen, and writers.
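Because XWikiRef crosses every language with every domain, it is natural to think of it as a grid of language/domain subsets. The short sketch below just enumerates that grid; the `"lang/domain"` naming is illustrative, not the dataset's actual directory layout.

```python
from itertools import product

# XWikiRef coverage grid: 8 languages crossed with 5 domains.
languages = ["bn", "en", "hi", "mr", "ml", "or", "pa", "ta"]
domains = ["books", "films", "politicians", "sportsmen", "writers"]

# One entry per (language, domain) subset; the path-style name is hypothetical.
subsets = [f"{lang}/{dom}" for lang, dom in product(languages, domains)]
```

Iterating the full grid yields 40 subsets, which is a convenient loop structure for per-language, per-domain training or evaluation runs.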
needadvice is a dataset for advice classification extracted from Reddit, in which posts are annotated for whether or not they contain advice. It contains 6,148 samples for training, 816 for validation, and 898 for testing.