WebBrain-Raw is a large-scale dataset built from English Wikipedia articles and their crawlable Wikipedia references. It comprises 153 zipped data chunks in which each line is a Wikipedia page with its reference articles.
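Since each WebBrain-Raw chunk is line-delimited, it can be streamed one record at a time rather than loaded whole. The sketch below assumes gzip compression and JSON lines with `title` and `references` fields; the actual compression format and field names are assumptions, not the published schema, and the file here is a tiny synthetic stand-in.

```python
import gzip
import json
import os
import tempfile

def iter_chunk(path):
    """Yield one (page title, reference articles) pair per line of a chunk.

    Assumes gzipped JSON-lines records; adjust to the dataset's real schema.
    """
    with gzip.open(path, "rt", encoding="utf-8") as f:
        for line in f:
            record = json.loads(line)
            yield record["title"], record["references"]

# Tiny synthetic chunk so the sketch runs end to end without the real data.
sample = {
    "title": "Example page",
    "content": "...",
    "references": ["reference article text"],
}
path = os.path.join(tempfile.mkdtemp(), "chunk_000.jsonl.gz")
with gzip.open(path, "wt", encoding="utf-8") as f:
    f.write(json.dumps(sample) + "\n")

pages = list(iter_chunk(path))
```

Streaming this way keeps memory flat even for the largest of the 153 chunks, since only one page-plus-references record is decoded at a time.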
Wikipedia Webpage 2M (WikiWeb2M) is a multimodal, open-source dataset consisting of over 2 million English Wikipedia articles. It was created by rescraping the ∼2M English articles in WIT. Each webpage sample includes the page URL and title, section titles, section text and indices, and images with their captions.
We provide a new dataset, XWikiRef, for the task of cross-lingual multi-document summarization. The task is to generate Wikipedia-style text in low-resource languages from reference text given as input. The dataset covers 8 languages: Bengali (bn), English (en), Hindi (hi), Marathi (mr), Malayalam (ml), Odia (or), Punjabi (pa), and Tamil (ta). It also spans 5 domains: books, films, politicians, sportsmen, and writers.
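Because XWikiRef crosses every language with every domain, it is natural to think of it as a grid of language/domain subsets. The short sketch below just enumerates that grid; the `"lang/domain"` naming is illustrative, not the dataset's actual directory layout.

```python
from itertools import product

# XWikiRef coverage grid: 8 languages crossed with 5 domains.
languages = ["bn", "en", "hi", "mr", "ml", "or", "pa", "ta"]
domains = ["books", "films", "politicians", "sportsmen", "writers"]

# One entry per (language, domain) subset; the path-style name is hypothetical.
subsets = [f"{lang}/{dom}" for lang, dom in product(languages, domains)]
```

Iterating the full grid yields 40 subsets, which is a convenient loop structure for per-language, per-domain training or evaluation runs.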
needadvice is a dataset for advice classification extracted from Reddit, in which posts are annotated for whether or not they contain advice. It contains 6,148 samples for training, 816 for validation, and 898 for testing.