WikiWeb2M (Wikipedia Webpage 2M)

Introduced by Burns et al. in WikiWeb2M: A Page-Level Multimodal Wikipedia Dataset

Wikipedia Webpage 2M (WikiWeb2M) is an open-source multimodal dataset of over 2 million English Wikipedia articles. It was created by rescraping the ∼2M English articles in WIT. Each webpage sample includes the page URL and title; the section titles, text, and indices; and the images with their captions.

Source: WikiWeb2M: A Page-Level Multimodal Wikipedia Dataset
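
To make the sample layout described above concrete, here is a minimal Python sketch of how one webpage sample could be modeled in memory. The class and field names (WebpageSample, Section, page_url, image_captions, and so on) are assumptions chosen for illustration, not the dataset's actual schema or feature names, and the example values are placeholders rather than real data.

```python
from dataclasses import dataclass, field
from typing import List, Tuple


@dataclass
class Section:
    """One section of a page. Field names are hypothetical, not the real schema."""
    index: int                      # position of the section within the page
    title: str                      # section title
    text: str                       # section body text
    image_urls: List[str] = field(default_factory=list)
    image_captions: List[str] = field(default_factory=list)


@dataclass
class WebpageSample:
    """One page-level sample: URL, title, and its ordered sections."""
    page_url: str
    page_title: str
    sections: List[Section] = field(default_factory=list)

    def images_with_captions(self) -> List[Tuple[str, str]]:
        """Collect (image URL, caption) pairs across all sections, in order."""
        pairs: List[Tuple[str, str]] = []
        for section in self.sections:
            pairs.extend(zip(section.image_urls, section.image_captions))
        return pairs


# Placeholder values only; not taken from the dataset.
sample = WebpageSample(
    page_url="https://en.wikipedia.org/wiki/Example",
    page_title="Example",
    sections=[
        Section(
            index=0,
            title="Introduction",
            text="Placeholder section text.",
            image_urls=["https://example.org/a.jpg"],
            image_captions=["A placeholder caption."],
        ),
        Section(index=1, title="History", text="More placeholder text."),
    ],
)
```

Grouping images under their section is only one possible design; it keeps each image next to the text it appears with, which is what a page-level view of an article suggests.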

License


  • Unknown