arXMLiv:08.2018

This is the second public release of the arXMLiv dataset generated by the KWARC research group. It contains 1,232,186 HTML5 scientific documents from the arXiv.org preprint archive, converted from their respective TeX sources. This is a 13% increase in available articles over the 08.2017 release.

The dataset is segmented into 3 subsets, each corresponding to a severity level reported by the LaTeXML software responsible for the HTML5 conversion.

Derivative word embeddings and a token model are available separately.
