This is the second public release of the arXMLiv dataset generated by the KWARC research group. It contains 1,232,186 HTML5 scientific documents from the arXiv.org preprint archive, converted from their respective TeX sources. This is a 13% increase in available articles over the 08.2017 release.
The dataset is split into three subsets, each corresponding to a severity level reported by LaTeXML, the software responsible for the HTML5 conversion.
Derivative word embeddings and a token model are available separately.