Scientific statement classification dataset from arXMLiv 08.2018

Introduced by Ginev et al. in Scientific Statement Classification over arXiv.org

This resource contains 10.5 million paragraphs with associated statement labels, stored as one paragraph per file with one sentence per line. Each file is placed in a subdirectory named after its annotated class. The statements were extracted from author-annotated environments, where we selected only the first paragraph immediately following the heading. Headings include both structural sections (e.g. Introduction) and scholarly statement annotations (e.g. Definition, Proof, Remark).
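The layout described above (one subdirectory per class, one paragraph per file, one sentence per line) can be read with a few lines of Python. This is a minimal sketch, not an official loader: the function name iter_paragraphs and the root argument are hypothetical, and it assumes the archive has been unpacked locally and that paragraph files end in .txt, as in the source paths shown in the examples.

```python
from pathlib import Path
from typing import Iterator, List, Tuple


def iter_paragraphs(root: str) -> Iterator[Tuple[str, List[str]]]:
    """Yield (label, sentences) for every paragraph file under root/<label>/.

    Assumption: `root` is a local directory holding one subdirectory per
    annotated class (e.g. definition/, proof/), each containing .txt files.
    """
    for class_dir in sorted(Path(root).iterdir()):
        if not class_dir.is_dir():
            continue  # skip stray files next to the class directories
        for path in sorted(class_dir.glob("*.txt")):
            text = path.read_text(encoding="utf-8")
            # One sentence per line; drop empty lines.
            sentences = [line for line in text.splitlines() if line.strip()]
            # The subdirectory name is the statement label.
            yield class_dir.name, sentences
```

Because the function is a generator, it reads one file at a time, which matters for a corpus of this size; the sorting only makes the iteration order reproducible and can be dropped if order is irrelevant.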

The annotated statement dataset is derived from arXMLiv, a machine-readable HTML5 representation of the arXiv corpus of scientific articles.

Examples

Definition with math lexemes (main data, single sentence, linebreaks for readability):

a directed quantum turing automaton is a quadruple
  italic_T RELOP_equals OPEN_( caligraphic_H PUNCT_, caligraphic_K PUNCT_, caligraphic_L PUNCT_, italic_tau CLOSE_) PUNCT_,
where
  caligraphic_H caligraphic_K and caligraphic_L
are finite dimensional hilbert spaces over the complex field blackboard_C and
  italic_tau METARELOP_colon caligraphic_H MULOP_tensor_product caligraphic_K ARROW_rightarrow
    caligraphic_H MULOP_tensor_product caligraphic_L
is an isometry in fdhilb

source: definition/1e4a1aea317bbf363c5314fb25eaf72c8a350a1007bb8aafc542e188405b93d5.txt

Same definition without math lexemes (nomath data, single sentence, linebreaks for readability):

a directed quantum turing automaton is a quadruple
  where and are finite dimensional hilbert spaces over the complex field and
  is an isometry in fdhilb

nomath source: definition/35b170bae4259a5c430846116142d4e4a45097e52daf818b78ea378d94d14a21.txt
