3 dataset results for Sentence segmentation

The CoNLL dataset is a widely used resource in natural language processing (NLP). “CoNLL” stands for Conference on Computational Natural Language Learning; the data originates from the series of shared tasks organized at those conferences.

177 PAPERS • 49 BENCHMARKS

CoNLL 2017 Shared Task - Automatically Annotated Raw Texts and Word Embeddings

Automatic segmentation, tokenization, and morphological and syntactic annotation of raw texts in 45 languages, generated by UDPipe (http://ufal.mff.cuni.cz/udpipe), together with 100-dimensional word embeddings computed from the lowercased texts by word2vec (https://code.google.com/archive/p/word2vec/).
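As a rough illustration of how the sentence segmentation in such annotated texts can be read back, the sketch below assumes the annotations are stored in the tab-separated CoNLL-U format that UDPipe writes, where a blank line ends each sentence. The function name `read_conllu_sentences` and the two-sentence `sample` are hypothetical and are not taken from the dataset.

```python
from typing import Iterator, List


def read_conllu_sentences(text: str) -> Iterator[List[str]]:
    """Yield each sentence as a list of word forms from CoNLL-U text.

    Assumes the standard CoNLL-U layout: one token per line, ten
    tab-separated columns, '#' comment lines, blank line between sentences.
    """
    words: List[str] = []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            # A blank line closes the current sentence.
            if words:
                yield words
                words = []
            continue
        if line.startswith("#"):
            # Sentence-level comments such as "# text = ..." are skipped.
            continue
        cols = line.split("\t")
        token_id = cols[0]
        # Skip multiword-token ranges (e.g. "1-2") and empty nodes (e.g. "1.1").
        if "-" in token_id or "." in token_id:
            continue
        words.append(cols[1])  # Column 2 holds the word form.
    if words:
        # The last sentence may lack a trailing blank line.
        yield words


# Hypothetical two-sentence sample, not taken from the dataset.
sample = (
    "# text = Hello world.\n"
    "1\tHello\thello\tINTJ\t_\t_\t0\troot\t_\t_\n"
    "2\tworld\tworld\tNOUN\t_\t_\t1\tvocative\t_\t_\n"
    "3\t.\t.\tPUNCT\t_\t_\t1\tpunct\t_\t_\n"
    "\n"
    "# text = Bye.\n"
    "1\tBye\tbye\tINTJ\t_\t_\t0\troot\t_\t_\n"
    "2\t.\t.\tPUNCT\t_\t_\t1\tpunct\t_\t_\n"
)

sentences = list(read_conllu_sentences(sample))
print(len(sentences))
```

Reading sentence boundaries from blank lines keeps the sketch independent of any parsing library; a dedicated CoNLL-U reader would be the safer choice for real files.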

1 PAPER • NO BENCHMARKS YET

MASC (Manually Annotated Sub-Corpus)

The Manually Annotated Sub-Corpus (MASC) consists of approximately 500,000 words of contemporary American English written and spoken data drawn from the Open American National Corpus (OANC).

1 PAPER • NO BENCHMARKS YET
