CoNLL-2003

CoNLL-2003 is a named entity recognition dataset released as part of the CoNLL-2003 shared task on language-independent named entity recognition. The data consists of eight files covering two languages, English and German. For each language there is a training file, a development file, a test file, and a large file of unannotated data.
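As a rough illustration of how the annotated files might be consumed, the sketch below reads token and entity-tag pairs. It assumes the commonly distributed column layout, which this page does not describe: one token per line, whitespace-separated columns with the token first and the entity tag last, blank lines between sentences, and `-DOCSTART-` lines marking article boundaries. The function name `read_conll` and the sample lines are hypothetical and are not taken from the dataset.

```python
from typing import Iterable, Iterator, List, Tuple

Sentence = List[Tuple[str, str]]  # (token, entity tag) pairs


def read_conll(lines: Iterable[str]) -> Iterator[Sentence]:
    """Yield sentences as lists of (token, tag) pairs.

    Assumed layout: token in the first column, entity tag in the last,
    blank lines between sentences, -DOCSTART- lines between articles.
    """
    sentence: Sentence = []
    for raw in lines:
        line = raw.strip()
        if not line or line.startswith("-DOCSTART-"):
            # A blank line or article marker closes the current sentence.
            if sentence:
                yield sentence
                sentence = []
            continue
        columns = line.split()
        sentence.append((columns[0], columns[-1]))
    if sentence:
        # Flush the last sentence if the input lacks a trailing blank line.
        yield sentence


# Invented sample in the assumed layout (not real dataset content).
sample = """-DOCSTART- -X- -X- O

Alice NNP I-NP I-PER
visited VBD I-VP O
Paris NNP I-NP I-LOC

Bob NNP I-NP I-PER
stayed VBD I-VP O
""".splitlines()

sentences = list(read_conll(sample))
for sent in sentences:
    print(sent)
```

Taking only the first and last columns keeps the sketch indifferent to how many intermediate annotation columns a file carries. To read a real file, pass an open file handle in place of `sample`.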

The English data was taken from the Reuters Corpus, which consists of Reuters news stories published between August 1996 and August 1997. The training and development sets use ten days' worth of data from the files representing the end of August 1996. The test set texts are from December 1996. The preprocessed raw data covers the month of September 1996.

The text for the German data was taken from the ECI Multilingual Text Corpus, which consists of texts in many languages. The portion used for this task was extracted from the German newspaper Frankfurter Rundschau. The training, development, and test sets were all taken from articles written in one week at the end of August 1992. The raw data was taken from the months of September to December 1992.

English data     Articles  Sentences   Tokens    LOC   MISC    ORG    PER
Training set          946     14,987  203,621  7,140  3,438  6,321  6,600
Development set       216      3,466   51,362  1,837    922  1,341  1,842
Test set              231      3,684   46,435  1,668    702  1,661  1,617

Number of articles, sentences, tokens and entities (locations, miscellaneous, organizations, and persons) in English data files.

German data      Articles  Sentences   Tokens    LOC   MISC    ORG    PER
Training set          553     12,705  206,931  4,363  2,288  2,427  2,773
Development set       201      3,068   51,444  1,181  1,010  1,241  1,401
Test set              155      3,160   51,943  1,035    670    773  1,195

Number of articles, sentences, tokens and entities (locations, miscellaneous, organizations, and persons) in German data files.
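The figures in the two tables can be combined into a few derived statistics, such as the total number of annotated entities per split and the average sentence length. The sketch below copies the counts from the tables; the names `STATS`, `total_entities`, and `tokens_per_sentence` are illustrative choices and not part of any official tooling.

```python
# Counts copied from the tables above.
# Each entry: (sentences, tokens, LOC, MISC, ORG, PER)
STATS = {
    ("English", "train"): (14987, 203621, 7140, 3438, 6321, 6600),
    ("English", "dev"): (3466, 51362, 1837, 922, 1341, 1842),
    ("English", "test"): (3684, 46435, 1668, 702, 1661, 1617),
    ("German", "train"): (12705, 206931, 4363, 2288, 2427, 2773),
    ("German", "dev"): (3068, 51444, 1181, 1010, 1241, 1401),
    ("German", "test"): (3160, 51943, 1035, 670, 773, 1195),
}


def total_entities(language: str, split: str) -> int:
    """Sum the LOC, MISC, ORG and PER counts for one data file."""
    return sum(STATS[(language, split)][2:])


def tokens_per_sentence(language: str, split: str) -> float:
    """Average number of tokens per sentence for one data file."""
    sentences, tokens = STATS[(language, split)][:2]
    return tokens / sentences


for language, split in STATS:
    print(
        f"{language:8s}{split:6s}"
        f"entities={total_entities(language, split):6d}  "
        f"tokens/sentence={tokens_per_sentence(language, split):.1f}"
    )
```

By this tally, the English training file contains roughly twice as many annotated entities as the German one (23,499 versus 11,851), even though the two files hold a similar number of tokens.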

License

  • Unknown
