CoNLL 2003

Introduced by Tjong Kim Sang and De Meulder in Introduction to the CoNLL-2003 Shared Task: Language-Independent Named Entity Recognition

CoNLL-2003 is a named entity recognition dataset released as part of the CoNLL-2003 shared task on language-independent named entity recognition. The data consists of eight files covering two languages, English and German. For each language there is a training file, a development file, a test file and a large file of unannotated data.
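The description above does not spell out the file layout, so the following is only a minimal sketch of how the annotated files might be read. It assumes the commonly distributed column format: one token per line, whitespace-separated columns with the token first and the named-entity tag last, blank lines between sentences, and `-DOCSTART-` lines marking article boundaries. The `read_conll` helper and the `SAMPLE` text are illustrative, not part of the dataset or of any official tooling, and the tags in the sample are not claimed to match the released files exactly.

```python
from typing import Iterable, Iterator, List, Tuple

Sentence = List[Tuple[str, str]]  # (token, named-entity tag) pairs


def read_conll(lines: Iterable[str]) -> Iterator[Sentence]:
    """Yield sentences from CoNLL-style column text.

    Assumption: the token is the first column and the named-entity tag is
    the last one, so the same reader works whether a file has four or five
    columns.
    """
    sentence: Sentence = []
    for raw in lines:
        line = raw.strip()
        # Blank lines and document markers both close the current sentence.
        if not line or line.startswith("-DOCSTART-"):
            if sentence:
                yield sentence
                sentence = []
            continue
        columns = line.split()
        sentence.append((columns[0], columns[-1]))
    if sentence:  # flush the last sentence if the input lacks a trailing blank line
        yield sentence


# Illustrative sample only (hypothetical excerpt in the assumed layout).
SAMPLE = """\
-DOCSTART- -X- -X- O

EU NNP B-NP B-ORG
rejects VBZ B-VP O
German JJ B-NP B-MISC
call NN I-NP O

Peter NNP B-NP B-PER
Blackburn NNP I-NP I-PER
"""

sentences = list(read_conll(SAMPLE.splitlines()))
for sent in sentences:
    print(sent)
```

Taking the first and last columns, rather than fixed positions, is a deliberate choice here so that the sketch does not depend on how many intermediate annotation columns a given file carries.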

The English data was taken from the Reuters Corpus, which consists of Reuters news stories published between August 1996 and August 1997. For the training and development sets, ten days' worth of data were taken from the files representing the end of August 1996. For the test set, the texts were from December 1996. The preprocessed raw data covers the month of September 1996.

The text for the German data was taken from the ECI Multilingual Text Corpus, which consists of texts in many languages. The portion of data used for this task was extracted from the German newspaper Frankfurter Rundschau. The training, development and test sets were all taken from articles written in one week at the end of August 1992. The raw data were taken from the months of September to December 1992.

English data	Articles	Sentences	Tokens	LOC	MISC	ORG	PER
Training set	946	14,987	203,621	7,140	3,438	6,321	6,600
Development set	216	3,466	51,362	1,837	922	1,341	1,842
Test set	231	3,684	46,435	1,668	702	1,661	1,617

Number of articles, sentences, tokens and entities (locations, miscellaneous, organizations, and persons) in English data files.

German data	Articles	Sentences	Tokens	LOC	MISC	ORG	PER
Training set	553	12,705	206,931	4,363	2,288	2,427	2,773
Development set	201	3,068	51,444	1,181	1,010	1,241	1,401
Test set	155	3,160	51,943	1,035	670	773	1,195

Number of articles, sentences, tokens and entities (locations, miscellaneous, organizations, and persons) in German data files.
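As a quick check on the two tables above, the per-type entity counts can be summed to get the total number of annotated entities in each file, along with each type's share. The sketch below simply transcribes the table values; the `ENTITY_COUNTS` and `totals` names are illustrative.

```python
# Entity counts transcribed from the English and German tables above.
ENTITY_COUNTS = {
    ("English", "Training"): {"LOC": 7140, "MISC": 3438, "ORG": 6321, "PER": 6600},
    ("English", "Development"): {"LOC": 1837, "MISC": 922, "ORG": 1341, "PER": 1842},
    ("English", "Test"): {"LOC": 1668, "MISC": 702, "ORG": 1661, "PER": 1617},
    ("German", "Training"): {"LOC": 4363, "MISC": 2288, "ORG": 2427, "PER": 2773},
    ("German", "Development"): {"LOC": 1181, "MISC": 1010, "ORG": 1241, "PER": 1401},
    ("German", "Test"): {"LOC": 1035, "MISC": 670, "ORG": 773, "PER": 1195},
}

# Total annotated entities per (language, split).
totals = {key: sum(counts.values()) for key, counts in ENTITY_COUNTS.items()}

for (language, split), counts in ENTITY_COUNTS.items():
    total = totals[(language, split)]
    # Share of each entity type within this file, as a percentage.
    shares = ", ".join(f"{label} {n / total:.1%}" for label, n in counts.items())
    print(f"{language} {split}: {total} entities ({shares})")
```

Under these figures the English training file holds roughly twice as many annotated entities as the German one (23,499 versus 11,851), even though the two files are close in token count.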

Benchmarks

Task	Dataset Variant	Best Model
Token Classification	conll2003	microsoft-deberta-v3-large_ner_conll2003
Named Entity Recognition (NER)	CoNLL 2003 (English)	ACE + document-context
Named Entity Recognition (NER)	CoNLL 2003 (German)	ACE + document-context
Named Entity Recognition (NER)	CoNLL 2003 (German) Revised	FLERT XLM-R
Named Entity Recognition (NER)	CoNLL03	UniNER-7B
Cross-Lingual NER	CoNLL 2003	XLM-RoBERTa-large
Chunking	CoNLL 2003 (German)	ACE
Chunking	CoNLL 2003 (English)	ACE
Named Entity Recognition	CoNLL 2003 (English)	BERT-LARGE
Chunking	CoNLL 2003	Def2Vec
Low Resource Named Entity Recognition	CONLL 2003 German	Zero-Resource Transfer From CoNLL-2003 English dataset.
FG-1-PG-1	conll2003	CFNER
Named Entity Recognition	CoNLL 2003	gunghio/distilbert-base-multilingual-cased-finetuned-conll2003-ner
POS	CoNLL 2003	Def2Vec
NER	CoNLL 2003	Def2Vec
Weakly-Supervised Named Entity Recognition	CoNLL03	BOND