Accurate Clinical and Biomedical Named Entity Recognition at Scale
We introduce an agile, production-grade clinical and biomedical named entity recognition (NER) algorithm based on a modified BiLSTM-CNN-Char deep learning architecture built on top of Apache Spark. Our NER implementation establishes new state-of-the-art accuracy on 7 of 8 well-known biomedical NER benchmarks and on 3 clinical concept extraction challenges: 2010 i2b2/VA clinical concept extraction, 2014 n2c2 de-identification, and 2018 n2c2 medication extraction. Moreover, clinical NER models trained using this implementation exceed the accuracy of two commercial entity extraction solutions, AWS Medical Comprehend and Google Cloud Healthcare API, by a large margin (8.9% and 6.7%, respectively), without using memory-intensive language models.
Results from the Paper
Task | Dataset | Model | Metric Name | Metric Value | Global Rank | Benchmark |
---|---|---|---|---|---|---|
Named Entity Recognition (NER) | AnatEM | BertForTokenClassification (Spark NLP) | F1 | 91.65 | # 5 | |
Named Entity Recognition (NER) | BC4CHEMD | BertForTokenClassification (Spark NLP) | F1 | 94.39 | # 5 | |
Named Entity Recognition (NER) | BC5CDR | BertForTokenClassification (Spark NLP) | F1 | 90.89 | # 5 | |
Named Entity Recognition (NER) | BioNLP13-CG | BertForTokenClassification (Spark NLP) | F1 | 87.83 | # 2 | |
Named Entity Recognition (NER) | Species800 | BertForTokenClassification (Spark NLP) | F1 | 82.59 | # 2 | |
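
The F1 values in the table are reported on a 0-100 scale. This section does not state the exact scoring protocol, so the following is only a minimal sketch of one common convention: strict entity-level micro-averaged F1 over BIO-tagged sequences, where a predicted entity counts as correct only if its type and both boundaries match the gold annotation. The function names `bio_to_spans` and `span_f1` and the example tags are illustrative assumptions, not part of the paper or of Spark NLP.

```python
from typing import List, Set, Tuple

# (entity type, start token index, end token index, end exclusive)
Span = Tuple[str, int, int]


def bio_to_spans(tags: List[str]) -> Set[Span]:
    """Convert one BIO tag sequence into a set of typed entity spans."""
    spans: Set[Span] = set()
    start, label = None, None
    for i, tag in enumerate(tags + ["O"]):  # sentinel "O" closes a trailing span
        # The open span ends at "O", at any "B-", or at an "I-" of another type.
        if tag == "O" or tag.startswith("B-") or tag[2:] != label:
            if label is not None:
                spans.add((label, start, i))
            start, label = None, None
        # A span opens at "B-", or (leniently) at an "I-" with no open span.
        if tag.startswith("B-") or (tag.startswith("I-") and label is None):
            start, label = i, tag[2:]
    return spans


def span_f1(gold: List[List[str]], pred: List[List[str]]) -> float:
    """Strict micro-averaged entity-level F1, returned on a 0-100 scale."""
    tp = fp = fn = 0
    for gold_tags, pred_tags in zip(gold, pred):
        gold_spans = bio_to_spans(gold_tags)
        pred_spans = bio_to_spans(pred_tags)
        tp += len(gold_spans & pred_spans)   # exact type and boundary match
        fp += len(pred_spans - gold_spans)   # predicted but not in gold
        fn += len(gold_spans - pred_spans)   # in gold but missed
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    if precision + recall == 0.0:
        return 0.0
    return 100.0 * 2.0 * precision * recall / (precision + recall)


# Illustrative tags only: one sentence with two gold entities, one of them missed.
gold = [["B-Chemical", "I-Chemical", "O", "B-Disease"]]
pred = [["B-Chemical", "I-Chemical", "O", "O"]]
score = span_f1(gold, pred)
```

Under this convention a partially overlapping prediction earns no credit, so scores computed with relaxed or token-level matching are not directly comparable to strict entity-level scores.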