LREC 2020

FlauBERT: Unsupervised Language Model Pre-training for French

LREC 2020 huggingface/transformers

Language models have become a key step toward achieving state-of-the-art results in many different Natural Language Processing (NLP) tasks.

LANGUAGE MODELLING | NATURAL LANGUAGE INFERENCE | TEXT CLASSIFICATION | WORD SENSE DISAMBIGUATION

NorNE: Annotating Named Entities for Norwegian

LREC 2020 juand-r/entity-recognition-datasets

This paper presents NorNE, a manually annotated corpus of named entities which extends the annotation of the existing Norwegian Dependency Treebank.

TableBank: A Benchmark Dataset for Table Detection and Recognition

LREC 2020 doc-analysis/TableBank

We present TableBank, a new image-based table detection and recognition dataset built with novel weak supervision from Word and LaTeX documents on the internet.

TABLE DETECTION

word2word: A Collection of Bilingual Lexicons for 3,564 Language Pairs

LREC 2020 Kyubyong/word2word

We wrap our dataset and model in an easy-to-use Python library, which supports downloading and retrieving top-k word translations in any of the supported language pairs as well as computing top-k word translations for custom parallel corpora.
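As a rough illustration of what "computing top-k word translations for custom parallel corpora" can mean, here is a minimal sketch. It is not the word2word library's API or its scoring method, neither of which is described in this listing. It assumes whitespace tokenization and ranks candidates by a Dice coefficient over sentence-level co-occurrence. The function name `top_k_translations` and the toy corpus are hypothetical.

```python
from collections import Counter, defaultdict


def top_k_translations(pairs, k=3):
    """Build a toy bilingual lexicon from aligned sentence pairs.

    pairs: iterable of (source_sentence, target_sentence) strings.
    Returns a dict mapping each source word to its k best target words.
    Assumption: scoring uses the Dice coefficient, chosen for simplicity.
    """
    cooc = defaultdict(Counter)  # source word -> co-occurring target word counts
    src_count = Counter()        # number of sentences containing each source word
    tgt_count = Counter()        # number of sentences containing each target word

    for src, tgt in pairs:
        # Count each word at most once per sentence.
        src_words = set(src.lower().split())
        tgt_words = set(tgt.lower().split())
        src_count.update(src_words)
        tgt_count.update(tgt_words)
        for s in src_words:
            cooc[s].update(tgt_words)

    lexicon = {}
    for s, counter in cooc.items():
        scored = [
            (2 * c / (src_count[s] + tgt_count[t]), t)
            for t, c in counter.items()
        ]
        # Highest score first; ties broken alphabetically for determinism.
        scored.sort(key=lambda item: (-item[0], item[1]))
        lexicon[s] = [t for _, t in scored[:k]]
    return lexicon


# Hypothetical four-sentence English-French corpus.
toy_pairs = [
    ("the cat sleeps", "le chat dort"),
    ("the dog sleeps", "le chien dort"),
    ("the cat eats", "le chat mange"),
    ("a dog eats", "un chien mange"),
]
toy_lexicon = top_k_translations(toy_pairs, k=2)
```

Normalizing by both word frequencies keeps very frequent target words such as "le" from dominating every entry, which raw co-occurrence counts would not prevent.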

CCNet: Extracting High Quality Monolingual Datasets from Web Crawl Data

LREC 2020 facebookresearch/cc_net

Pre-training text representations has led to significant improvements in many areas of natural language processing.

An Annotated Dataset of Coreference in English Literature

LREC 2020 dbamman/litbank

We present in this work a new dataset of coreference annotations for works of literature in English, covering 29,103 mentions in 210,532 tokens from 100 works of fiction.

COREFERENCE RESOLUTION

Common Voice: A Massively-Multilingual Speech Corpus

LREC 2020 facebookresearch/covost

To our knowledge this is the largest audio corpus in the public domain for speech recognition, both in terms of number of hours and number of languages.

LANGUAGE IDENTIFICATION | SPEECH RECOGNITION | TRANSFER LEARNING

CoVoST: A Diverse Multilingual Speech-To-Text Translation Corpus

LREC 2020 facebookresearch/covost

Spoken language translation has recently witnessed a resurgence in popularity, thanks to the development of end-to-end models and the creation of new corpora, such as Augmented LibriSpeech and MuST-C.

AraBERT: Transformer-based Model for Arabic Language Understanding

LREC 2020 aub-mind/araBERT

Recently, with the surge of transformer-based models, language-specific BERT-based models have proven to be very efficient at language understanding, provided they are pre-trained on a very large corpus.

NAMED ENTITY RECOGNITION | QUESTION ANSWERING | SENTIMENT ANALYSIS