Search Results for author: Adrien Barbaresi

Found 13 papers, 4 papers with code

Trafilatura: A Web Scraping Library and Command-Line Tool for Text Discovery and Extraction

1 code implementation • ACL 2021 • Adrien Barbaresi

The tool performs significantly better than other open-source solutions in this evaluation and in external benchmarks.



Bien choisir son outil d'extraction de contenu à partir du Web (Choosing the appropriate tool for Web Content Extraction)

no code implementations • JEPTALNRECITAL 2020 • Gaël Lejeune, Adrien Barbaresi

We present a demonstration of textual content extraction from web pages and of its evaluation.


Que recèlent les données textuelles issues du web ? (What do text data from the Web have to hide?)

no code implementations • JEPTALNRECITAL 2020 • Adrien Barbaresi, Gaël Lejeune

The opportunistic collection and use of textual data taken from the web are subject to a series of ethical, methodological, and epistemological problems that deserve the attention of the scientific community.


Out-of-the-Box and into the Ditch? Multilingual Evaluation of Generic Text Extraction Tools

no code implementations • LREC 2020 • Adrien Barbaresi, Gaël Lejeune

This article examines extraction methods designed to retain the main text content of web pages and discusses how the extraction could be oriented and evaluated: can and should it be as generic as possible to ensure opportunistic corpus construction?


Computationally efficient discrimination between language varieties with large feature vectors and regularized classifiers

no code implementations • COLING 2018 • Adrien Barbaresi

The present contribution revolves around efficient approaches to language classification which have been field-tested in the VarDial evaluation campaign.

General Classification • Language Identification +1


A database of German definitory contexts from selected web sources

no code implementations • LREC 2018 • Adrien Barbaresi, Lothar Lemnitzer, Alexander Geyken


A corpus of German political speeches from the 21st century

no code implementations • LREC 2018 • Adrien Barbaresi

Keyword Extraction • Machine Translation


Discriminating between Similar Languages using Weighted Subword Features

1 code implementation • WS 2017 • Adrien Barbaresi

The present contribution revolves around a contrastive subword n-gram model which has been tested in the Discriminating between Similar Languages shared task.

Language Identification • Text Categorization


An Unsupervised Morphological Criterion for Discriminating Similar Languages

no code implementations • WS 2016 • Adrien Barbaresi

In this study conducted on the occasion of the Discriminating between Similar Languages shared task, I introduce an additional decision factor focusing on the token and subtoken level.

Language Identification • Text Categorization