Character-Based Models for Adversarial Phone Extraction: Preventing Human Sex Trafficking
Illicit activity on the Web often uses noisy text to obscure information exchanged between client and seller, such as the seller's phone number. This presents an interesting challenge to language understanding systems: how do we model adversarial noise in a text extraction system? This paper addresses the sex trafficking domain and proposes some of the first neural network architectures to learn and extract phone numbers from noisy text. We create a new adversarial advertisement dataset, propose several RNN-based models to solve the problem, and, most notably, propose a visual character language model to interpret unseen Unicode characters. We train a CRF jointly with a CNN to improve number recognition by 89% over a CRF alone. Through data augmentation in this unique model, we present the first results on characters never seen in training.
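To make the extraction problem concrete, the following is a minimal rule-based sketch in Python. It is not one of the paper's models; it is a naive baseline written only for illustration, and the function name `naive_extract_digits`, the `WORD_DIGITS` table, and the sample strings are all assumptions. It folds Unicode compatibility characters with NFKC normalization and maps spelled-out digits to numerals.

```python
import re
import unicodedata

# Hypothetical lookup table for spelled-out digits (illustrative, not from the paper).
WORD_DIGITS = {
    "zero": "0", "one": "1", "two": "2", "three": "3", "four": "4",
    "five": "5", "six": "6", "seven": "7", "eight": "8", "nine": "9",
}


def naive_extract_digits(text: str) -> str:
    """Return the digits recoverable from `text` with simple rules only."""
    # NFKC folds compatibility characters such as fullwidth or circled
    # digits to their ASCII equivalents.
    normalized = unicodedata.normalize("NFKC", text).lower()
    # Split into runs of ASCII letters and single ASCII digits.
    tokens = re.findall(r"[a-z]+|[0-9]", normalized)
    digits = []
    for token in tokens:
        if token in WORD_DIGITS:
            digits.append(WORD_DIGITS[token])
        elif token.isdigit():
            digits.append(token)
        # Any other token is dropped: the rules cannot interpret it.
    return "".join(digits)


print(naive_extract_digits("５55 one2three ④five six7"))  # → 5551234567
```

Rules of this kind break as soon as the writer uses a substitution the table does not anticipate. For example, a lowercase `l` standing in for `1` is silently dropped, and a look-alike character with no compatibility mapping survives normalization unchanged. That gap is the kind of unseen-character case the abstract's visual character model is aimed at.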