Inferring the source of official texts: can SVM beat ULMFiT?
Official Gazettes are a rich source of information relevant to the public. Careful examination of their contents may uncover fraud and irregularities, helping to prevent the mismanagement of public funds. This paper presents a dataset of documents from the Official Gazette of the Federal District, containing both samples annotated with their document source and unlabeled ones. We train, evaluate, and compare a transfer-learning model based on ULMFiT against traditional bag-of-words models that use SVM and Naive Bayes classifiers. We find the SVM to be competitive: its performance is only marginally worse than ULMFiT's, while its training and inference times are much faster and it is far less computationally expensive. Finally, we conduct an ablation analysis to assess the performance impact of ULMFiT's components.
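The bag-of-words baseline compared against ULMFiT can be sketched with scikit-learn as tf-idf features fed to a linear SVM, matching the "SVM + tf-idf" configuration in the results table below. This is a minimal illustration, not the paper's implementation: the toy documents, labels, and default hyperparameters here are assumptions, and the actual DODF corpus is not reproduced.

```python
# Hedged sketch of an SVM + tf-idf text classifier (assumed setup, not the
# paper's exact pipeline or hyperparameters).
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import make_pipeline
from sklearn.svm import LinearSVC

# Toy "gazette sections" standing in for DODF documents (hypothetical data).
train_texts = [
    "contract awarded to company for road maintenance",
    "contract signed for school building services",
    "appointment of public servant to administrative office",
    "appointment and exoneration of civil servants",
]
train_labels = ["contracts", "contracts", "personnel", "personnel"]

# tf-idf turns each document into a sparse weighted word-frequency vector;
# LinearSVC then fits a linear decision boundary over those features.
clf = make_pipeline(TfidfVectorizer(), LinearSVC())
clf.fit(train_texts, train_labels)

print(clf.predict(["exoneration of servant from office"])[0])
```

Swapping `TfidfVectorizer` for `CountVectorizer` would give the "SVM + word counts" variant from the same table.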
Task | Dataset | Model | Metric Name | Metric Value | Global Rank
---|---|---|---|---|---
Text Classification | DODF Data | ULMFiT (pre-trained vocab, no gradual unfreezing) | Weighted F1 | 0.9257 | #1
Text Classification | DODF Data | ULMFiT (pre-trained vocab, no gradual unfreezing) | Average F1 | 0.8918 | #1
Text Classification | DODF Data | ULMFiT (pre-trained vocab) | Weighted F1 | 0.9088 | #2
Text Classification | DODF Data | ULMFiT (pre-trained vocab) | Average F1 | 0.8374 | #5
Text Classification | DODF Data | SVM + word counts (pre-trained vocab) | Weighted F1 | 0.9049 | #3
Text Classification | DODF Data | SVM + word counts (pre-trained vocab) | Average F1 | 0.8782 | #2
Text Classification | DODF Data | ULMFiT (no pre-trained vocab) | Weighted F1 | 0.8974 | #4
Text Classification | DODF Data | ULMFiT (no pre-trained vocab) | Average F1 | 0.8469 | #4
Text Classification | DODF Data | SVM + tf-idf (no pre-trained vocab) | Weighted F1 | 0.8917 | #5
Text Classification | DODF Data | SVM + tf-idf (no pre-trained vocab) | Average F1 | 0.8755 | #3