Official Gazettes are a rich source of information relevant to the public. Their careful examination may lead to the detection of fraud and irregularities, which in turn may help prevent the mismanagement of public funds. This paper presents a dataset composed of documents from the Official Gazette of the Federal District, containing both samples annotated with their document source and unlabeled samples. We train, evaluate, and compare a transfer-learning model based on ULMFiT against traditional bag-of-words models that use SVM and Naive Bayes as classifiers. We find the SVM to be competitive: its performance is only marginally worse than ULMFiT's, while it trains and runs inference much faster and is less computationally expensive. Finally, we conduct an ablation analysis to assess the performance impact of each ULMFiT component.
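
To make the bag-of-words baseline concrete, here is a minimal sketch of a multinomial Naive Bayes classifier over word counts, written with the Python standard library only. It is an illustration under stated assumptions, not the paper's implementation: the class name `BagOfWordsNaiveBayes`, the whitespace tokenizer, the add-one smoothing, and the toy texts and labels are all hypothetical and do not come from the DODF dataset.

```python
import math
from collections import Counter, defaultdict


def tokenize(text):
    # Lowercase whitespace tokenization; a real pipeline would do more.
    return text.lower().split()


class BagOfWordsNaiveBayes:
    """Multinomial Naive Bayes over word counts with add-one smoothing."""

    def fit(self, texts, labels):
        self.class_counts = Counter(labels)       # documents per class
        self.word_counts = defaultdict(Counter)   # word counts per class
        for text, label in zip(texts, labels):
            self.word_counts[label].update(tokenize(text))
        self.vocab = {w for counts in self.word_counts.values() for w in counts}
        return self

    def predict(self, text):
        total_docs = sum(self.class_counts.values())
        best_label, best_score = None, -math.inf
        for label, n_docs in self.class_counts.items():
            counts = self.word_counts[label]
            denom = sum(counts.values()) + len(self.vocab)
            score = math.log(n_docs / total_docs)  # log prior
            for word in tokenize(text):
                if word in self.vocab:  # ignore words unseen in training
                    score += math.log((counts[word] + 1) / denom)
            if score > best_score:
                best_label, best_score = label, score
        return best_label


# Hypothetical toy data; not taken from the DODF dataset.
train_texts = [
    "health department hires nurses",
    "hospital supplies purchase notice",
    "school teachers appointment",
    "education department opens school enrollment",
]
train_labels = ["health", "health", "education", "education"]

model = BagOfWordsNaiveBayes().fit(train_texts, train_labels)
prediction = model.predict("notice of hospital nurses hiring")
print(prediction)
```

The sketch ignores word order entirely, which is the defining property of bag-of-words models. An SVM baseline would consume the same kind of count (or tf-idf) vectors but learn a separating hyperplane instead of per-class word probabilities.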

Results from the Paper


Task: Text Classification. Dataset: DODF Data.

| Model | Weighted F1 | Rank (Weighted F1) | Average F1 | Rank (Average F1) |
|---|---|---|---|---|
| ULMFiT (pre-trained vocab, no gradual unfreezing) | 0.9257 | 1 | 0.8918 | 1 |
| ULMFiT (pre-trained vocab) | 0.9088 | 2 | 0.8374 | 5 |
| SVM + word counts (pre-trained vocab) | 0.9049 | 3 | 0.8782 | 2 |
| ULMFiT (no pre-trained vocab) | 0.8974 | 4 | 0.8469 | 4 |
| SVM + tf-idf (no pre-trained vocab) | 0.8917 | 5 | 0.8755 | 3 |
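
The two metrics reported above can rank models differently, so it may help to see how they are computed. The sketch below assumes that "Average F1" means the unweighted (macro) mean of per-class F1 scores and that "Weighted F1" weights each class by its number of true samples; the source does not define either term, so both readings are assumptions. The function names and the toy labels are hypothetical.

```python
from collections import Counter


def per_class_f1(y_true, y_pred):
    """Return {label: F1}, computed one-vs-rest for each label in y_true."""
    scores = {}
    for label in set(y_true):
        tp = sum(t == label and p == label for t, p in zip(y_true, y_pred))
        fp = sum(t != label and p == label for t, p in zip(y_true, y_pred))
        fn = sum(t == label and p != label for t, p in zip(y_true, y_pred))
        denom = 2 * tp + fp + fn
        scores[label] = 2 * tp / denom if denom else 0.0
    return scores


def macro_f1(y_true, y_pred):
    # Every class counts equally, regardless of how many samples it has.
    scores = per_class_f1(y_true, y_pred)
    return sum(scores.values()) / len(scores)


def weighted_f1(y_true, y_pred):
    # Each class contributes in proportion to its number of true samples.
    scores = per_class_f1(y_true, y_pred)
    support = Counter(y_true)
    return sum(scores[c] * support[c] for c in scores) / len(y_true)


# Hypothetical labels: class "a" is frequent, class "b" is rare.
y_true = ["a"] * 8 + ["b"] * 2
y_pred = ["a"] * 8 + ["a", "b"]  # one of the two rare samples is missed

macro = macro_f1(y_true, y_pred)
weighted = weighted_f1(y_true, y_pred)
print(f"macro F1 = {macro:.4f}, weighted F1 = {weighted:.4f}")
```

In this toy case the single error on the rare class lowers the macro score more than the weighted one. Under the same assumption, a model that ranks higher on Weighted F1 than on Average F1, as ULMFiT (pre-trained vocab) does in the table, may be performing relatively worse on less frequent classes.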

Methods