Classifying Illegal Activities on Tor Network Based on Web Textual Contents
The freedom of the Deep Web offers a safe place where people can express themselves anonymously, but it also allows them to conduct illegal activities. In this paper, we present and make publicly available a new dataset of active Darknet domains, which we call "Darknet Usage Text Addresses" (DUTA). We built DUTA by sampling the Tor network over two months and manually labeling each address into 26 classes. Using DUTA, we compared two well-known text representation techniques, each combined with three different supervised classifiers, to categorize Tor hidden services. We also fixed the pipeline elements and identified the aspects that have a critical influence on the classification results. We found that combining a TF-IDF word representation with a Logistic Regression classifier achieves 96.6% accuracy under 10-fold cross-validation and a macro F1 score of 93.7% when classifying a subset of illegal activities from DUTA. The good performance of the classifier might support potential tools to help the authorities detect these activities.
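To make the evaluation setup concrete, the following is a minimal sketch of a TF-IDF plus Logistic Regression pipeline scored with stratified k-fold cross-validation, reporting accuracy and macro F1. It assumes scikit-learn. The toy corpus, the class names, the `evaluate` helper, and all hyperparameters are hypothetical stand-ins chosen for illustration. They are not taken from DUTA and are not the authors' actual preprocessing or settings, which the abstract does not specify.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_validate
from sklearn.pipeline import make_pipeline

# Hypothetical toy corpus standing in for hidden-service page text.
# These strings and class names are invented; they are NOT DUTA samples.
TOY_CORPUS = {
    "counterfeit": [
        "fake passport and id cards for sale",
        "buy counterfeit banknotes high quality",
        "forged driving licence fast delivery",
        "replica documents passport id sale",
    ],
    "hosting": [
        "anonymous web hosting with onion domain",
        "cheap server hosting for hidden services",
        "secure hosting plans and domain setup",
        "onion hosting provider with server uptime",
    ],
    "forum": [
        "community forum register to post threads",
        "discussion board login and reply to threads",
        "forum rules new members post introduction",
        "message board threads replies and members",
    ],
}
TEXTS = [doc for docs in TOY_CORPUS.values() for doc in docs]
LABELS = [name for name, docs in TOY_CORPUS.items() for _ in docs]


def evaluate(texts, labels, n_splits=10, seed=0):
    """Return mean cross-validated accuracy and macro F1 for the pipeline."""
    # Word-level TF-IDF features feeding a Logistic Regression classifier.
    # Default vectorizer settings are an assumption, not the paper's choice.
    model = make_pipeline(TfidfVectorizer(), LogisticRegression(max_iter=1000))
    # Stratified folds keep the class proportions similar in every split.
    folds = StratifiedKFold(n_splits=n_splits, shuffle=True, random_state=seed)
    result = cross_validate(
        model, texts, labels, cv=folds, scoring=["accuracy", "f1_macro"]
    )
    return {
        "accuracy": float(result["test_accuracy"].mean()),
        "macro_f1": float(result["test_f1_macro"].mean()),
    }


if __name__ == "__main__":
    # The toy corpus has only four documents per class, so use 4 folds here
    # instead of the 10 folds mentioned in the abstract.
    print(evaluate(TEXTS, LABELS, n_splits=4))
```

Macro F1 averages the per-class F1 scores with equal weight, so it is more sensitive to poorly classified rare classes than accuracy is. That is one plausible reason to report both figures on an imbalanced label set.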