8TAGS

Introduced by Dadas et al. in Evaluation of Sentence Representations in Polish

8TAGS is a corpus created for evaluating sentence representations in Polish. It consists of approximately 50,000 sentences, each annotated with one of eight topic labels: film, history, food, medicine, motorization, work, sport, and technology. The dataset was generated automatically from the headlines and short descriptions of articles posted on the Polish social networking site wykop.pl. The sentences are cleaned and tokenized, and only unambiguous ones are kept: each is tagged with exactly one of the selected categories and is longer than 30 characters. The dataset serves as a topic classification task, with classification accuracy as the reported metric, as part of the evaluation of sentence representations in Polish.
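The two selection criteria in the description (exactly one of the eight topics, more than 30 characters) and the accuracy metric can be illustrated with a minimal sketch. This is not the authors' pipeline: the function names `keep_example` and `accuracy`, the English label strings, and the shape of the input (a text plus a list of tags) are all assumptions made for illustration.

```python
# Hypothetical sketch of the selection rules described above; not the authors' code.
# The English label strings are assumed; the tags on the source site may differ.
TOPICS = {"film", "history", "food", "medicine",
          "motorization", "work", "sport", "technology"}


def keep_example(text, tags, min_chars=30):
    """Return the topic label if the example is unambiguous and long enough, else None."""
    matched = TOPICS.intersection(tags)
    # Keep only examples tagged with exactly one of the selected categories.
    if len(matched) != 1:
        return None
    # Keep only texts strictly longer than min_chars characters.
    if len(text.strip()) <= min_chars:
        return None
    return next(iter(matched))


def accuracy(gold, predicted):
    """Fraction of predictions that match the gold labels."""
    if len(gold) != len(predicted):
        raise ValueError("gold and predicted must have the same length")
    if not gold:
        raise ValueError("cannot compute accuracy on an empty list")
    correct = sum(g == p for g, p in zip(gold, predicted))
    return correct / len(gold)
```

An example carrying two of the eight topics, or one with a very short headline, is dropped by `keep_example`, which mirrors the "unambiguous" and length conditions stated in the description.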

License


  • Unknown

Languages