The 8TAGS dataset is a corpus created for evaluating sentence representations in Polish. It consists of approximately 50,000 sentences, each annotated with one of eight topic labels: film, history, food, medicine, motorization, work, sport, and technology. The dataset was generated automatically by extracting sentences from the headlines and short descriptions of articles posted on the Polish social networking site wykop.pl. The corpus contains only cleaned, tokenized, unambiguous sentences: each is longer than 30 characters and tagged with exactly one of the selected categories. Results on this dataset are reported as classification accuracy.
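The selection criteria and the evaluation metric described above can be illustrated with a minimal Python sketch. It is not the original extraction pipeline: the helper names `keep_sentence` and `accuracy` are hypothetical, the English label strings are taken from the description rather than from the actual wykop.pl tags, and stripping surrounding whitespace before measuring length is an assumption.

```python
# The eight topic labels named in the dataset description.
# Assumption: the real wykop.pl tags may be spelled differently (e.g. in Polish).
TOPICS = {
    "film", "history", "food", "medicine",
    "motorization", "work", "sport", "technology",
}

MIN_LENGTH = 30  # sentences must be longer than 30 characters


def keep_sentence(text, tags):
    """Return the topic label if the sentence qualifies, otherwise None.

    A sentence qualifies when exactly one of its tags is among the eight
    selected topics and its text is longer than MIN_LENGTH characters.
    """
    matched = TOPICS.intersection(tags)
    if len(matched) == 1 and len(text.strip()) > MIN_LENGTH:
        return next(iter(matched))
    return None


def accuracy(gold, predicted):
    """Fraction of sentences whose predicted label equals the gold label."""
    if len(gold) != len(predicted):
        raise ValueError("gold and predicted must have the same length")
    if not gold:
        raise ValueError("cannot compute accuracy on an empty list")
    correct = sum(g == p for g, p in zip(gold, predicted))
    return correct / len(gold)
```

Under these assumptions, a sentence carrying two of the selected tags is treated as ambiguous and dropped, while tags outside the eight topics are ignored.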