Dataset of hate speech annotated on Internet forum posts in English at sentence-level. The source forum in Stormfront, a large online community of white nacionalists. A total of 10,568 sentence have been been extracted from Stormfront and classified as conveying hate speech or not.
162 PAPERS • 1 BENCHMARK
MuMiN is a misinformation graph dataset containing rich social media data (tweets, replies, users, images, articles, hashtags), spanning 21 million tweets belonging to 26 thousand Twitter threads, each of which have been semantically linked to 13 thousand fact-checked claims across dozens of topics, events and domains, in 41 different languages, spanning more than a decade.
4 PAPERS • 3 BENCHMARKS
Wikipedia Title is a dataset for learning character-level compositionality from the character visual characteristics. It consists of a collection of Wikipedia titles in Chinese, Japanese or Korean labelled with the category to which the article belongs.
3 PAPERS • NO BENCHMARKS YET
We present CSL, a large-scale Chinese Scientific Literature dataset, which contains the titles, abstracts, keywords and academic fields of 396,209 papers. To our knowledge, CSL is the first scientific document dataset in Chinese.
1 PAPER • NO BENCHMARKS YET
Large-scale Chinese legal dataset for judgment prediction. \dataset contains more than 2.6 million criminal cases published by the Supreme People's Court of China, which are several times larger than other datasets in existing works on judgment prediction.
Wiki-zh is an annotated Chinese dataset for domain detection extracted from Wikipedia. It includes texts from 7 different domains: “Business and Commerce” (BUS), “Government and Politics” (GOV), “Physical and Mental Health” (HEA), “Law and Order” (LAW), “Lifestyle” (LIF), “Military” (MIL), and “General Purpose” (GEN). It contains 26,280 documents split into training, validation and test.