Dataset of hate speech annotated on Internet forum posts in English at sentence-level. The source forum in Stormfront, a large online community of white nacionalists. A total of 10,568 sentence have been been extracted from Stormfront and classified as conveying hate speech or not.
162 PAPERS • 1 BENCHMARK
Torque is an English reading comprehension benchmark built on 3.2k news snippets with 21k human-generated questions querying temporal relationships.
16 PAPERS • 1 BENCHMARK
TalkDown is a labelled dataset for condescension detection in context. The dataset is derived from Reddit, a set of online communities that is diverse in content and tone. The dataset is built from COMMENT and REPLY pairs in which the REPLY targets a specific quoted span (QUOTED) in the COMMENT as being condescending. The dataset contains 3,255 positive (condescend) samples and 3,255 negative ones.
7 PAPERS • NO BENCHMARKS YET
A dataset of games played in the card game "Cards Against Humanity" (CAH), by human players, derived from the online CAH labs. Each round includes the cards presented to users - a "black" prompt with a blank or question and 10 "white" punchlines as possible responses, and which punchline was picked by a player each round, along with text and metadata.
1 PAPER • NO BENCHMARKS YET
The ComMA Dataset v0.2 is a multilingual dataset annotated with a hierarchical, fine-grained tagset marking different types of aggression and the "context" in which they occur. The context, here, is defined by the conversational thread in which a specific comment occurs and also the "type" of discursive role that the comment is performing with respect to the previous comment. The initial dataset, being discussed here (and made available as part of the ComMA@ICON shared task), consists of a total 15,000 annotated comments in four languages - Meitei, Bangla, Hindi, and Indian English - collected from various social media platforms such as YouTube, Facebook, Twitter and Telegram. As is usual on social media websites, a large number of these comments are multilingual, mostly code-mixed with English.