A Large Self-Annotated Corpus for Sarcasm

LREC 2018 · Mikhail Khodak, Nikunj Saunshi, Kiran Vodrahalli

We introduce the Self-Annotated Reddit Corpus (SARC), a large corpus for sarcasm research and for training and evaluating systems for sarcasm detection. The corpus has 1.3 million sarcastic statements -- 10 times more than any previous dataset -- and many times more instances of non-sarcastic statements, allowing for learning in both balanced and unbalanced label regimes. Each statement is furthermore self-annotated -- sarcasm is labeled by the author, not an independent annotator -- and provided with user, topic, and conversation context. We evaluate the corpus for accuracy, construct benchmarks for sarcasm detection, and evaluate baseline methods.


Datasets


Introduced in the Paper:

SARC

Results from the Paper


Task              | Dataset           | Model          | Metric Name | Metric Value | Global Rank
------------------|-------------------|----------------|-------------|--------------|------------
Sarcasm Detection | SARC (all-bal)    | Bag-of-Bigrams | Accuracy    | 75.8         | #2
Sarcasm Detection | SARC (pol-bal)    | Bag-of-Bigrams | Accuracy    | 76.5         | #1
Sarcasm Detection | SARC (pol-unbal)  | Bag-of-Words   | Avg F1      | 27.0         | #1
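
The results above use bag-of-n-gram models, scored by accuracy on the balanced splits and by an F1-type metric on the unbalanced split. As a rough illustration only, the sketch below counts word bigrams and computes both metrics on a toy label set. It is not the authors' implementation: the whitespace tokenizer, the function names (bag_of_ngrams, accuracy, f1_score), the example sentence, and the toy labels are all assumptions made for this example, and this page does not state how the listed Avg F1 is averaged.

```python
from collections import Counter


def bag_of_ngrams(text, n=2):
    """Count word n-grams in a lowercased, whitespace-tokenized string."""
    tokens = text.lower().split()
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def accuracy(y_true, y_pred):
    """Fraction of predictions that match the labels."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def f1_score(y_true, y_pred, positive=1):
    """F1 for one class: harmonic mean of precision and recall."""
    tp = sum(t == positive and p == positive for t, p in zip(y_true, y_pred))
    fp = sum(t != positive and p == positive for t, p in zip(y_true, y_pred))
    fn = sum(t == positive and p != positive for t, p in zip(y_true, y_pred))
    if tp == 0:
        # No true positives: precision and recall are both zero (or undefined).
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


# Toy unbalanced labels (made up, not SARC data): 1 = sarcastic, 0 = not.
y_true = [1, 0, 0, 0, 0, 0, 0, 0, 0, 0]
always_negative = [0] * len(y_true)  # hypothetical majority-class predictor

features = bag_of_ngrams("oh great another monday")
acc = accuracy(y_true, always_negative)
f1 = f1_score(y_true, always_negative)
```

On this toy split the always-negative predictor scores an accuracy of 0.9 but an F1 of 0.0 on the sarcastic class, which suggests why accuracy alone says little in an unbalanced label regime.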
