This is an entity-level Twitter Sentiment Analysis dataset. For each message, the task is to judge the sentiment of the entire sentence towards a given entity. For example, A outperforms B is positive for entity A but negative for entity B. The dataset contains ~70K labeled training messages and 1K labeled validation messages. It is available online for free on Kaggle.
4 PAPERS • 1 BENCHMARK
RETWEET is a dataset of tweets and overall predominant sentiment of their replies.
2 PAPERS • 1 BENCHMARK
#chinahate dataset contains a total of 2,172,333 tweets hashtagged #china posted during the time it was collected. It is designed for the task of hate speech detection.
1 PAPER • NO BENCHMARKS YET
Sentiment detection remains a pivotal task in natural language processing, yet its development in Arabic lags due to a scarcity of training materials compared to English. Addressing this gap, we present ArSen-20, a benchmark dataset tailored to propel Arabic sentiment detection forward. ArSen-20 comprises 20,000 professionally labeled tweets sourced from Twitter, focusing on the theme of COVID-19 and spanning the period from 2020 to 2023. Beyond tweet content, the dataset incorporates metadata associated with the user, enriching the contextual understanding. ArSen-20 offers a comprehensive resource to foster advancements in Arabic sentiment analysis and facilitate research in this critical domain.
The dataset contains 30 million cryptocurrency-related tweets from 10.10.2020 to 3.3.2021. See https://github.com/meakbiyik/ask-who-not-what for more details.
SentimentArcs’ reference corpus for novels consists of 25 narratives selected to create a diverse set of well recognized novels that can serve as a benchmark for future studies. The composition of the corpora was limited by the effect of copyright laws as well as historical imbalances. Most works were obtained from US and Australian Gutenberg Projects. The corpora is expected to grow in size and diversity over time.