A large-scale curated dataset, growing daily, of over 152 million tweets of COVID-19 chatter generated from January 1st to April 4th at the time of writing.
10 PAPERS • 6 BENCHMARKS
The Overruling dataset is a law dataset corresponding to the task of determining when a sentence is overruling a prior decision. This is a binary classification task, where positive examples are overruling sentences and negative examples are non-overruling sentences extracted from legal opinions. In law, an overruling sentence is a statement that nullifies a previous case decision as a precedent, by a constitutionally valid statute or a decision by the same or higher ranking court which establishes a different rule on the point of law involved. The Overruling dataset consists of 2,400 sentences.
10 PAPERS • 1 BENCHMARK
JGLUE, Japanese General Language Understanding Evaluation, is built to measure the general NLU ability in Japanese.
7 PAPERS • NO BENCHMARKS YET
A Tunisian Arabizi sentiment analysis dataset, collected from social networks, preprocessed for analytical studies, and manually annotated by Tunisian native speakers.
OMICS is an extensive collection of knowledge for indoor service robots gathered from internet users. Currently, it contains 48 tables capturing different sorts of knowledge. Each tuple of the Help table maps a user desire to a task that may meet the desire (e.g., ⟨ “feel thirsty”, “by offering drink” ⟩). Each tuple of the Tasks/Steps table decomposes a task into several steps (e.g., ⟨ “serve a drink”, 0. “get a glass”, 1. “get a bottle”, 2. “fill glass from bottle”, 3. “give glass to person” ⟩). Given this, OMICS offers useful knowledge about the hierarchy of naturalistic instructions, where a high-level user request (e.g., “serve a drink”) can be reduced to lower-level tasks (e.g., “get a glass”, ⋯). Another feature of OMICS is that the elements of any tuple in an OMICS table are semantically related according to a predefined template. This facilitates the semantic interpretation of OMICS tuples.
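As a rough illustration (hypothetical Python stand-ins, not the official OMICS schema), the Help and Tasks/Steps tables can be chained to reduce a user desire to concrete steps:

```python
# Hypothetical in-memory stand-ins for two OMICS tables; the real corpus
# ships as relational tables, and these names are purely illustrative.
help_table = {
    "feel thirsty": "serve a drink",   # user desire -> task that meets it
}
steps_table = {
    "serve a drink": [                 # task -> ordered decomposition
        "get a glass",
        "get a bottle",
        "fill glass from bottle",
        "give glass to person",
    ],
}

def reduce_desire(desire: str) -> list[str]:
    """Chain the two tables: desire -> task -> low-level steps."""
    task = help_table[desire]
    return steps_table.get(task, [task])  # fall back to the bare task

print(reduce_desire("feel thirsty"))
# ['get a glass', 'get a bottle', 'fill glass from bottle', 'give glass to person']
```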
6 PAPERS • NO BENCHMARKS YET
A question type classification dataset with 6 classes for questions about a person, location, numeric information, etc. The test split has 500 questions, and the training split has 5,452 questions.
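This appears to be the TREC question classification dataset, which is available on the Hugging Face Hub; a minimal loading sketch, noting that the exact dataset ID and field names can vary across library versions:

```python
from datasets import load_dataset

# "trec" is the Hub ID for this question-classification dataset; field
# names (e.g., coarse vs. fine labels) differ between dataset versions.
ds = load_dataset("trec")
print(ds["train"].num_rows, ds["test"].num_rows)  # expected: 5452 500
print(ds["train"][0])                             # inspect one example
```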
6 PAPERS • 2 BENCHMARKS
Taiga is a corpus in which text sources and their meta-information are collected according to popular ML tasks.
5 PAPERS • NO BENCHMARKS YET
MuMiN is a misinformation graph dataset containing rich social media data (tweets, replies, users, images, articles, hashtags). It comprises 21 million tweets belonging to 26 thousand Twitter threads, each of which has been semantically linked to 13 thousand fact-checked claims across dozens of topics, events and domains, in 41 different languages, covering more than a decade.
4 PAPERS • 3 BENCHMARKS
A morpho-syntactically annotated Tunisian Arabish Corpus (TArC).
4 PAPERS • NO BENCHMARKS YET
This is an entity-level Twitter sentiment analysis dataset. For each message, the task is to judge the sentiment of the entire sentence towards a given entity. For example, “A outperforms B” is positive for entity A but negative for entity B. The dataset contains ~70K labeled training messages and 1K labeled validation messages. It is available online for free on Kaggle.
4 PAPERS • 1 BENCHMARK
Emotion recognition is a more fine-grained special case of sentiment analysis. In this task, the result is not produced in terms of polarity (positive or negative) or a rating (from 1 to 5), but at a more detailed level, where results are expressed as emotions such as sadness, enjoyment, anger, disgust, fear and surprise. Emotion recognition plays a critical role in measuring the brand value of a product by recognizing the specific emotions in customers’ comments. In this study, we achieved two targets. First and foremost, we built a standard Vietnamese Social Media Emotion Corpus (UIT-VSMEC) with 6,927 human-annotated sentences carrying six emotion labels, contributing to emotion recognition research in Vietnamese, a low-resource language in Natural Language Processing (NLP). Secondly, we assessed and measured machine learning and deep neural network models on UIT-VSMEC; of these, the Convolutional Neural Network (CNN) model achieved the best performance.
Benchmark dataset for abstracts and titles of 100,000 ArXiv scientific papers. This dataset contains 10 classes and is balanced (exactly 10,000 per class). The classes include subcategories of computer science, physics, and math.
FreSaDa is a French satire dataset for cross-domain satire detection, which is composed of 11,570 articles from the news domain. The dataset samples have been split into training, validation and test, such that the training publication sources are distinct from the validation and test publication sources. This gives rise to a cross-domain (cross-source) satire detection task.
3 PAPERS • NO BENCHMARKS YET
Hyperpartisan News Detection was a dataset created for PAN @ SemEval 2019 Task 4. Given a news article text, decide whether it follows a hyperpartisan argumentation, i.e., whether it exhibits blind, prejudiced, or unreasoning allegiance to one party, faction, cause, or person.
3 PAPERS • 1 BENCHMARK
Multilabeled News Dataset (MN-DS) is a dataset for news classification. It consists of 10,917 articles in 17 first-level and 109 second-level categories from 215 media sources.
The Medical Abstracts dataset contains 14,438 medical abstracts describing 5 different classes of patient conditions, with the entire dataset annotated. The dataset is split into training and test sets.
MuLD (Multitask Long Document Benchmark) is a set of 6 NLP tasks where the inputs consist of at least 10,000 words. The benchmark covers a wide variety of task types including translation, summarization, question answering, and classification. Additionally, there is a range of output lengths, from a single-word classification label all the way up to an output longer than the input text.
3 PAPERS • 6 BENCHMARKS
SV-Ident comprises 4,248 sentences from social science publications in English and German. The data is the official data for the Shared Task: “Survey Variable Identification in Social Science Publications” (SV-Ident) 2022. Sentences are labeled with variables that are mentioned either explicitly or implicitly.
3 PAPERS • 2 BENCHMARKS
Text Classification Attack Benchmark (TCAB) is a dataset for analyzing, understanding, detecting, and labeling adversarial attacks against text classifiers. TCAB includes 1.5 million attack instances, generated by twelve adversarial attacks targeting three classifiers trained on six source datasets for sentiment analysis and abuse detection in English. The process of generating attacks is automated, so that TCAB can easily be extended to incorporate new text attacks and better classifiers as they are developed.
Wikipedia Title is a dataset for learning character-level compositionality from the visual characteristics of characters. It consists of a collection of Wikipedia titles in Chinese, Japanese or Korean, labelled with the category to which the article belongs.
Antonio Gulli’s corpus of news articles is a collection of more than 1 million news articles. The articles have been gathered from more than 2,000 news sources by ComeToMyHead over more than 1 year of activity. ComeToMyHead is an academic news search engine which has been running since July 2004. The dataset is provided by the academic community for research purposes in data mining (clustering, classification, etc.), information retrieval (ranking, search, etc.), XML, data compression, data streaming, and any other non-commercial activity.
2 PAPERS • NO BENCHMARKS YET
Biographical is a semi-supervised dataset for relation extraction (RE). The dataset, which is aimed towards digital humanities (DH) and historical research, is automatically compiled by aligning sentences from Wikipedia articles with matching structured data from sources including Pantheon and Wikidata.
The dataset is annotated with stance towards one topic, namely, the independence of Catalonia.
2 PAPERS • 3 BENCHMARKS
The Headline Grouping dataset is a binary classification dataset of pairs of news headlines. For each pair of headlines, the binary label indicates whether the two headlines are part of the same group (and describe the same underlying event) or whether they are in distinct groups. The dataset contains a total of 20k annotated headline pairs, further split into train, validation and test portions.
A dataset for evaluating text classification, domain adaptation, and active learning models. The dataset consists of 22,660 documents (tweets) collected in 2018 and 2019. It spans four domains: Alzheimer's, Parkinson's, Cancer, and Diabetes.
KanHope is a code mixed hope speech dataset for equality, diversity, and inclusion in Kannada, an under-resourced Dravidian language. The dataset consists of 6,176 user-generated comments in code mixed Kannada crawled from YouTube and manually labelled as bearing hope speech or not-hope speech.
2 PAPERS • 1 BENCHMARK
Dataset from the Law Stack Exchange, as used in "Parameter-Efficient Legal Domain Adaptation" (Li et al., 2022). The Law Stack Exchange is a community forum website containing questions and answers on legal topics. Each question is linked with its associated tags (e.g., "copyright" or "criminal-law"), yielding a multi-label classification task.
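A minimal sketch of the multi-label setup this implies, using scikit-learn's MultiLabelBinarizer (the question/tag pairs are toy illustrations, not the authors' pipeline or data):

```python
from sklearn.preprocessing import MultiLabelBinarizer

# Toy question/tag pairs mirroring the question-to-tags linking described above.
questions = [
    "Can I quote a song lyric in my novel?",
    "Is it legal to record a phone call?",
]
tags = [["copyright"], ["criminal-law", "privacy"]]

mlb = MultiLabelBinarizer()
Y = mlb.fit_transform(tags)  # one binary indicator column per tag
print(mlb.classes_)          # ['copyright' 'criminal-law' 'privacy']
print(Y)                     # [[1 0 0], [0 1 1]]
```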
RSDD-Time is a dataset of 598 manually annotated self-reported depression diagnosis posts from Reddit that include temporal information about the diagnosis. Annotations include whether a mental health condition is present and how recently the diagnosis happened. Additionally, the dataset includes exact temporal spans that relate to the date of diagnosis.
We develop a primary dataset based on our task of suicide or depression classification. This dataset is web-scraped from Reddit. We collect our data from subreddits using the Python Reddit API. We specifically scrape from two subreddits, r/SuicideWatch and r/Depression. The dataset contains 1,895 total posts. We utilize two fields from the scraped data: the original text of the post as our inputs, and the subreddit it belongs to as labels. Posts from r/SuicideWatch are labeled as suicidal, and posts from r/Depression are labeled as depressed. We make this dataset and the web-scraping script available in our code.
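A hedged sketch of such a collection step with PRAW, the Python Reddit API Wrapper (credentials and the post limit are placeholders; the authors' actual script may differ):

```python
import praw

# Placeholder credentials; register an app at reddit.com/prefs/apps to obtain them.
reddit = praw.Reddit(
    client_id="YOUR_CLIENT_ID",
    client_secret="YOUR_CLIENT_SECRET",
    user_agent="suicide-depression-dataset-script",
)

posts, labels = [], []
for sub, label in [("SuicideWatch", "suicidal"), ("depression", "depressed")]:
    for submission in reddit.subreddit(sub).new(limit=1000):
        posts.append(submission.selftext or submission.title)  # post text as input
        labels.append(label)                                    # subreddit as label
```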
Dataset constructed from YouTube comments (72,098 comments posted by 43,859 users on 623 videos relevant to the crisis).
Tweets and items from psychological scales for sexism detection with counterfactual examples.
The dataset contains 2,500 manually stance-labeled tweets, 1,250 for each candidate (Joe Biden and Donald Trump). These tweets were sampled from an unlabeled set of English tweets related to the 2020 US Presidential election collected by the authors' research team. Through the Twitter Streaming API, the authors collected data using election-related hashtags and keywords. Between January 2020 and September 2020, over 5 million tweets were collected, not including quotes and retweets.
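A sketch of that collection setup using the legacy tweepy v3 streaming interface, which matched the Twitter Streaming API at the time (credentials and the track keywords shown are placeholders; the authors' exact keyword list is not given here):

```python
import tweepy

# Placeholder credentials from a Twitter developer account.
auth = tweepy.OAuthHandler("CONSUMER_KEY", "CONSUMER_SECRET")
auth.set_access_token("ACCESS_TOKEN", "ACCESS_SECRET")

class ElectionListener(tweepy.StreamListener):
    def on_status(self, status):
        # Skip retweets, mirroring the "not including ... retweets" note above.
        if not hasattr(status, "retweeted_status"):
            print(status.id_str, status.text)

stream = tweepy.Stream(auth=auth, listener=ElectionListener())
# Hypothetical election-related keywords; restrict to English tweets.
stream.filter(track=["#Election2020", "Biden", "Trump"], languages=["en"])
```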
2 PAPERS • 2 BENCHMARKS
The AI-GA (Artificial Intelligence Generated Abstracts) dataset is a collection of abstracts and titles, with half of the abstracts being AI-generated and the other half being original. This dataset is designed to be used for research and experimentation in the field of natural language processing, particularly in the context of language generation and machine learning.
1 PAPER • NO BENCHMARKS YET
This project contains instructions and code to reconstruct a dataset for the development and evaluation of forensic tools for detecting machine-generated text on social media.
In NLP, text classification is one of the primary problems we try to solve, and its uses in language analysis are indisputable. The lack of labeled training data makes these tasks harder in low-resource languages like Amharic. Collecting, labeling, annotating, and adding value to this kind of data will encourage junior researchers, schools, and machine learning practitioners to apply existing classification models to their language. In this short paper, we introduce an Amharic text classification dataset consisting of more than 50k news articles categorized into 6 classes. The dataset is made available with easy baseline results to encourage further studies and better-performing experiments.
1 PAPER • 1 BENCHMARK
A syllogism is a common form of deductive reasoning that requires precisely two premises and one conclusion. The Avicenna corpus is a benchmark for syllogistic NLI and syllogistic NLG.
BanglaEmotion is a manually annotated Bangla emotion corpus which incorporates the diversity of fine-grained emotion expressions in social media text. Fine-grained emotion labels are considered, namely Sadness, Happiness, Disgust, Surprise, Fear and Anger, which are, according to Paul Ekman (1999), the six basic emotion categories. For this task, a large amount of raw text data was collected from users’ comments in two Facebook groups (Ekattor TV and Airport Magistrates) and from the public posts of the popular blogger and activist Dr. Imran H Sarker. These comments are mostly reactions to ongoing socio-political issues and to the economic successes and failures of Bangladesh. A total of 32,923 comments were scraped from the three aforementioned sources, out of which 6,314 comments were annotated into the six categories.
Dataset of 5,591 labeled issue tickets, originally created by Herzig et al. in “It’s Not a Bug, It’s a Feature: How Misclassification Impacts Bug Prediction”.
We present CSL, a large-scale Chinese Scientific Literature dataset, which contains the titles, abstracts, keywords and academic fields of 396,209 papers. To our knowledge, CSL is the first scientific document dataset in Chinese.
A dataset of games of the card game "Cards Against Humanity" (CAH) played by human players, derived from the online CAH labs. Each round includes the cards presented to the player (a "black" prompt with a blank or question and 10 "white" punchlines as possible responses), the punchline the player picked, and the associated text and metadata.
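One plausible way to model a round, as a dataclass sketch (the field names and the example values are illustrative, not the dataset's actual column names):

```python
from dataclasses import dataclass

@dataclass
class CAHRound:
    black_prompt: str        # "black" prompt card with a blank or question
    white_cards: list[str]   # the 10 candidate "white" punchlines shown
    picked_index: int        # index (0-9) of the punchline the player chose

# Hypothetical example round with placeholder punchlines.
round_ = CAHRound(
    black_prompt="____ never fails to liven up the party.",
    white_cards=[f"punchline {i}" for i in range(10)],
    picked_index=3,
)
```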
The CareerCoach 2022 gold standard is available for download in the NIF and JSON formats, and draws upon documents from a corpus of over 99,000 education courses retrieved from 488 different education providers.
Caselaw4 is a dataset of 350k common law judicial decisions from the U.S. Caselaw Access Project, of which 250k have been automatically annotated with binary outcome labels of AFFIRM and REVERSE.
Large-scale Chinese legal dataset for judgment prediction. The dataset contains more than 2.6 million criminal cases published by the Supreme People's Court of China, several times more than other datasets used in existing works on judgment prediction.
The Dissonance Twitter Dataset is a dataset of tweets annotated for dissonance.
The Failure Mode Classification dataset released in the paper "MWO2KG and Echidna: Constructing and exploring knowledge graphs from maintenance data" by Stewart et al. The goal is to label a given observation (made by a maintainer) with the corresponding Failure Mode Code.
FTR-18 is a multilingual rumour dataset on football transfer news. Transfer rumours are continuously published by sports media; they can harm the image of a player or a club, or increase the player's market value. The dataset includes transfer articles written in English, Spanish and Portuguese, together with Twitter reactions related to the transfer rumours. FTR-18 is suited for rumour classification tasks and enables research on the linguistic patterns used in sports journalism.
A collection of the various forms of Federal Reserve communications.
This dataset contains news headlines relevant to key forex pairs: AUDUSD, EURCHF, EURUSD, GBPUSD, and USDJPY. The data was extracted from the reputable platforms Forex Live and FXstreet over a period of 86 days, from January to May 2023. The dataset comprises 2,291 unique news headlines. Each headline includes an associated forex pair, timestamp, source, author, URL, and the corresponding article text. Data was collected using web scraping techniques executed via a custom service on a virtual machine. This service periodically retrieves the latest news for a specified forex pair (ticker) from each platform, parsing all available information. The collected data is then processed to extract details such as the article's timestamp, author, and URL. The URL is further used to retrieve the full text of each article. This data acquisition process repeats approximately every 15 minutes.
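A minimal sketch of the periodic acquisition loop described above (the endpoints, query parameter, and parsing are hypothetical; the actual scraping service is not public):

```python
import time
import requests

# Hypothetical per-platform endpoints; the real service scrapes the two sites' pages.
FEEDS = {
    "forexlive": "https://www.forexlive.com/",
    "fxstreet": "https://www.fxstreet.com/",
}

PAIRS = ["AUDUSD", "EURCHF", "EURUSD", "GBPUSD", "USDJPY"]

def fetch_latest(ticker: str) -> list[dict]:
    """Pull the newest headlines for one forex pair from each platform."""
    items = []
    for source, url in FEEDS.items():
        resp = requests.get(url, params={"q": ticker}, timeout=30)
        resp.raise_for_status()
        # Parsing of timestamp, author, article URL, and full text would happen here.
        items.append({"source": source, "ticker": ticker, "html": resp.text[:200]})
    return items

while True:
    for pair in PAIRS:
        fetch_latest(pair)
    time.sleep(15 * 60)  # repeat roughly every 15 minutes, as described
```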