The AROT-COV23 (ARabic Original Tweets on COVID-19 as of 2023) dataset is a large-scale collection of original Arabic tweets related to COVID-19, covering the period from January 1, 2020 to January 5, 2023. The dataset contains approximately 500,000 original tweets, providing a rich source of information on how Arabic-speaking Twitter users discussed and shared information about the pandemic. For more details on this dataset, please see the paper in the citation section below.

The tweets in the AROT-COV23 dataset were collected using a set of COVID-19-related keywords and then filtered to ensure that they were written in Arabic. We selected three keywords for the data request; the details are in the table below.

| Keyword | Description |
| --- | --- |
| COVID-19 | Coronavirus disease 2019 (COVID-19), the name announced by the World Health Organization and the most common name for this disease. |
| مرض فيروس كورونا | Arabic for "Coronavirus Disease" (COVID-19). |
| فيروس كوفيد | Arabic for "Severe Acute Respiratory Syndrome Coronavirus 2 (SARS-CoV-2)". |
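As an illustration only, the three keywords could be combined into one search query string. The sketch below assumes Twitter API v2 query operators (`OR`, `lang:`, `-is:retweet`) and assumes that "original" means "not a retweet". The helper name `build_query` is hypothetical, and the query we actually used is the one in Twitter_API_Request.py, which may differ.

```python
# Keywords copied from the table above.
KEYWORDS = ["COVID-19", "مرض فيروس كورونا", "فيروس كوفيد"]


def build_query(keywords, lang="ar", exclude_retweets=True):
    """Combine keywords into a single search query string (hypothetical helper)."""
    # Quote each keyword so multi-word phrases are matched as a unit.
    phrases = " OR ".join(f'"{kw}"' for kw in keywords)
    parts = [f"({phrases})", f"lang:{lang}"]
    if exclude_retweets:
        # Assumption: an "original" tweet is one that is not a retweet.
        parts.append("-is:retweet")
    return " ".join(parts)


query = build_query(KEYWORDS)
print(query)
```

The `lang:` operator relies on Twitter's own language detection, which corresponds to the `lang` field described under Features.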

Features

The dataset has the following features:

⚠️ Please note that, due to the restrictions on content redistribution in Twitter's Developer Agreement and Policy, the data we make publicly available does not include tweet text or personal user data.

| Field | Type | Description |
| --- | --- | --- |
| tweet id | string | The unique identifier of the requested Tweet. |
| author id | string | The unique identifier of the Tweet's author. |
| created_at | date | Creation time of the Tweet. |
| lang | string | Language of the Tweet, if detected by Twitter. |
| like_count | int | The number of likes on this Tweet. |
| quote_count | int | The number of times this Tweet has been quoted. |
| reply_count | int | The number of replies to this Tweet. |
| retweet_count | int | The number of retweets of this Tweet. |
| tweet❌ | string | The actual UTF-8 text of the Tweet. |
| user_verified | boolean | Indicates whether the author is a verified Twitter user. |
| followers_count | int | The number of followers of the author. |
| following_count | int | The number of accounts the author follows. |
| tweet_count | int | Total number of Tweets by the author. |
| listed_count | int | The number of public lists that the author is a member of. |
| name❌ | string | The name of the user. |
| username❌ | string | The Twitter screen name, handle, or alias. |
| user_created_at | date | The UTC datetime at which the user account was created. |
| description❌ | string | The text of the user's profile description (bio). |
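A minimal sketch of reading the published fields with the types listed above, using only the Python standard library. The column names are assumed to match the field names in the table, which may not hold for the actual CSV header. The in-memory sample stands in for AROT-COV23_publish.csv, and its values are copied from the example record in this document. The helper name `parse_row` is hypothetical.

```python
import csv
import io

INT_FIELDS = [
    "like_count", "quote_count", "reply_count", "retweet_count",
    "followers_count", "following_count", "tweet_count", "listed_count",
]


def parse_row(row):
    """Convert one CSV row (all strings) into typed values (hypothetical helper)."""
    typed = dict(row)  # IDs, lang, and dates are left as strings
    for name in INT_FIELDS:
        typed[name] = int(row[name])
    typed["user_verified"] = row["user_verified"].strip().lower() == "true"
    return typed


# Stand-in for the real file; header names are an assumption.
SAMPLE = (
    "tweet id,author id,created_at,lang,like_count,quote_count,reply_count,"
    "retweet_count,user_verified,followers_count,following_count,tweet_count,"
    "listed_count,user_created_at\n"
    "1233338555252006918,805692634127736832,2020-02-28 10:30:00+00:00,ar,"
    "25,1,0,4,True,667414,7,121945,671,2016-12-05T08:37:47.000Z\n"
)

rows = [parse_row(r) for r in csv.DictReader(io.StringIO(SAMPLE))]
print(rows[0]["tweet id"], rows[0]["like_count"], rows[0]["user_verified"])
```

Keeping the identifiers as strings matches the table and avoids precision loss: a 19-digit ID such as the one above is larger than 2^53, so it would not survive conversion to a floating-point number.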

Download

You can download the dataset from here.

  • AROT-COV23_publish.csv - Contains all features except those marked with ❌ above.
  • AROT-COV23_id_only.csv - Contains only the tweet IDs and their authors' IDs.

The source code we used to request tweet data from the Twitter API is available here.

  • Twitter_API_Request.py - A Python script for accessing the Twitter API to collect data.
  • Data_Processing.py - A Python script for converting tweet JSON data to CSV format.
  • Tweets_Preprocessing.py - A Python script for pre-processing tweet data.

To access the complete version of AROT-COV23, please fill out this request form.

Examples

| Field | Value |
| --- | --- |
| tweet id | 1233338555252006918 |
| author id | 805692634127736832 |
| created_at | 2020-02-28 10:30:00+00:00 |
| lang | ar |
| like_count | 25 |
| quote_count | 1 |
| reply_count | 0 |
| retweet_count | 4 |
| tweet | في الصور الملتقطة 27 فبراير 2020، 2 من المرضى... |
| user_verified | True |
| followers_count | 667414 |
| following_count | 7 |
| tweet_count | 121945 |
| listed_count | 671 |
| name | CGTN Arabic |
| username | cgtnarabic |
| user_created_at | 2016-12-05T08:37:47.000Z |
| description | شبكة تلفزيون الصين الدولية مؤسسة إعلامية فريدة... |
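In the example record, `created_at` and `user_created_at` use two different timestamp styles. The sketch below shows one way to parse both into timezone-aware datetimes so they can be compared. The helper name `parse_timestamp` is hypothetical, and it is an assumption that every row in the files follows one of these two styles.

```python
from datetime import datetime


def parse_timestamp(value):
    """Parse either timestamp style from the example into an aware datetime."""
    # Older Python versions reject a trailing "Z", so map it to an explicit offset.
    if value.endswith("Z"):
        value = value[:-1] + "+00:00"
    return datetime.fromisoformat(value)


created = parse_timestamp("2020-02-28 10:30:00+00:00")
joined = parse_timestamp("2016-12-05T08:37:47.000Z")

# Age of the account, in whole days, at the time the tweet was posted.
account_age_days = (created - joined).days
print(account_age_days)
```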

Citation

If you use this dataset in your research, please cite the following paper:

@inproceedings{xu2023arotcov,
  title={{AROT}-{COV}23: A Dataset of 500K Original Arabic Tweets on {COVID}-19},
  author={Cheng Xu and Nan Yan},
  booktitle={4th Workshop on African Natural Language Processing},
  year={2023},
  url={https://openreview.net/forum?id=aUZhVQBl2W}
}
