The AROT-COV23 (ARabic Original Tweets on COVID-19 as of 2023) dataset is a large-scale collection of original Arabic tweets related to COVID-19, covering the period from January 1, 2020 to January 5, 2023. The dataset contains approximately 500,000 original tweets, providing a rich source of information on how Arabic-speaking Twitter users discussed and shared information about the pandemic. For more details on this dataset, please see the paper in the citation section below.
The tweets in the AROT-COV23 dataset were collected using a set of COVID-19-related keywords in Arabic. The tweets were then filtered to ensure that they were written in Arabic. We selected three keywords related to COVID-19 for the data request; the details are in the table below.
Keyword | Description |
---|---|
COVID-19 | Coronavirus disease 2019 (COVID-19), the name announced by the World Health Organization and the most commonly used name for the disease. |
مرض فيروس كورونا | This is Arabic for "Coronavirus Disease" (COVID-19). |
فيروس كوفيد | This is Arabic for "COVID virus", i.e. Severe Acute Respiratory Syndrome Coronavirus 2 (SARS-CoV-2). |
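
As a rough illustration of how the three keywords could be combined into a single search query, here is a minimal sketch. It is not the request code used for AROT-COV23. The `build_query` helper is hypothetical, and it assumes that "original" means excluding retweets, replies, and quotes, expressed with Twitter API v2 search operators (`lang:`, `-is:retweet`, `-is:reply`, `-is:quote`).

```python
# Keywords copied from the table above.
KEYWORDS = ["COVID-19", "مرض فيروس كورونا", "فيروس كوفيد"]


def build_query(keywords, lang="ar", originals_only=True):
    """Build a search query string (hypothetical helper, not the authors' code)."""
    # Quote each keyword so multi-word phrases are matched as phrases.
    quoted = " OR ".join(f'"{kw}"' for kw in keywords)
    parts = [f"({quoted})", f"lang:{lang}"]
    if originals_only:
        # Assumption: "original" means not a retweet, reply, or quote.
        parts += ["-is:retweet", "-is:reply", "-is:quote"]
    return " ".join(parts)


query = build_query(KEYWORDS)
print(query)
```

The sketch only builds the query string; sending it to the API requires credentials and a client, which are outside its scope.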
The dataset has the following features:
⚠️ Please note that, due to the restrictions imposed by Twitter's Developer Agreement and Policy on content redistribution, the data that we make publicly available does not include tweet text or private user data; the fields marked ❌ in the table below are omitted.
Field | Type | Description |
---|---|---|
tweet id | string | The unique identifier of the requested Tweet. |
author id | string | The unique identifier of this user. |
created_at | date | Creation time of the Tweet. |
lang | string | Language of the Tweet, if detected by Twitter. |
like_count | int | The number of likes on this tweet. |
quote_count | int | The number of times this tweet has been quoted. |
reply_count | int | The number of replies to this tweet. |
retweet_count | int | The number of retweets of this tweet. |
tweet❌ | string | The actual UTF-8 text of the Tweet. |
user_verified | boolean | Indicates if this user is a verified Twitter User. |
followers_count | int | The number of followers of the author. |
following_count | int | The number of accounts the author follows. |
tweet_count | int | Total number of tweets by the author. |
listed_count | int | The number of public lists that this user is a member of. |
name❌ | string | The name of the user. |
username❌ | string | The Twitter screen name, handle, or alias. |
user_created_at | date | The UTC datetime that the user account was created. |
description❌ | string | The text of this user’s profile description (bio). |
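
The field types above can be applied when reading the public release. The following is a minimal sketch that assumes the data is distributed as CSV with column names exactly as listed in the table (the file format and the in-memory `SAMPLE_CSV` are assumptions for illustration; the values are taken from the example record in this README). The fields marked ❌ are left out.

```python
import csv
import io

# Integer-typed fields from the table above.
INT_FIELDS = [
    "like_count", "quote_count", "reply_count", "retweet_count",
    "followers_count", "following_count", "tweet_count", "listed_count",
]


def parse_row(row):
    """Convert one CSV row (all strings) to the types listed in the table."""
    out = dict(row)
    for field in INT_FIELDS:
        out[field] = int(row[field])
    out["user_verified"] = row["user_verified"].strip().lower() == "true"
    # IDs stay as strings: they are 64-bit values that lose precision as floats.
    return out


# Assumed CSV layout; a real file would be opened with open(path, encoding="utf-8").
SAMPLE_CSV = (
    "tweet id,author id,created_at,lang,like_count,quote_count,reply_count,"
    "retweet_count,user_verified,followers_count,following_count,tweet_count,"
    "listed_count,user_created_at\n"
    "1233338555252006918,805692634127736832,2020-02-28 10:30:00+00:00,ar,25,1,0,"
    "4,True,667414,7,121945,671,2016-12-05T08:37:47.000Z\n"
)

rows = [parse_row(r) for r in csv.DictReader(io.StringIO(SAMPLE_CSV))]
print(rows[0]["like_count"], rows[0]["user_verified"])
```

Keeping `tweet id` and `author id` as strings matches the `string` type in the table and avoids silent rounding in tools that read large integers as floating point.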
You can download the dataset from here.
You can get the source code of our request for tweet data to the Twitter API from here.
If you want to access the complete version of AROT-COV23, you need to fill out this form to request it.
Field | Value |
---|---|
tweet id | 1233338555252006918 |
author id | 805692634127736832 |
created_at | 2020-02-28 10:30:00+00:00 |
lang | ar |
like_count | 25 |
quote_count | 1 |
reply_count | 0 |
retweet_count | 4 |
tweet | في الصور الملتقطة 27 فبراير 2020، 2 من المرضى... |
user_verified | True |
followers_count | 667414 |
following_count | 7 |
tweet_count | 121945 |
listed_count | 671 |
name | CGTN Arabic |
username | cgtnarabic |
user_created_at | 2016-12-05T08:37:47.000Z |
description | شبكة تلفزيون الصين الدولية مؤسسة إعلامية فريدة... |
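
Note that `created_at` and `user_created_at` in the example record use two different timestamp layouts. The sketch below shows one way to normalize both to timezone-aware UTC datetimes with the standard library; `parse_utc` is a hypothetical helper, and the account-age calculation is only an illustration of working with the parsed values.

```python
from datetime import datetime, timezone


def parse_utc(ts):
    """Parse either timestamp layout from the example record into UTC."""
    # Replace a trailing "Z" so fromisoformat also works before Python 3.11.
    dt = datetime.fromisoformat(ts.replace("Z", "+00:00"))
    return dt.astimezone(timezone.utc)


created_at = parse_utc("2020-02-28 10:30:00+00:00")
user_created_at = parse_utc("2016-12-05T08:37:47.000Z")

# Whole days between account creation and the tweet.
account_age_days = (created_at - user_created_at).days
print(account_age_days)
```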
If you use this dataset in your research, please cite the following paper:
@inproceedings{xu2023arotcov,
title={{AROT}-{COV}23: A Dataset of 500K Original Arabic Tweets on {COVID}-19},
author={Cheng Xu and Nan Yan},
booktitle={4th Workshop on African Natural Language Processing},
year={2023},
url={https://openreview.net/forum?id=aUZhVQBl2W}
}