BaitBuster-Bangla: A Comprehensive Dataset for Clickbait Detection in Bangla with Multi-Feature and Multi-Modal Analysis Dataset

The dataset contains 253,070 records with 18 features. The features fall into four categories: Metadata, Primary Data, Engagement Stats, and Label.

The Metadata category contains basic information about the channel and video, such as their unique identifiers, the date and time of publication, and thumbnail URLs. The Primary Data category contains the title and description of the video. Columns marked "Processed" hold the cleaned data after denoising, deduplication, and debiasing, ready for further analysis. The Engagement Stats category contains user engagement metrics for each video.

The Label category contains predefined auto labels, human-annotated labels, and AI-generated pseudo labels. Auto labels are assigned automatically at the channel level, based on a review of each channel's titles, descriptions, and thumbnails over time: channels with consistently misleading, exaggerated, or sensationalized content were labeled clickbait, and channels focused on delivering factual information without emotional appeals were labeled non-clickbait. Human labels were assigned manually by volunteer annotators, and AI labels were generated by a fine-tuned AI model.

The following table gives an overview and definition of each feature.

| **Feature Type**         | **Feature Name**     | **Data Type** | **Definition**                                                 |
|--------------------------|----------------------|---------------|----------------------------------------------------------------|
| Metadata                 | channel_id           | string        | ID of the YouTube channel                                      |
| Metadata                 | channel_name         | string        | Name of the YouTube channel                                    |
| Metadata                 | channel_url          | string        | URL of the YouTube channel                                     |
| Metadata                 | video_id             | string        | ID of the video                                                |
| Metadata                 | publishedAt          | datetime      | Date and time when the video was published                     |
| Primary Data             | title                | string        | Title of the video                                             |
| Primary Data (Processed) | title_debiased       | string        | Debiased title of the video                                    |
| Primary Data             | description          | string        | Description of the video                                       |
| Primary Data (Processed) | description_debiased | string        | Debiased description of the video                              |
| Metadata                 | url                  | string        | URL of the video                                               |
| Engagement Stats         | viewCount            | int           | Number of views the video has received                         |
| Engagement Stats         | commentCount         | int           | Number of comments on the video                                |
| Engagement Stats         | likeCount            | int           | Number of likes on the video                                   |
| Engagement Stats         | dislikeCount         | int           | Number of dislikes on the video                                |
| Metadata                 | thumbnails           | string        | URL of the thumbnail for the video                             |
| Label                    | auto_labeled         | string        | Label assigned automatically from a review of the channel      |
| Label (Processed)        | human_labeled        | string        | Label assigned by a human annotator                            |
| Label (Processed)        | ai_labeled           | string        | Label assigned by an AI model fine-tuned on human-labeled data |
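
The feature table above can be turned into a small schema check before any analysis. The following is a minimal sketch, not the authors' loading code: it assumes the data is distributed as a CSV file whose header uses the feature names exactly as listed, that `publishedAt` is an ISO 8601 timestamp (possibly ending in `Z`), and that missing values appear as empty cells. The helper names `SCHEMA`, `missing_columns`, `coerce_row`, and `read_records` are hypothetical.

```python
import csv
from datetime import datetime

# Column name -> (feature type, data type), copied from the feature table.
SCHEMA = {
    "channel_id": ("Metadata", "string"),
    "channel_name": ("Metadata", "string"),
    "channel_url": ("Metadata", "string"),
    "video_id": ("Metadata", "string"),
    "publishedAt": ("Metadata", "datetime"),
    "title": ("Primary Data", "string"),
    "title_debiased": ("Primary Data (Processed)", "string"),
    "description": ("Primary Data", "string"),
    "description_debiased": ("Primary Data (Processed)", "string"),
    "url": ("Metadata", "string"),
    "viewCount": ("Engagement Stats", "int"),
    "commentCount": ("Engagement Stats", "int"),
    "likeCount": ("Engagement Stats", "int"),
    "dislikeCount": ("Engagement Stats", "int"),
    "thumbnails": ("Metadata", "string"),
    "auto_labeled": ("Label", "string"),
    "human_labeled": ("Label (Processed)", "string"),
    "ai_labeled": ("Label (Processed)", "string"),
}


def missing_columns(header):
    """Return the schema columns absent from a CSV header, in schema order."""
    present = set(header)
    return [name for name in SCHEMA if name not in present]


def coerce_row(row):
    """Convert one csv.DictReader row to typed values; empty cells become None."""
    typed = {}
    for name, (_, dtype) in SCHEMA.items():
        raw = (row.get(name) or "").strip()
        if raw == "":
            typed[name] = None
        elif dtype == "int":
            # Tolerate counts exported as floats, e.g. "12.0" (assumption).
            typed[name] = int(float(raw))
        elif dtype == "datetime":
            # Assumption: ISO 8601 timestamps, possibly with a trailing "Z".
            typed[name] = datetime.fromisoformat(raw.replace("Z", "+00:00"))
        else:
            typed[name] = raw
    return typed


def read_records(handle):
    """Yield typed records from an open CSV file handle."""
    reader = csv.DictReader(handle)
    missing = missing_columns(reader.fieldnames or [])
    if missing:
        raise ValueError(f"missing columns: {missing}")
    for row in reader:
        yield coerce_row(row)
```

To use it, open the downloaded file with `open(path, newline="", encoding="utf-8")` and iterate over `read_records(f)`; the actual file name and format should be taken from the Mendeley record. Failing early on a missing column is a deliberate choice here, since a renamed or dropped label column would otherwise surface only later as silent `None` values.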

## Paper
* **Data in Brief**: https://doi.org/10.1016/j.dib.2024.110239
* **arXiv Link**: https://arxiv.org/abs/2310.11465

## Dataset
* **Mendeley**: https://data.mendeley.com/datasets/3c6ztw5nft/

## Citation
### MLA
```Al Imran, Abdullah, Md Sakib Hossain Shovon, and M. F. Mridha. "BaitBuster-Bangla: A Comprehensive Dataset for Clickbait Detection in Bangla with Multi-Feature and Multi-Modal Analysis." Data in Brief (2024): 110239.```

### BibTeX
```
@article{IMRAN2024110239,
title = {BaitBuster-Bangla: A Comprehensive Dataset for Clickbait Detection in Bangla with Multi-Feature and Multi-Modal Analysis},
journal = {Data in Brief},
pages = {110239},
year = {2024},
issn = {2352-3409},
doi = {https://doi.org/10.1016/j.dib.2024.110239},
url = {https://www.sciencedirect.com/science/article/pii/S2352340924002105},
author = {Abdullah Al Imran and Md Sakib Hossain Shovon and M.F. Mridha},
keywords = {Bangla clickbait dataset, YouTube clickbait, Multi-modal clickbait dataset, Multi-feature clickbait dataset, Bangla natural language processing, User behavior modeling, Social Media Analysis},
abstract = {This study presents a large multi-modal Bangla YouTube clickbait dataset consisting of 253,070 data points collected through an automated process using the YouTube API and Python web automation frameworks. The dataset contains 18 diverse features categorized into metadata, primary content, engagement statistics, and labels for individual videos from 58 Bangla YouTube channels. A rigorous preprocessing step has been applied to denoise, deduplicate, and remove bias from the features, ensuring unbiased and reliable analysis. As the largest and most robust clickbait corpus in Bangla to date, this dataset provides significant value for natural language processing and data science researchers seeking to advance modeling of clickbait phenomena in low-resource languages. Its multi-modal nature allows for comprehensive analyses of clickbait across content, user interactions, and linguistic dimensions to develop more sophisticated detection methods with cross-linguistic applications.}
}
```
