NC-SentNoB (Noise Classification on SentNoB)

Introduced by Elahi et al. in A Comparative Analysis of Noise Reduction Methods in Sentiment Analysis on Noisy Bangla Texts

This is a multilabel dataset for noise identification, introduced in the paper "A Comparative Analysis of Noise Reduction Methods in Sentiment Analysis on Noisy Bangla Texts", which was accepted at the 9th Workshop on Noisy and User-generated Text (W-NUT 2024), co-located with EACL 2024.

  • Annotated by 4 native Bangla speakers with a 90% trustworthiness score.
  • Fleiss' Kappa Score: 0.69
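
For reference, Fleiss' kappa measures how far a fixed number of raters agree beyond what chance would produce. The sketch below computes it from a matrix of rating counts. The function name `fleiss_kappa` and the toy ratings are illustrative assumptions, not code or data from the paper, and this card does not say how the score was aggregated over the ten noise labels.

```python
def fleiss_kappa(counts):
    """Fleiss' kappa for a rating-count matrix.

    counts[i][j] is the number of raters who put item i in category j.
    Every row must sum to the same number of raters (at least 2).
    """
    n_items = len(counts)
    n_raters = sum(counts[0])
    if any(sum(row) != n_raters for row in counts):
        raise ValueError("every item must be rated by the same number of raters")

    n_categories = len(counts[0])
    total = n_items * n_raters

    # Share of all assignments that went to each category.
    p_j = [sum(row[j] for row in counts) / total for j in range(n_categories)]

    # Observed agreement per item, then averaged over items.
    p_i = [
        (sum(c * c for c in row) - n_raters) / (n_raters * (n_raters - 1))
        for row in counts
    ]
    p_bar = sum(p_i) / n_items

    # Agreement expected by chance.
    # Note: p_e == 1 (all ratings in one category) makes kappa undefined.
    p_e = sum(p * p for p in p_j)
    return (p_bar - p_e) / (1 - p_e)


# Hypothetical toy data: 4 raters, one binary label (present / absent).
toy_counts = [[3, 1], [1, 3], [4, 0], [0, 4]]
kappa = fleiss_kappa(toy_counts)
```

Values near 1 indicate strong agreement and values near 0 indicate agreement no better than chance.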

Definition of noise categories

Type                  Definition
Local Word            Any regional word, even if it contains a spelling error.
Word Misuse           Wrong use of words or unnecessary repetition of words.
Context/Word Missing  Not enough information or missing words.
Wrong Serial          Wrong order of the words.
Mixed Language        Words in another language. Foreign words that were adopted into the Bangla language over time are excluded from this type.
Punctuation Error     Improper placement of punctuation or missing punctuation. Sentences ending without "।" were excluded from this type.
Spacing Error         Improper use of white space.
Spelling Error        Words that do not follow the spelling of the Bangla Academy Dictionary.
Coined Word           Emoji, symbolic emoji, link.
Others                Noise that does not fall into the categories mentioned above.
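
Because the dataset is multilabel, one text can carry several of these noise types at once. A common way to represent that is a multi-hot vector with one position per type, sketched below. The ordering of `NOISE_TYPES` and the helper names `encode` and `decode` are assumptions made for illustration; the column names and file format of the released data are not described here.

```python
# The ten noise types from the definitions above, in an assumed order.
NOISE_TYPES = [
    "Local Word",
    "Word Misuse",
    "Context/Word Missing",
    "Wrong Serial",
    "Mixed Language",
    "Punctuation Error",
    "Spacing Error",
    "Spelling Error",
    "Coined Word",
    "Others",
]


def encode(labels):
    """Turn a collection of noise-type names into a multi-hot list of 0/1."""
    unknown = set(labels) - set(NOISE_TYPES)
    if unknown:
        raise ValueError(f"unknown noise types: {sorted(unknown)}")
    return [1 if noise_type in labels else 0 for noise_type in NOISE_TYPES]


def decode(vector):
    """Turn a multi-hot list back into the list of noise-type names."""
    return [noise_type for noise_type, flag in zip(NOISE_TYPES, vector) if flag]


# A text with both foreign words and misspellings gets two active positions.
example_vector = encode(["Mixed Language", "Spelling Error"])
```

A text with no noise maps to an all-zero vector, which is what separates multilabel encoding from single-label classification.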

Statistics of NC-SentNoB per noise class

Class                 Instances        #Word/Instance
Local Word            2,084 (0.136%)   16.05
Word Misuse           661 (0.043%)     18.55
Context/Word Missing  550 (0.036%)     13.19
Wrong Serial          69 (0.005%)      15.30
Mixed Language        6,267 (0.410%)   17.91
Punctuation Error     5,988 (0.391%)   17.25
Spacing Error         2,456 (0.161%)   18.78
Spelling Error        5,817 (0.380%)   17.30
Coined Word           549 (0.036%)     15.45
Others                1,263 (0.083%)   16.52
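
Per-class statistics of this kind can be derived from the labeled instances as sketched below. This is a minimal sketch under stated assumptions: words are counted by splitting on whitespace (the authors' tokenization is not given here), and the function name `class_statistics` and the toy samples are hypothetical.

```python
from collections import defaultdict


def class_statistics(samples):
    """Compute per-class counts, shares, and mean words per instance.

    samples is a list of (text, labels) pairs, where labels is a set of
    noise-type names. An instance counts once for every label it carries.
    """
    instance_counts = defaultdict(int)
    word_totals = defaultdict(int)

    for text, labels in samples:
        n_words = len(text.split())  # assumption: whitespace tokenization
        for label in labels:
            instance_counts[label] += 1
            word_totals[label] += n_words

    total_instances = len(samples)
    return {
        label: {
            "instances": instance_counts[label],
            "share": instance_counts[label] / total_instances,
            "words_per_instance": word_totals[label] / instance_counts[label],
        }
        for label in instance_counts
    }


# Hypothetical toy samples (English placeholders, not dataset content).
toy_samples = [
    ("this is a toy sentence", {"Spelling Error"}),
    ("another toy example with seven words here", {"Spelling Error", "Mixed Language"}),
    ("clean text", set()),
    ("one more", {"Mixed Language"}),
]
stats = class_statistics(toy_samples)
```

Since an instance can belong to several classes, the per-class shares need not sum to 1.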

Heatmap of correlation coefficient
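
The values behind such a heatmap can be computed as pairwise correlations between the label columns. The text does not say which coefficient was used; the sketch below assumes the Pearson coefficient on binary indicators (also known as the phi coefficient). The function names and the three-label toy matrix are illustrative only.

```python
from math import sqrt


def pearson(x, y):
    """Pearson correlation of two equal-length numeric sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        return float("nan")  # undefined when a column is constant
    return cov / sqrt(var_x * var_y)


def correlation_matrix(rows):
    """rows is a list of multi-hot label vectors; returns a square matrix."""
    columns = list(zip(*rows))
    return [[pearson(a, b) for b in columns] for a in columns]


# Hypothetical toy label matrix with three labels per instance.
toy_rows = [
    [1, 1, 0],
    [1, 1, 0],
    [0, 0, 1],
    [0, 0, 1],
]
matrix = correlation_matrix(toy_rows)
```

In this toy matrix the first two labels always co-occur and the third never appears with them, so the entries sit at the two extremes of the coefficient's range.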

Citation

If you use this dataset, please cite the following paper:

@misc{elahi2024comparative,
      title={A Comparative Analysis of Noise Reduction Methods in Sentiment Analysis on Noisy Bangla Texts}, 
      author={Kazi Toufique Elahi and Tasnuva Binte Rahman and Shakil Shahriar and Samir Sarker and Md. Tanvir Rouf Shawon and G. M. Shahariar},
      year={2024},
      eprint={2401.14360},
      archivePrefix={arXiv},
      primaryClass={cs.CL}
}

License


  • cc-by-sa-4.0
