The Vashantor dataset consists of 32,500 sentences covering five regions: Chittagong, Noakhali, Sylhet, Barishal, and Mymensingh. Regional text comes in two formats, "Bangla" and "Banglish"; the core data also includes English. Each region and format combination has a fixed number of training, test, and validation samples. The dataset details are as follows:
Specifics of the Core Data:
| Type | Bangla | Banglish | English |
|---|---|---|---|
| Train | 1875 | 1875 | 1875 |
| Test | 375 | 375 | 375 |
| Validation | 250 | 250 | 250 |
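The split sizes in the core data table imply a fixed train/test/validation ratio. The following is a minimal sketch that derives that ratio from the listed counts; the names `SPLIT_SIZES` and `split_proportions` are illustrative and not part of the dataset.

```python
# Split sizes for one language column, copied from the core data table.
SPLIT_SIZES = {"train": 1875, "test": 375, "validation": 250}


def split_proportions(sizes):
    """Return each split's share of the column total as a fraction."""
    total = sum(sizes.values())
    return {name: count / total for name, count in sizes.items()}


proportions = split_proportions(SPLIT_SIZES)
for name, share in proportions.items():
    print(f"{name}: {share:.0%}")
```

With the counts above, each column holds 2,500 sentences split 75% / 15% / 10%.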
Specifics of the Regional Data:
| Region | Type | Train | Test | Validation |
|---|---|---|---|---|
| Chittagong | Bangla | 1875 | 375 | 250 |
| | Banglish | 1875 | 375 | 250 |
| Noakhali | Bangla | 1875 | 375 | 250 |
| | Banglish | 1875 | 375 | 250 |
| Sylhet | Bangla | 1875 | 375 | 250 |
| | Banglish | 1875 | 375 | 250 |
| Barishal | Bangla | 1875 | 375 | 250 |
| | Banglish | 1875 | 375 | 250 |
| Mymensingh | Bangla | 1875 | 375 | 250 |
| | Banglish | 1875 | 375 | 250 |
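As a sanity check, the per-split counts in the core and regional tables can be reconciled with the stated total of 32,500 sentences. The sketch below assumes that every table cell counts sentences and that the core and regional subsets are counted separately; all variable names are illustrative.

```python
# Per-column split sizes, identical in every row of both tables.
SPLIT_SIZES = {"train": 1875, "test": 375, "validation": 250}

# Columns of the core data table.
CORE_LANGUAGES = ["Bangla", "Banglish", "English"]

# Rows of the regional data table: five regions, two formats each.
REGIONS = ["Chittagong", "Noakhali", "Sylhet", "Barishal", "Mymensingh"]
REGIONAL_TYPES = ["Bangla", "Banglish"]

per_column = sum(SPLIT_SIZES.values())                            # 2,500
core_total = len(CORE_LANGUAGES) * per_column                     # 7,500
regional_total = len(REGIONS) * len(REGIONAL_TYPES) * per_column  # 25,000
total = core_total + regional_total

print(f"core={core_total}, regional={regional_total}, total={total}")
```

Under these assumptions the core data contributes 7,500 sentences and the regional data 25,000, which matches the 32,500 total.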