CSVC-Net: Code-Switched Voice Command Classification using Deep CNN-LSTM Network

Colloquial Bengali has adopted many English words due to colonial influence. In conversational Bengali, it is quite common to speak in a mixture of English and Bengali, a phenomenon termed code-switching (CS). Because the use of CS is ever-increasing, a voice command classifier often has to map a single base command to many different variants spoken in different mixtures of the two languages. Existing work on Bengali speech has focused primarily on single-word classification and has largely been unable to capture the complex semantic relationships expressed in full sentences. This paper proposes ‘CSVC-Net’, a CNN-LSTM based architecture for classifying spoken commands that exhibit code-switching between Bengali and English. To reflect this scenario, it also presents a newly curated dataset named ‘Banglish’, containing 3,840 audio files of spoken computer commands belonging to 11 classes, with 64 variations in total. The proposed pipeline passes the input audio signal through a series of transformation and augmentation steps, enabling the model to achieve an accuracy of 92.08% on the curated dataset. Furthermore, the robustness of the proposed model is assessed by comparing it with different architectures and by testing it under different noise levels, where it retains promising accuracy, suggesting that the model is applicable in real-life scenarios.
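The abstract mentions testing under different noise levels but does not describe the procedure. As an illustration only, the sketch below shows one common way to build such a test: mixing white Gaussian noise into a clean signal at a chosen signal-to-noise ratio (SNR). The function names, the choice of white noise, and the synthetic 440 Hz tone are assumptions made for this example, not details taken from the paper.

```python
import math
import random


def add_noise_at_snr(signal, snr_db, rng=None):
    """Return `signal` plus white Gaussian noise scaled to the requested SNR in dB.

    Hypothetical helper for illustration; the paper's own noise setup is not
    specified in the abstract.
    """
    if rng is None:
        rng = random.Random(0)  # fixed seed so the example is reproducible
    signal_power = sum(x * x for x in signal) / len(signal)
    noise = [rng.gauss(0.0, 1.0) for _ in signal]
    noise_power = sum(n * n for n in noise) / len(noise)
    # Choose the scale so that signal_power / scaled_noise_power == 10 ** (snr_db / 10).
    target_noise_power = signal_power / (10 ** (snr_db / 10))
    scale = math.sqrt(target_noise_power / noise_power)
    return [x + scale * n for x, n in zip(signal, noise)]


def measured_snr_db(clean, noisy):
    """Compute the SNR in dB of `noisy` relative to `clean`."""
    signal_power = sum(x * x for x in clean) / len(clean)
    noise_power = sum((y - x) ** 2 for x, y in zip(clean, noisy)) / len(clean)
    return 10 * math.log10(signal_power / noise_power)


# Stand-in for a recorded command: one second of a 440 Hz tone at 16 kHz (assumed values).
sample_rate = 16000
tone = [math.sin(2 * math.pi * 440 * t / sample_rate) for t in range(sample_rate)]
noisy_tone = add_noise_at_snr(tone, snr_db=10)
```

Running a trained classifier on copies of the test set produced at several `snr_db` values would give an accuracy-versus-noise curve of the kind the abstract alludes to.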

Results from the Paper


Task                     Dataset   Model     Metric Name   Metric Value  Global Rank
Voice Query Recognition  Banglish  CSVC-Net  Accuracy (%)  92.08         #1
