CalBERT - Code-mixed Adaptive Language representations using BERT

A code-mixed language is one that combines two or more language varieties in its script or speech. Analysis of code-mixed text is difficult because the language used is not consistent and does not work with existing monolingual approaches. We propose a novel approach to improve Transformer performance by introducing an additional step called "Siamese Pre-Training", which allows pre-trained monolingual Transformers to adapt their language representations to code-mixed languages with only a few examples of code-mixed data. The proposed architectures beat the state-of-the-art F1-score on the Sentiment Analysis for Indian Languages (SAIL) dataset, with the largest improvement being 5.1 points, and also achieve state-of-the-art accuracy on the IndicGLUE Product Reviews dataset, beating the benchmark by 0.4 points.
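The sketch below illustrates what a "Siamese Pre-Training" step could look like under simple assumptions: a single shared pre-trained encoder embeds a code-mixed sentence and its monolingual counterpart, and a cosine-distance objective pulls the two representations together. The model name, the example sentence pairs, the mean-pooling scheme, and the loss choice are illustrative assumptions, not the authors' exact configuration.

import torch
from torch import nn
from transformers import AutoTokenizer, AutoModel

# Assumption: any pre-trained (multilingual) BERT checkpoint can serve as the base encoder
model_name = "bert-base-multilingual-cased"
tokenizer = AutoTokenizer.from_pretrained(model_name)
encoder = AutoModel.from_pretrained(model_name)
optimizer = torch.optim.AdamW(encoder.parameters(), lr=2e-5)

# Hypothetical (code-mixed sentence, monolingual translation) pairs
pairs = [
    ("movie bahut accha tha", "the movie was very good"),
    ("khana bilkul pasand nahi aaya", "I did not like the food at all"),
]

def embed(sentences):
    """Mean-pool the encoder's last hidden states into one vector per sentence."""
    batch = tokenizer(sentences, padding=True, truncation=True, return_tensors="pt")
    hidden = encoder(**batch).last_hidden_state           # (batch, seq_len, dim)
    mask = batch["attention_mask"].unsqueeze(-1).float()  # ignore padding tokens
    return (hidden * mask).sum(1) / mask.sum(1)

encoder.train()
for code_mixed, monolingual in pairs:
    emb_cm = embed([code_mixed])      # same encoder for both inputs (shared weights)
    emb_mono = embed([monolingual])
    # Pull the two representations together: 1 - cosine similarity as the loss
    loss = 1.0 - nn.functional.cosine_similarity(emb_cm, emb_mono).mean()
    loss.backward()
    optimizer.step()
    optimizer.zero_grad()

After this additional pre-training step, the adapted encoder would be fine-tuned on the downstream task (e.g., sentiment classification) in the usual way.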

Task                 Dataset                          Model     Metric Name   Metric Value   Global Rank
Sentiment Analysis   IITP Product Reviews Sentiment   CalBERT   Accuracy      79.4           # 1
Sentiment Analysis   SAIL 2017                        CalBERT   F1            62             # 1
Sentiment Analysis   SAIL 2017                        CalBERT   Precision     61.8           # 1
Sentiment Analysis   SAIL 2017                        CalBERT   Recall        61.8           # 1

Methods