no code implementations • SIGUL (LREC) 2022 • A. Seza Doğruöz, Sunayana Sitaram
There is a growing interest in building language technologies (LTs) for low resource languages (LRLs).
no code implementations • EACL (HumEval) 2021 • Shaily Bhatt, Rahul Jain, Sandipan Dandapat, Sunayana Sitaram
We conduct experiments for evaluating an offensive content detection system and use a data augmentation technique for improving the model using insights from Checklist.
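The CheckList-style augmentation mentioned above can be sketched with template expansion: perturbed variants of labeled examples are generated by filling slots with candidate terms. The templates and filler terms below are purely illustrative, not from the paper.

```python
import re

# Illustrative templates and fill-in terms (not the paper's actual data).
TEMPLATES = [
    "You are such a {adj} {noun}.",
    "All {group} are {adj}.",
]

FILLERS = {
    "adj": ["terrible", "useless"],
    "noun": ["person", "driver"],
    "group": ["people", "neighbors"],
}

def expand(template, fillers):
    """Recursively fill every {slot} placeholder with each candidate term."""
    m = re.search(r"\{(\w+)\}", template)
    if not m:
        return [template]
    out = []
    for term in fillers[m.group(1)]:
        out.extend(expand(template.replace(m.group(0), term, 1), fillers))
    return out

# Every template expands into all slot combinations, yielding augmented data.
augmented = [s for t in TEMPLATES for s in expand(t, FILLERS)]
```

Each generated sentence inherits the template's label, giving cheap labeled variants for fine-tuning the detection model.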
no code implementations • 2 Apr 2024 • Rishav Hada, Varun Gumma, Mohamed Ahmed, Kalika Bali, Sunayana Sitaram
This dataset is created specifically to evaluate LLM-based evaluators, which we refer to as meta-evaluation (METAL).
no code implementations • 1 Mar 2024 • Nishanth Chandran, Sunayana Sitaram, Divya Gupta, Rahul Sharma, Kashish Mittal, Manohar Swaminathan
To solve this problem, we propose Private Benchmarking, a solution where test datasets are kept private and models are evaluated without revealing the test data to the model.
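The core idea can be illustrated with a toy protocol in which an evaluator holds the test set and releases only an aggregate score; the class and protocol below are a minimal sketch for intuition only — the paper's actual solutions rely on confidential computing and secure multi-party computation, not plain Python objects.

```python
class PrivateEvaluator:
    """Toy sketch of private benchmarking: the evaluator holds the test
    examples and never exposes them; only an aggregate score is released."""

    def __init__(self, examples):
        # examples: list of (input_text, gold_label) pairs, kept private
        self._examples = examples

    def evaluate(self, predict_fn):
        """Query the model on each held-out input; return only the accuracy."""
        correct = sum(predict_fn(x) == y for x, y in self._examples)
        return correct / len(self._examples)

# Toy "benchmark" and a trivial model that always answers "4".
evaluator = PrivateEvaluator([("2+2", "4"), ("3+3", "6")])
score = evaluator.evaluate(lambda x: "4")
```

The model owner only ever sees inputs (or, in stricter settings, nothing at all) and the final score, so the test labels cannot leak into training data.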
no code implementations • 23 Feb 2024 • Agrima Seth, Sanchit Ahuja, Kalika Bali, Sunayana Sitaram
Generative models are increasingly being used in various applications, such as text generation, commonsense reasoning, and question-answering.
no code implementations • 21 Feb 2024 • Ashutosh Sathe, Prachi Jain, Sunayana Sitaram
Large vision-language models (VLMs) are being widely adopted in industry and academia.
no code implementations • 12 Feb 2024 • Prachi Jain, Ashutosh Sathe, Varun Gumma, Kabir Ahuja, Sunayana Sitaram
In this work, we aim to modularly debias a pretrained language model across multiple dimensions.
no code implementations • 15 Jan 2024 • Divyanshu Aggarwal, Ashutosh Sathe, Ishaan Watts, Sunayana Sitaram
Prior work on multilingual evaluation has shown that there is a large gap between the performance of LLMs on English and other languages.
no code implementations • 13 Nov 2023 • Sanchit Ahuja, Divyanshu Aggarwal, Varun Gumma, Ishaan Watts, Ashutosh Sathe, Millicent Ochieng, Rishav Hada, Prachi Jain, Maxamed Axmed, Kalika Bali, Sunayana Sitaram
We also perform a study on data contamination and find that several models are likely to be contaminated with multilingual evaluation benchmarks, necessitating approaches to detect and handle contamination while assessing the multilingual performance of LLMs.
no code implementations • 31 Oct 2023 • A. Seza Doğruöz, Sunayana Sitaram, Zheng-Xin Yong
Multilingualism is widespread around the world and code-switching (CSW) is a common practice among different language pairs/tuples across locations and regions.
1 code implementation • 8 Oct 2023 • Hemant Yadav, Erica Cooper, Junichi Yamagishi, Sunayana Sitaram, Rajiv Ratn Shah
That is, the partial rank similarity (PRS) is measured, rather than the individual MOS values as with an L1 loss.
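One way a pairwise rank-based objective of this kind can be written is as a hinge on disagreements between predicted and target orderings; this is an illustrative formulation, not necessarily the paper's exact PRS loss.

```python
def pairwise_rank_loss(pred, target):
    """Penalize pairs whose predicted ordering disagrees with the target
    MOS ordering (illustrative; the paper's PRS formulation may differ)."""
    loss, pairs = 0.0, 0
    for i in range(len(pred)):
        for j in range(i + 1, len(pred)):
            d_true = target[i] - target[j]
            d_pred = pred[i] - pred[j]
            # hinge: zero when the pairwise differences share a sign
            loss += max(0.0, -d_true * d_pred)
            pairs += 1
    return loss / max(pairs, 1)
```

Unlike an L1 loss on absolute MOS values, this objective is unaffected by a constant shift in the predictions, which matters because raters anchor their scores differently across listening tests.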
no code implementations • 14 Sep 2023 • Rishav Hada, Varun Gumma, Adrian de Wynter, Harshita Diddee, Mohamed Ahmed, Monojit Choudhury, Kalika Bali, Sunayana Sitaram
Large Language Models (LLMs) excel in various Natural Language Processing (NLP) tasks, yet their evaluation, particularly in languages beyond the top 20, remains inadequate due to the limitations of existing benchmarks and metrics.
no code implementations • 4 Jul 2023 • Aniket Vashishtha, Kabir Ahuja, Sunayana Sitaram
While understanding and removing gender biases in language models has been a long-standing problem in Natural Language Processing, prior research work has primarily been limited to English.
no code implementations • 28 May 2023 • Akshay Nambi, Vaibhav Balloli, Mercy Ranjit, Tanuja Ganu, Kabir Ahuja, Sunayana Sitaram, Kalika Bali
Our results show substantial advancements in multilingual understanding and generation across a diverse range of languages.
1 code implementation • 22 Mar 2023 • Kabir Ahuja, Harshita Diddee, Rishav Hada, Millicent Ochieng, Krithika Ramesh, Prachi Jain, Akshay Nambi, Tanuja Ganu, Sameer Segal, Maxamed Axmed, Kalika Bali, Sunayana Sitaram
Most studies on generative LLMs have been restricted to English, and it is unclear how capable these models are at understanding and generating text in other languages.
no code implementations • 4 Mar 2023 • Shanu Kumar, Abbaraju Soujanya, Sandipan Dandapat, Sunayana Sitaram, Monojit Choudhury
Zero-shot cross-lingual transfer is promising; however, it has been shown to be sub-optimal, with inferior transfer performance for low-resource languages.
no code implementations • 24 Feb 2023 • Krithika Ramesh, Sunayana Sitaram, Monojit Choudhury
With language models becoming increasingly ubiquitous, it has become essential to address their inequitable treatment of diverse demographic groups and factors.
no code implementations • ACL 2021 • A. Seza Doğruöz, Sunayana Sitaram, Barbara E. Bullock, Almeida Jacqueline Toribio
To fill this gap, we offer a survey of code-switching (C-S) covering the literature in linguistics with a reflection on the key issues in language technologies.
no code implementations • 22 Nov 2022 • Injy Hamed, Amir Hussein, Oumnia Chellah, Shammur Chowdhury, Hamdy Mubarak, Sunayana Sitaram, Nizar Habash, Ahmed Ali
Code-switching poses a number of challenges and opportunities for multilingual automatic speech recognition.
1 code implementation • 21 Oct 2022 • Kabir Ahuja, Sunayana Sitaram, Sandipan Dandapat, Monojit Choudhury
Massively Multilingual Language Models (MMLMs) have recently gained popularity due to their surprising effectiveness in cross-lingual transfer.
no code implementations • nlppower (ACL) 2022 • Kabir Ahuja, Sandipan Dandapat, Sunayana Sitaram, Monojit Choudhury
Although recent Massively Multilingual Language Models (MMLMs) like mBERT and XLMR support around 100 languages, most existing multilingual NLP benchmarks provide evaluation data in only a handful of these languages with little linguistic diversity.
no code implementations • 24 Mar 2022 • Karthikeyan K, Shaily Bhatt, Pankaj Singh, Somak Aditya, Sandipan Dandapat, Sunayana Sitaram, Monojit Choudhury
We compare the TEA CheckLists with CheckLists created with different levels of human intervention.
no code implementations • LREC 2022 • Hemant Yadav, Sunayana Sitaram
Although Automatic Speech Recognition (ASR) systems have achieved human-like performance for a few languages, the majority of the world's languages do not have usable systems due to the lack of large speech datasets to train these models.
no code implementations • 17 Oct 2021 • Anirudh Srinivasan, Sunayana Sitaram, Tanuja Ganu, Sandipan Dandapat, Kalika Bali, Monojit Choudhury
Recent advancements in NLP have given us models like mBERT and XLMR that can serve over 100 languages.
no code implementations • ICON 2021 • Shaily Bhatt, Poonam Goyal, Sandipan Dandapat, Monojit Choudhury, Sunayana Sitaram
Deep Contextual Language Models (LMs) like ELMo, BERT, and their successors dominate the landscape of Natural Language Processing due to their ability to scale across multiple tasks rapidly by pre-training a single model, followed by task-specific fine-tuning.
1 code implementation • EACL 2021 • Mohd Sanad Zaki Rizvi, Anirudh Srinivasan, Tanuja Ganu, Monojit Choudhury, Sunayana Sitaram
Code-mixing is common in multilingual communities around the world, and processing it is challenging due to the lack of labeled and unlabeled data.
1 code implementation • 1 Apr 2021 • Anuj Diwan, Rakesh Vaideeswaran, Sanket Shah, Ankita Singh, Srinivasa Raghavan, Shreya Khare, Vinit Unni, Saurabh Vyas, Akash Rajpuria, Chiranjeevi Yarra, Ashish Mittal, Prasanta Kumar Ghosh, Preethi Jyothi, Kalika Bali, Vivek Seshadri, Sunayana Sitaram, Samarth Bharadwaj, Jai Nanavati, Raoul Nanavati, Karthik Sankaranarayanan, Tejaswi Seeram, Basil Abraham
For this purpose, we provide a total of ~600 hours of transcribed speech data, comprising train and test sets, in these languages, including two code-switched language pairs, Hindi-English and Bengali-English.
1 code implementation • 25 Nov 2020 • Hemant Yadav, Atul Anshuman Singh, Rachit Mittal, Sunayana Sitaram, Yi Yu, Rajiv Ratn Shah
Training a robust system, e.g., Speech to Text (STT), requires large datasets.
no code implementations • 12 Nov 2020 • Sanket Shah, Satarupa Guha, Simran Khanuja, Sunayana Sitaram
Since no publicly available dataset exists for Spoken Term Detection in these languages, we create a new dataset using a publicly available TTS dataset.
no code implementations • ACL 2020 • Simran Khanuja, Sandipan Dandapat, Anirudh Srinivasan, Sunayana Sitaram, Monojit Choudhury
We present results on all these tasks using cross-lingual word embedding models and multilingual models.
no code implementations • 9 Jun 2020 • Gurunath Reddy Madhumani, Sanket Shah, Basil Abraham, Vikas Joshi, Sunayana Sitaram
Recently, we showed that monolingual ASR systems fine-tuned on code-switched data deteriorate in performance on monolingual speech recognition. This is undesirable, as ASR systems deployed in multilingual scenarios should recognize both monolingual and code-switched speech with high accuracy.
no code implementations • 1 Jun 2020 • Sanket Shah, Basil Abraham, Gurunath Reddy M, Sunayana Sitaram, Vikas Joshi
In this work, we show that fine-tuning ASR models on code-switched speech harms performance on monolingual speech.
no code implementations • LREC 2020 • Basil Abraham, Danish Goel, Divya Siddarth, Kalika Bali, Manu Chopra, Monojit Choudhury, Pratik Joshi, Preethi Jyothi, Sunayana Sitaram, Vivek Seshadri
Unfortunately, collecting labelled speech data in any language is an expensive and resource-intensive task.
no code implementations • 26 Apr 2020 • Simran Khanuja, Sandipan Dandapat, Anirudh Srinivasan, Sunayana Sitaram, Monojit Choudhury
We present results on all these tasks using cross-lingual word embedding models and multilingual models.
no code implementations • LREC 2020 • Simran Khanuja, Sandipan Dandapat, Sunayana Sitaram, Monojit Choudhury
Code-mixing is the use of more than one language in the same conversation or utterance, and is prevalent in multilingual communities all over the world.
no code implementations • ICON 2019 • Pratik Joshi, Christain Barnes, Sebastin Santy, Simran Khanuja, Sanket Shah, Anirudh Srinivasan, Satwik Bhattamishra, Sunayana Sitaram, Monojit Choudhury, Kalika Bali
In this paper, we examine and analyze the challenges associated with developing and introducing language technologies to low-resource language communities.
no code implementations • WS 2019 • Sanket Shah, Pratik Joshi, Sebastin Santy, Sunayana Sitaram
Code-switching refers to the alternation of two or more languages in a conversation or utterance and is common in multilingual communities across the world.
no code implementations • 22 Jun 2019 • Brij Mohan Lal Srivastava, Basil Abraham, Sunayana Sitaram, Rupesh Mehta, Preethi Jyothi
While the lack of data adversely affects the performance of end-to-end models, we see promising improvements with MTL and balancing the corpus.
no code implementations • 25 Mar 2019 • Sunayana Sitaram, Khyathi Raghavi Chandu, Sai Krishna Rallabandi, Alan W. Black
Code-switching, the alternation of languages within a conversation or utterance, is a common communicative phenomenon that occurs in multilingual communities across the world.
no code implementations • EMNLP 2018 • Adithya Pratapa, Monojit Choudhury, Sunayana Sitaram
We compare three existing bilingual word embedding approaches, and a novel approach of training skip-grams on synthetic code-mixed text generated through linguistic models of code-mixing, on two tasks - sentiment analysis and POS tagging for code-mixed text.
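The approach of training skip-grams on synthetic code-mixed text can be sketched by generating (center, context) training pairs from tokenized sentences; the toy Hindi-English sentence and window size below are illustrative only, not the paper's generated data.

```python
def skipgram_pairs(sentences, window=2):
    """Generate (center, context) pairs for skip-gram training from
    tokenized sentences."""
    pairs = []
    for tokens in sentences:
        for i, center in enumerate(tokens):
            lo, hi = max(0, i - window), min(len(tokens), i + window + 1)
            for j in range(lo, hi):
                if j != i:
                    pairs.append((center, tokens[j]))
    return pairs

# Toy "synthetic" Hindi-English code-mixed sentence (illustrative only).
synthetic = [["mujhe", "yeh", "movie", "bahut", "pasand", "hai"]]
pairs = skipgram_pairs(synthetic, window=1)
```

Because the synthetic sentences mix both languages, words from the two languages co-occur in the same contexts, which is what lets the resulting embeddings serve code-mixed downstream tasks.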
no code implementations • WS 2018 • Sunit Sivasankaran, Brij Mohan Lal Srivastava, Sunayana Sitaram, Kalika Bali, Monojit Choudhury
Though the best performance gain of 1.2% WER was observed with manually merged phones, we show experimentally that the manual phone merge is not optimal.
no code implementations • ACL 2018 • Adithya Pratapa, Gayatri Bhat, Monojit Choudhury, Sunayana Sitaram, Sandipan Dandapat, Kalika Bali
Training language models for Code-mixed (CM) language is known to be a difficult problem because of lack of data compounded by the increased confusability due to the presence of more than one language.
no code implementations • WS 2018 • Sai Krishna Rallabandi, Sunayana Sitaram, Alan W. Black
We hypothesize that it may be useful for an ASR system to be able to first detect the switching style of a particular utterance from acoustics, and then use specialized language models or other adaptation techniques for decoding the speech.
no code implementations • NAACL 2016 • Yulia Tsvetkov, Sunayana Sitaram, Manaal Faruqui, Guillaume Lample, Patrick Littell, David Mortensen, Alan W. Black, Lori Levin, Chris Dyer
We introduce polyglot language models, recurrent neural network models trained to predict symbol sequences in many different languages using shared representations of symbols and conditioning on typological information about the language to be predicted.
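Conditioning on typological information can be sketched as concatenating a per-language feature vector to every symbol embedding before it enters the recurrent model; the embeddings and typology features below are made-up illustrations, not the paper's actual representations.

```python
import numpy as np

# Toy symbol embeddings and per-language typology vectors (illustrative).
symbol_emb = {"a": np.array([0.1, 0.2]), "b": np.array([0.3, 0.4])}
typology = {
    "eng": np.array([1.0, 0.0]),  # e.g. a word-order feature flag
    "hin": np.array([0.0, 1.0]),
}

def conditioned_inputs(seq, lang):
    """Concatenate the language's typology vector onto each symbol
    embedding, yielding the conditioned input sequence for the RNN."""
    return np.stack(
        [np.concatenate([symbol_emb[s], typology[lang]]) for s in seq]
    )

x = conditioned_inputs("ab", "hin")  # one row per symbol
```

Sharing the symbol embeddings across languages while varying only the typology vector is what lets a single model transfer to typologically similar languages.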
no code implementations • LREC 2016 • Sunayana Sitaram, Alan W. Black
Most Text to Speech (TTS) systems today assume that the input text is in a single language and is written in the same language that the text needs to be synthesized in.