no code implementations • SIGUL (LREC) 2022 • A. Seza Doğruöz, Sunayana Sitaram
There is a growing interest in building language technologies (LTs) for low resource languages (LRLs).
no code implementations • EACL (HumEval) 2021 • Shaily Bhatt, Rahul Jain, Sandipan Dandapat, Sunayana Sitaram
We conduct experiments for evaluating an offensive content detection system and use a data augmentation technique for improving the model using insights from Checklist.
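The CheckList-style augmentation mentioned above can be sketched with template expansion: perturbed variants of labeled examples are generated by filling slots with candidate terms. The templates and filler terms below are purely illustrative, not from the paper.

```python
import re

# Illustrative templates and fill-in terms (not the paper's actual data).
TEMPLATES = [
    "You are such a {adj} {noun}.",
    "All {group} are {adj}.",
]

FILLERS = {
    "adj": ["terrible", "useless"],
    "noun": ["person", "driver"],
    "group": ["people", "neighbors"],
}

def expand(template, fillers):
    """Recursively fill every {slot} placeholder with each candidate term."""
    m = re.search(r"\{(\w+)\}", template)
    if not m:
        return [template]
    out = []
    for term in fillers[m.group(1)]:
        out.extend(expand(template.replace(m.group(0), term, 1), fillers))
    return out

# Every template expands into all slot combinations, yielding augmented data.
augmented = [s for t in TEMPLATES for s in expand(t, FILLERS)]
```

Each generated sentence inherits the template's label, giving cheap labeled variants for fine-tuning the detection model.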
no code implementations • 2 Apr 2024 • Rishav Hada, Varun Gumma, Mohamed Ahmed, Kalika Bali, Sunayana Sitaram
This dataset is created specifically to evaluate LLM-based evaluators, which we refer to as meta-evaluation (METAL).
no code implementations • 1 Mar 2024 • Nishanth Chandran, Sunayana Sitaram, Divya Gupta, Rahul Sharma, Kashish Mittal, Manohar Swaminathan
To solve this problem, we propose Private Benchmarking, a solution where test datasets are kept private and models are evaluated without revealing the test data to the model.
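The core idea can be illustrated with a toy protocol in which an evaluator holds the test set and releases only an aggregate score; the class and protocol below are a minimal sketch for intuition only — the paper's actual solutions rely on confidential computing and secure multi-party computation, not plain Python objects.

```python
class PrivateEvaluator:
    """Toy sketch of private benchmarking: the evaluator holds the test
    examples and never exposes them; only an aggregate score is released."""

    def __init__(self, examples):
        # examples: list of (input_text, gold_label) pairs, kept private
        self._examples = examples

    def evaluate(self, predict_fn):
        """Query the model on each held-out input; return only the accuracy."""
        correct = sum(predict_fn(x) == y for x, y in self._examples)
        return correct / len(self._examples)

# Toy "benchmark" and a trivial model that always answers "4".
evaluator = PrivateEvaluator([("2+2", "4"), ("3+3", "6")])
score = evaluator.evaluate(lambda x: "4")
```

The model owner only ever sees inputs (or, in stricter settings, nothing at all) and the final score, so the test labels cannot leak into training data.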
no code implementations • 23 Feb 2024 • Agrima Seth, Sanchit Ahuja, Kalika Bali, Sunayana Sitaram
Generative models are increasingly being used in various applications, such as text generation, commonsense reasoning, and question-answering.
no code implementations • 21 Feb 2024 • Ashutosh Sathe, Prachi Jain, Sunayana Sitaram
Large vision-language models (VLMs) are being widely adopted in industry and academia.
no code implementations • 12 Feb 2024 • Prachi Jain, Ashutosh Sathe, Varun Gumma, Kabir Ahuja, Sunayana Sitaram
In this work, we aim to modularly debias a pretrained language model across multiple dimensions.
no code implementations • 15 Jan 2024 • Divyanshu Aggarwal, Ashutosh Sathe, Ishaan Watts, Sunayana Sitaram
Prior work on multilingual evaluation has shown that there is a large gap between the performance of LLMs on English and other languages.
no code implementations • 13 Nov 2023 • Sanchit Ahuja, Divyanshu Aggarwal, Varun Gumma, Ishaan Watts, Ashutosh Sathe, Millicent Ochieng, Rishav Hada, Prachi Jain, Maxamed Axmed, Kalika Bali, Sunayana Sitaram
We also perform a study on data contamination and find that several models are likely to be contaminated with multilingual evaluation benchmarks, necessitating approaches to detect and handle contamination while assessing the multilingual performance of LLMs.
no code implementations • 31 Oct 2023 • A. Seza Doğruöz, Sunayana Sitaram, Zheng-Xin Yong
Multilingualism is widespread around the world and code-switching (CSW) is a common practice among different language pairs/tuples across locations and regions.
1 code implementation • 8 Oct 2023 • Hemant Yadav, Erica Cooper, Junichi Yamagishi, Sunayana Sitaram, Rajiv Ratn Shah
That is, the partial rank similarity (PRS) is measured, rather than the individual MOS values as with an L1 loss.
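One way a pairwise rank-based objective of this kind can be written is as a hinge on disagreements between predicted and target orderings; this is an illustrative formulation, not necessarily the paper's exact PRS loss.

```python
def pairwise_rank_loss(pred, target):
    """Penalize pairs whose predicted ordering disagrees with the target
    MOS ordering (illustrative; the paper's PRS formulation may differ)."""
    loss, pairs = 0.0, 0
    for i in range(len(pred)):
        for j in range(i + 1, len(pred)):
            d_true = target[i] - target[j]
            d_pred = pred[i] - pred[j]
            # hinge: zero when the pairwise differences share a sign
            loss += max(0.0, -d_true * d_pred)
            pairs += 1
    return loss / max(pairs, 1)
```

Unlike an L1 loss on absolute MOS values, this objective is unaffected by a constant shift in the predictions, which matters because raters anchor their scores differently across listening tests.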
no code implementations • 14 Sep 2023 • Rishav Hada, Varun Gumma, Adrian de Wynter, Harshita Diddee, Mohamed Ahmed, Monojit Choudhury, Kalika Bali, Sunayana Sitaram
Large Language Models (LLMs) excel in various Natural Language Processing (NLP) tasks, yet their evaluation, particularly in languages beyond the top 20, remains inadequate due to the limitations of existing benchmarks and metrics.
no code implementations • 4 Jul 2023 • Aniket Vashishtha, Kabir Ahuja, Sunayana Sitaram
While understanding and removing gender biases in language models has been a long-standing problem in Natural Language Processing, prior research work has primarily been limited to English.
no code implementations • 28 May 2023 • Akshay Nambi, Vaibhav Balloli, Mercy Ranjit, Tanuja Ganu, Kabir Ahuja, Sunayana Sitaram, Kalika Bali
Our results show substantial advancements in multilingual understanding and generation across a diverse range of languages.
1 code implementation • 22 Mar 2023 • Kabir Ahuja, Harshita Diddee, Rishav Hada, Millicent Ochieng, Krithika Ramesh, Prachi Jain, Akshay Nambi, Tanuja Ganu, Sameer Segal, Maxamed Axmed, Kalika Bali, Sunayana Sitaram
Most studies on generative LLMs have been restricted to English, and it is unclear how capable these models are at understanding and generating text in other languages.
no code implementations • 4 Mar 2023 • Shanu Kumar, Abbaraju Soujanya, Sandipan Dandapat, Sunayana Sitaram, Monojit Choudhury
Zero-shot cross-lingual transfer is promising; however, it has been shown to be sub-optimal, with inferior transfer performance for low-resource languages.
no code implementations • 24 Feb 2023 • Krithika Ramesh, Sunayana Sitaram, Monojit Choudhury
With language models becoming increasingly ubiquitous, it has become essential to address their inequitable treatment of diverse demographic groups and factors.
no code implementations • ACL 2021 • A. Seza Doğruöz, Sunayana Sitaram, Barbara E. Bullock, Almeida Jacqueline Toribio
To fill this gap, we offer a survey of code-switching (C-S) covering the literature in linguistics with a reflection on the key issues in language technologies.
no code implementations • 22 Nov 2022 • Injy Hamed, Amir Hussein, Oumnia Chellah, Shammur Chowdhury, Hamdy Mubarak, Sunayana Sitaram, Nizar Habash, Ahmed Ali
Code-switching poses a number of challenges and opportunities for multilingual automatic speech recognition.
1 code implementation • 21 Oct 2022 • Kabir Ahuja, Sunayana Sitaram, Sandipan Dandapat, Monojit Choudhury
Massively Multilingual Language Models (MMLMs) have recently gained popularity due to their surprising effectiveness in cross-lingual transfer.
no code implementations • nlppower (ACL) 2022 • Kabir Ahuja, Sandipan Dandapat, Sunayana Sitaram, Monojit Choudhury
Although recent Massively Multilingual Language Models (MMLMs) like mBERT and XLMR support around 100 languages, most existing multilingual NLP benchmarks provide evaluation data in only a handful of these languages with little linguistic diversity.
no code implementations • 24 Mar 2022 • Karthikeyan K, Shaily Bhatt, Pankaj Singh, Somak Aditya, Sandipan Dandapat, Sunayana Sitaram, Monojit Choudhury
We compare the TEA CheckLists with CheckLists created with different levels of human intervention.
no code implementations • LREC 2022 • Hemant Yadav, Sunayana Sitaram
Although Automatic Speech Recognition (ASR) systems have achieved human-like performance for a few languages, the majority of the world's languages do not have usable systems due to the lack of large speech datasets to train these models.
no code implementations • 17 Oct 2021 • Anirudh Srinivasan, Sunayana Sitaram, Tanuja Ganu, Sandipan Dandapat, Kalika Bali, Monojit Choudhury
Recent advancements in NLP have given us models like mBERT and XLMR that can serve over 100 languages.
no code implementations • ICON 2021 • Shaily Bhatt, Poonam Goyal, Sandipan Dandapat, Monojit Choudhury, Sunayana Sitaram
Deep Contextual Language Models (LMs) like ELMo, BERT, and their successors dominate the landscape of Natural Language Processing due to their ability to scale across multiple tasks rapidly by pre-training a single model, followed by task-specific fine-tuning.
1 code implementation • EACL 2021 • Mohd Sanad Zaki Rizvi, Anirudh Srinivasan, Tanuja Ganu, Monojit Choudhury, Sunayana Sitaram
Code-mixing is common in multilingual communities around the world, and processing it is challenging due to the lack of labeled and unlabeled data.
1 code implementation • 1 Apr 2021 • Anuj Diwan, Rakesh Vaideeswaran, Sanket Shah, Ankita Singh, Srinivasa Raghavan, Shreya Khare, Vinit Unni, Saurabh Vyas, Akash Rajpuria, Chiranjeevi Yarra, Ashish Mittal, Prasanta Kumar Ghosh, Preethi Jyothi, Kalika Bali, Vivek Seshadri, Sunayana Sitaram, Samarth Bharadwaj, Jai Nanavati, Raoul Nanavati, Karthik Sankaranarayanan, Tejaswi Seeram, Basil Abraham
For this purpose, we provide a total of ~600 hours of transcribed speech data, comprising train and test sets, in these languages, including two code-switched language pairs, Hindi-English and Bengali-English.
1 code implementation • 25 Nov 2020 • Hemant Yadav, Atul Anshuman Singh, Rachit Mittal, Sunayana Sitaram, Yi Yu, Rajiv Ratn Shah
Training a robust system, e.g., Speech to Text (STT), requires large datasets.
no code implementations • 12 Nov 2020 • Sanket Shah, Satarupa Guha, Simran Khanuja, Sunayana Sitaram
Since no publicly available dataset exists for Spoken Term Detection in these languages, we create a new dataset using a publicly available TTS dataset.
no code implementations • ACL 2020 • Simran Khanuja, Sandipan Dandapat, Anirudh Srinivasan, Sunayana Sitaram, Monojit Choudhury
We present results on all these tasks using cross-lingual word embedding models and multilingual models.
no code implementations • 9 Jun 2020 • Gurunath Reddy Madhumani, Sanket Shah, Basil Abraham, Vikas Joshi, Sunayana Sitaram
Recently, we showed that monolingual ASR systems fine-tuned on code-switched data deteriorate in performance on monolingual speech recognition. This is undesirable, as ASR systems deployed in multilingual scenarios should recognize both monolingual and code-switched speech with high accuracy.
no code implementations • 1 Jun 2020 • Sanket Shah, Basil Abraham, Gurunath Reddy M, Sunayana Sitaram, Vikas Joshi
In this work, we show that fine-tuning ASR models on code-switched speech harms performance on monolingual speech.
no code implementations • LREC 2020 • Basil Abraham, Danish Goel, Divya Siddarth, Kalika Bali, Manu Chopra, Monojit Choudhury, Pratik Joshi, Preethi Jyothi, Sunayana Sitaram, Vivek Seshadri
Unfortunately, collecting labelled speech data in any language is an expensive and resource-intensive task.
no code implementations • 26 Apr 2020 • Simran Khanuja, Sandipan Dandapat, Anirudh Srinivasan, Sunayana Sitaram, Monojit Choudhury
We present results on all these tasks using cross-lingual word embedding models and multilingual models.
no code implementations • LREC 2020 • Simran Khanuja, Sandipan Dandapat, Sunayana Sitaram, Monojit Choudhury
Code-mixing is the use of more than one language in the same conversation or utterance, and is prevalent in multilingual communities all over the world.
no code implementations • ICON 2019 • Pratik Joshi, Christain Barnes, Sebastin Santy, Simran Khanuja, Sanket Shah, Anirudh Srinivasan, Satwik Bhattamishra, Sunayana Sitaram, Monojit Choudhury, Kalika Bali
In this paper, we examine and analyze the challenges associated with developing and introducing language technologies to low-resource language communities.
no code implementations • WS 2019 • Sanket Shah, Pratik Joshi, Sebastin Santy, Sunayana Sitaram
Code-switching refers to the alternation of two or more languages in a conversation or utterance and is common in multilingual communities across the world.
no code implementations • 22 Jun 2019 • Brij Mohan Lal Srivastava, Basil Abraham, Sunayana Sitaram, Rupesh Mehta, Preethi Jyothi
While the lack of data adversely affects the performance of end-to-end models, we see promising improvements with MTL and balancing the corpus.
no code implementations • 25 Mar 2019 • Sunayana Sitaram, Khyathi Raghavi Chandu, Sai Krishna Rallabandi, Alan W. Black
Code-switching, the alternation of languages within a conversation or utterance, is a common communicative phenomenon that occurs in multilingual communities across the world.
no code implementations • EMNLP 2018 • Adithya Pratapa, Monojit Choudhury, Sunayana Sitaram
We compare three existing bilingual word embedding approaches, and a novel approach of training skip-grams on synthetic code-mixed text generated through linguistic models of code-mixing, on two tasks - sentiment analysis and POS tagging for code-mixed text.
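The approach of training skip-grams on synthetic code-mixed text can be sketched by generating (center, context) training pairs from tokenized sentences; the toy Hindi-English sentence and window size below are illustrative only, not the paper's generated data.

```python
def skipgram_pairs(sentences, window=2):
    """Generate (center, context) pairs for skip-gram training from
    tokenized sentences."""
    pairs = []
    for tokens in sentences:
        for i, center in enumerate(tokens):
            lo, hi = max(0, i - window), min(len(tokens), i + window + 1)
            for j in range(lo, hi):
                if j != i:
                    pairs.append((center, tokens[j]))
    return pairs

# Toy "synthetic" Hindi-English code-mixed sentence (illustrative only).
synthetic = [["mujhe", "yeh", "movie", "bahut", "pasand", "hai"]]
pairs = skipgram_pairs(synthetic, window=1)
```

Because the synthetic sentences mix both languages, words from the two languages co-occur in the same contexts, which is what lets the resulting embeddings serve code-mixed downstream tasks.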
no code implementations • WS 2018 • Sunit Sivasankaran, Brij Mohan Lal Srivastava, Sunayana Sitaram, Kalika Bali, Monojit Choudhury
Though the best performance gain of 1.2% WER was observed with manually merged phones, we show experimentally that the manual phone merge is not optimal.
no code implementations • ACL 2018 • Adithya Pratapa, Gayatri Bhat, Monojit Choudhury, Sunayana Sitaram, Sandipan Dandapat, Kalika Bali
Training language models for Code-mixed (CM) language is known to be a difficult problem because of lack of data compounded by the increased confusability due to the presence of more than one language.
no code implementations • WS 2018 • Sai Krishna Rallabandi, Sunayana Sitaram, Alan W. Black
We hypothesize that it may be useful for an ASR system to be able to first detect the switching style of a particular utterance from acoustics, and then use specialized language models or other adaptation techniques for decoding the speech.
no code implementations • NAACL 2016 • Yulia Tsvetkov, Sunayana Sitaram, Manaal Faruqui, Guillaume Lample, Patrick Littell, David Mortensen, Alan W. Black, Lori Levin, Chris Dyer
We introduce polyglot language models, recurrent neural network models trained to predict symbol sequences in many different languages using shared representations of symbols and conditioning on typological information about the language to be predicted.
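Conditioning on typological information can be sketched as concatenating a per-language feature vector to every symbol embedding before it enters the recurrent model; the embeddings and typology features below are made-up illustrations, not the paper's actual representations.

```python
import numpy as np

# Toy symbol embeddings and per-language typology vectors (illustrative).
symbol_emb = {"a": np.array([0.1, 0.2]), "b": np.array([0.3, 0.4])}
typology = {
    "eng": np.array([1.0, 0.0]),  # e.g. a word-order feature flag
    "hin": np.array([0.0, 1.0]),
}

def conditioned_inputs(seq, lang):
    """Concatenate the language's typology vector onto each symbol
    embedding, yielding the conditioned input sequence for the RNN."""
    return np.stack(
        [np.concatenate([symbol_emb[s], typology[lang]]) for s in seq]
    )

x = conditioned_inputs("ab", "hin")  # one row per symbol
```

Sharing the symbol embeddings across languages while varying only the typology vector is what lets a single model transfer to typologically similar languages.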
no code implementations • LREC 2016 • Sunayana Sitaram, Alan W. Black
Most Text to Speech (TTS) systems today assume that the input text is in a single language and is written in the same language that the text needs to be synthesized in.