no code implementations • EMNLP (CMCL) 2020 • Kate McCurdy, Adam Lopez, Sharon Goldwater
Grammatical gender is a consistent and informative cue to the plural class of German nouns.
no code implementations • ACL (SIGMORPHON) 2021 • Kate McCurdy, Sharon Goldwater, Adam Lopez
This work describes the Edinburgh submission to the SIGMORPHON 2021 Shared Task 2 on unsupervised morphological paradigm clustering.
1 code implementation • 20 Oct 2023 • Amr Keleg, Sharon Goldwater, Walid Magdy
Transcribed speech and user-generated text in Arabic typically contain a mixture of Modern Standard Arabic (MSA), the standardized language taught in schools, and Dialectal Arabic (DA), used in daily communications.
no code implementations • 3 Jun 2023 • Ramon Sanabria, Ondrej Klejch, Hao Tang, Sharon Goldwater
Acoustic word embeddings are typically created by training a pooling function using pairs of word-like units.
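The core idea above, collapsing a variable-length segment into a fixed-dimensional vector, can be illustrated with the simplest possible pooling function. This is a hedged sketch, not the paper's method: the paper trains the pooling function from pairs of word-like units, whereas here we use untrained mean pooling and cosine similarity just to show the interface.

```python
import numpy as np

def acoustic_word_embedding(frames: np.ndarray) -> np.ndarray:
    """Collapse a variable-length sequence of acoustic frames
    (shape: n_frames x n_features) into one fixed-dimensional
    vector by mean pooling over time. Illustrative baseline only;
    learned pooling functions replace this in practice."""
    return frames.mean(axis=0)

def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
    """Compare two embeddings, e.g. to decide whether two
    segments are instances of the same word."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

# Two segments of different lengths map to same-sized embeddings.
rng = np.random.default_rng(0)
seg_a = rng.normal(size=(37, 13))   # 37 frames of 13-dim features
seg_b = rng.normal(size=(52, 13))   # 52 frames of 13-dim features
emb_a = acoustic_word_embedding(seg_a)
emb_b = acoustic_word_embedding(seg_b)
```

Once segments share an embedding space, same/different word decisions reduce to a similarity threshold, which is what the pair-based training setup optimises for.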
no code implementations • 21 May 2023 • Oli Liu, Hao Tang, Sharon Goldwater
Self-supervised speech representations are known to encode both speaker and phonetic information, but how they are distributed in the high-dimensional space remains largely unexplored.
1 code implementation • 23 Feb 2023 • Elizabeth Nielsen, Sharon Goldwater, Mark Steedman
Parsing spoken dialogue presents challenges that parsing text does not, including a lack of clear sentence boundaries.
no code implementations • 28 Oct 2022 • Ramon Sanabria, Hao Tang, Sharon Goldwater
Given the strong results of self-supervised models on various tasks, there have been surprisingly few studies exploring self-supervised representations for acoustic word embeddings (AWE), fixed-dimensional vectors representing variable-length spoken word segments.
2 code implementations • 22 Sep 2021 • Ida Szubert, Omri Abend, Nathan Schneider, Samuel Gibbon, Louis Mahon, Sharon Goldwater, Mark Steedman
We then demonstrate the utility of the compiled corpora through (1) a longitudinal corpus study of the prevalence of different syntactic and semantic phenomena in the CDS, and (2) applying an existing computational model of language acquisition to the two corpora and briefly comparing the results across languages.
no code implementations • EMNLP (insights) 2021 • Ramon Sanabria, Hao Tang, Sharon Goldwater
Word segmentation, the problem of finding word boundaries in speech, is of interest for a range of tasks.
no code implementations • ACL 2021 • Elizabeth Nielsen, Mark Steedman, Sharon Goldwater
We investigate how prosody affects a parser that receives an entire dialogue turn as input (a turn-based model), instead of gold standard pre-segmented SUs (an SU-based model).
no code implementations • 12 May 2021 • Alexander Robertson, Walid Magdy, Sharon Goldwater
Research in sociology and linguistics shows that people use language not only to express their own identity but also to understand the identity of others.
no code implementations • 7 May 2021 • Alexander Robertson, Walid Magdy, Sharon Goldwater
Prior work has shown that Twitter users use skin-toned emoji as an act of self-representation to express their racial/ethnic identity.
no code implementations • EACL 2021 • Yevgen Matusevych, Herman Kamper, Thomas Schatz, Naomi H. Feldman, Sharon Goldwater
We then test the model on a spoken word processing task, showing that phonology may not be necessary to explain some of the word processing effects observed in non-native speakers.
no code implementations • 21 Oct 2020 • Aibek Makazhanov, Sharon Goldwater, Adam Lopez
We present LemMED, a character-level encoder-decoder for contextual morphological analysis (combined lemmatization and tagging).
no code implementations • 6 Aug 2020 • Yevgen Matusevych, Thomas Schatz, Herman Kamper, Naomi H. Feldman, Sharon Goldwater
In the first year of life, infants' speech perception becomes attuned to the sounds of their native language.
1 code implementation • 2 Jun 2020 • Herman Kamper, Yevgen Matusevych, Sharon Goldwater
We consider three multilingual recurrent neural network (RNN) models: a classifier trained on the joint vocabularies of all training languages; a Siamese RNN trained to discriminate between same and different words from multiple languages; and a correspondence autoencoder (CAE) RNN trained to reconstruct word pairs.
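The Siamese objective in the second model can be sketched with a minimal margin-based loss: embeddings of same-word pairs are pulled together, different-word pairs pushed apart. This is an illustrative NumPy sketch under simplifying assumptions, not the authors' implementation; in particular, the RNN encoder is replaced here by a fixed mean-pooling stand-in, and the margin value is arbitrary.

```python
import numpy as np

def embed(frames: np.ndarray) -> np.ndarray:
    # Stand-in encoder: mean-pool frames and L2-normalise.
    # In the paper this is a trained RNN.
    v = frames.mean(axis=0)
    return v / np.linalg.norm(v)

def siamese_margin_loss(anchor, same, diff, margin=0.5):
    """Hinge loss on cosine distances: a same-word pair should be
    closer than a different-word pair by at least `margin`."""
    d_same = 1.0 - np.dot(embed(anchor), embed(same))
    d_diff = 1.0 - np.dot(embed(anchor), embed(diff))
    return max(0.0, margin + d_same - d_diff)

rng = np.random.default_rng(1)
anchor = rng.normal(size=(40, 13))
loss = siamese_margin_loss(anchor, anchor + 0.01, rng.normal(size=(30, 13)))
```

The classifier and correspondence-autoencoder variants differ only in the training signal (word labels and paired reconstructions, respectively); all three produce the same kind of fixed-dimensional embedding.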
no code implementations • ACL 2020 • Kate McCurdy, Sharon Goldwater, Adam Lopez
Encoder-decoder models do generalize the most frequently produced plural class, but do not show human-like variability or 'regular' extension of these other plural markers.
no code implementations • EMNLP 2020 • Elizabeth Nielsen, Mark Steedman, Sharon Goldwater
We find that these innovations lead to an improvement from 87.5% to 88.7% accuracy on pitch accent detection on American English speech in the Boston University Radio News Corpus, a state-of-the-art result.
no code implementations • 3 Apr 2020 • Yevgen Matusevych, Herman Kamper, Sharon Goldwater
To better understand the applications of AWEs in various downstream tasks and in cognitive modeling, we need to analyze the representation spaces of AWEs.
1 code implementation • 6 Feb 2020 • Herman Kamper, Yevgen Matusevych, Sharon Goldwater
Acoustic word embeddings are fixed-dimensional representations of variable-length speech segments.
no code implementations • 23 Oct 2019 • Mihaela C. Stoian, Sameer Bansal, Sharon Goldwater
Previous work has shown that for low-resource source languages, automatic speech-to-text translation (AST) can be improved by pretraining an end-to-end model on automatic speech recognition (ASR) data from a high-resource language.
no code implementations • 29 Aug 2019 • Sameer Bansal, Herman Kamper, Adam Lopez, Sharon Goldwater
Given a large amount of unannotated speech in a low-resource language, can we classify the speech utterances by topic?
no code implementations • ACL 2019 • Maria Corkery, Yevgen Matusevych, Sharon Goldwater
The cognitive mechanisms needed to account for the English past tense have long been a subject of debate in linguistics and cognitive science.
1 code implementation • NAACL 2019 • Toms Bergmanis, Sharon Goldwater
Lemmatization aims to reduce the sparse data problem by relating the inflected forms of a word to its dictionary form.
1 code implementation • 2 Apr 2019 • Toms Bergmanis, Sharon Goldwater
Lemmatization aims to reduce the sparse data problem by relating the inflected forms of a word to its dictionary form.
1 code implementation • 9 Nov 2018 • Enno Hermann, Herman Kamper, Sharon Goldwater
Here we directly compare multiple methods, including some that use only target language speech data and some that use transcribed speech from other (non-target) languages, and we evaluate using two intrinsic measures as well as on a downstream unsupervised word segmentation and clustering task.
1 code implementation • WS 2018 • Philippa Shoemark, James Kirby, Sharon Goldwater
Sociolinguistics is often concerned with how variants of a linguistic item (e.g., "nothing" vs. "nothin'") are used by different groups or in different situations.
1 code implementation • NAACL 2019 • Sameer Bansal, Herman Kamper, Karen Livescu, Adam Lopez, Sharon Goldwater
Finally, we show that the approach improves performance on a true low-resource task: pre-training on a combination of English ASR and French ASR improves Mboshi-French ST, where only 4 hours of data are available, from 3.5 to 7.1 BLEU.
no code implementations • NAACL 2018 • Toms Bergmanis, Sharon Goldwater
The main motivation for developing contextsensitive lemmatizers is to improve performance on unseen and ambiguous words.
no code implementations • NAACL 2018 • Alexander Robertson, Sharon Goldwater
We highlight several issues in the evaluation of historical text normalization systems that make it hard to tell how well these systems would actually work in practice, i.e., for new datasets or languages; in comparison to more naïve systems; or as a preprocessing step for downstream NLP tools.
no code implementations • 24 Mar 2018 • Sameer Bansal, Herman Kamper, Karen Livescu, Adam Lopez, Sharon Goldwater
We explore models trained on between 20 and 160 hours of data, and find that although models trained on less data have considerably lower BLEU scores, they can still predict words with relatively high precision and recall: around 50% for a model trained on 50 hours of data, versus around 60% for the full 160-hour model.
1 code implementation • 23 Mar 2018 • Enno Hermann, Sharon Goldwater
How can we effectively develop speech technology for languages where no transcribed data is available?
no code implementations • WS 2017 • Antonios Anastasopoulos, Sameer Bansal, David Chiang, Sharon Goldwater, Adam Lopez
Vast amounts of speech data collected for language documentation and research remain untranscribed and unsearchable, but often a small amount of speech may have text translations available.
no code implementations • WS 2017 • Philippa Shoemark, James Kirby, Sharon Goldwater
Sociolinguistic research suggests that speakers modulate their language style in response to their audience.
no code implementations • EACL 2017 • Toms Bergmanis, Sharon Goldwater
A major motivation for unsupervised morphological analysis is to reduce the sparse data problem in under-resourced languages.
no code implementations • EACL 2017 • Philippa Shoemark, Debnil Sur, Luke Shrimpton, Iain Murray, Sharon Goldwater
Political surveys have indicated a relationship between a sense of Scottish identity and voting decisions in the 2014 Scottish Independence Referendum.
2 code implementations • 23 Mar 2017 • Herman Kamper, Karen Livescu, Sharon Goldwater
Unsupervised segmentation and clustering of unlabelled speech are core problems in zero-resource speech processing.
no code implementations • EACL 2017 • Sameer Bansal, Herman Kamper, Adam Lopez, Sharon Goldwater
We explore the problem of translating speech to text in low-resource scenarios where neither automatic speech recognition (ASR) nor machine translation (MT) are available, but we have training data in the form of audio paired with text translations.
no code implementations • 21 Sep 2016 • Sameer Bansal, Herman Kamper, Sharon Goldwater, Adam Lopez
Recent work on unsupervised term discovery (UTD) aims to identify and cluster repeated word-like units from audio alone.
5 code implementations • 22 Jun 2016 • Herman Kamper, Aren Jansen, Sharon Goldwater
We also show that the discovered clusters can be made less speaker- and gender-specific by using an unsupervised autoencoder-like feature extractor to learn better frame-level features (prior to embedding).
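The bottleneck idea behind an autoencoder-like feature extractor can be shown in a few lines. This is a deliberately simplified linear sketch, not the paper's extractor: a linear autoencoder minimising squared reconstruction error is equivalent to PCA, so we compute the bottleneck directly with an SVD instead of training a network.

```python
import numpy as np

def linear_autoencoder_features(frames: np.ndarray, k: int):
    """Project frames onto a k-dimensional bottleneck and
    reconstruct them. The bottleneck codes are the new
    frame-level features; reconstruction quality shows how
    much information the bottleneck retains."""
    mu = frames.mean(axis=0)
    X = frames - mu
    _, _, Vt = np.linalg.svd(X, full_matrices=False)
    codes = X @ Vt[:k].T           # bottleneck features (n x k)
    recon = codes @ Vt[:k] + mu    # reconstruction from bottleneck
    return codes, recon

rng = np.random.default_rng(0)
frames = rng.normal(size=(100, 13))
codes, recon = linear_autoencoder_features(frames, k=5)
```

A nonlinear extractor trained on aligned frame pairs (as in the paper) goes further, encouraging the bottleneck to discard nuisance factors such as speaker and gender rather than just low-variance directions.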
no code implementations • 9 Mar 2016 • Herman Kamper, Aren Jansen, Sharon Goldwater
In settings where only unlabelled speech data is available, speech technology needs to be developed without transcriptions, pronunciation dictionaries, or language modelling text.
no code implementations • TACL 2013 • John K Pate, Sharon Goldwater
Unsupervised parsing is a difficult task that infants readily perform.
no code implementations • TACL 2013 • Kairit Sirts, Sharon Goldwater
This paper explores the use of Adaptor Grammars, a nonparametric Bayesian modelling framework, for minimally supervised morphological segmentation.