Search Results for author: Layne Berry

Found 6 papers, 4 papers with code

SpeechCLIP+: Self-supervised multi-task representation learning for speech via CLIP and speech-image data

1 code implementation • 10 Feb 2024 • Hsuan-Fu Wang, Yi-Jen Shih, Heng-Jui Chang, Layne Berry, Puyuan Peng, Hung-Yi Lee, Hsin-Min Wang, David Harwath

Second, we propose a new hybrid architecture that merges the cascaded and parallel architectures of SpeechCLIP into a multi-task learning framework.

Keyword Extraction Multi-Task Learning +2

Integrating Self-supervised Speech Model with Pseudo Word-level Targets from Visually-grounded Speech Model

no code implementations • 8 Feb 2024 • Hung-Chieh Fang, Nai-Xuan Ye, Yi-Jen Shih, Puyuan Peng, Hsuan-Fu Wang, Layne Berry, Hung-Yi Lee, David Harwath

Recent advances in self-supervised speech models have shown significant improvement in many downstream tasks.

Spoken Language Understanding

AV-SUPERB: A Multi-Task Evaluation Benchmark for Audio-Visual Representation Models

1 code implementation • 19 Sep 2023 • Yuan Tseng, Layne Berry, Yi-Ting Chen, I-Hsiang Chiu, Hsuan-Hao Lin, Max Liu, Puyuan Peng, Yi-Jen Shih, Hung-Yu Wang, Haibin Wu, Po-Yao Huang, Chun-Mao Lai, Shang-Wen Li, David Harwath, Yu Tsao, Shinji Watanabe, Abdelrahman Mohamed, Chi-Luen Feng, Hung-Yi Lee

Audio-visual representation learning aims to develop systems with human-like perception by utilizing correlation between auditory and visual information.

Audio-Visual Learning Representation Learning

M-SpeechCLIP: Leveraging Large-Scale, Pre-Trained Models for Multilingual Speech to Image Retrieval

no code implementations • 2 Nov 2022 • Layne Berry, Yi-Jen Shih, Hsuan-Fu Wang, Heng-Jui Chang, Hung-Yi Lee, David Harwath

This work investigates the use of large-scale, English-only pre-trained models (CLIP and HuBERT) for multilingual image-speech retrieval.

Image Retrieval Retrieval +1

Why is Winoground Hard? Investigating Failures in Visuolinguistic Compositionality

1 code implementation • 1 Nov 2022 • Anuj Diwan, Layne Berry, Eunsol Choi, David Harwath, Kyle Mahowald

Recent visuolinguistic pre-trained models show promising progress on various end tasks such as image retrieval and video captioning.

Data Augmentation Image Retrieval +2

SpeechCLIP: Integrating Speech with Pre-Trained Vision and Language Model

1 code implementation • 3 Oct 2022 • Yi-Jen Shih, Hsuan-Fu Wang, Heng-Jui Chang, Layne Berry, Hung-Yi Lee, David Harwath

Data-driven speech processing models usually perform well with a large amount of text supervision, but collecting transcribed speech data is costly.

Language Modelling Retrieval +1
