Search Results for author: Puyuan Peng

Found 15 papers, 9 papers with code

VoiceCraft: Zero-Shot Speech Editing and Text-to-Speech in the Wild

2 code implementations • 25 Mar 2024 • Puyuan Peng, Po-Yao Huang, Abdelrahman Mohamed, David Harwath

We introduce VoiceCraft, a token-infilling neural codec language model that achieves state-of-the-art performance on both speech editing and zero-shot text-to-speech (TTS) on audiobooks, internet videos, and podcasts.

Language Modelling
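
The core trick is a token rearrangement: the span to be edited is replaced by a mask token and moved to the end of the codec-token sequence, so a causal language model can generate it while attending to context on both sides. A minimal sketch, assuming a single codebook and hypothetical MASK/EOS ids (VoiceCraft itself operates on multiple codebooks with a delayed stacking pattern):

```python
import torch

MASK = 1024  # hypothetical mask-token id outside the codec vocabulary
EOS = 1025   # hypothetical end-of-infill id

def rearrange_for_infilling(tokens: torch.Tensor, start: int, end: int) -> torch.Tensor:
    """Replace tokens[start:end] with a mask token and move the span to the
    end, so a causal LM can predict the edited span given its full context."""
    prefix, span, suffix = tokens[:start], tokens[start:end], tokens[end:]
    mask = torch.tensor([MASK])
    return torch.cat([prefix, mask, suffix, mask, span, torch.tensor([EOS])])

seq = torch.arange(10)  # stand-in codec token ids 0..9
print(rearrange_for_infilling(seq, 3, 6))
# tensor([0, 1, 2, 1024, 6, 7, 8, 9, 1024, 3, 4, 5, 1025])
```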

BAT: Learning to Reason about Spatial Sounds with Large Language Models

no code implementations • 2 Feb 2024 • Zhisheng Zheng, Puyuan Peng, Ziyang Ma, Xie Chen, Eunsol Choi, David Harwath

By integrating Spatial-AST with the LLaMA-2 7B model, BAT transcends standard Sound Event Localization and Detection (SELD) tasks, enabling the model to reason about the relationships between the sounds in its environment.

Event Detection • Language Modelling +5
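
The integration the abstract mentions typically follows the standard encoder-to-LLM bridging pattern: audio embeddings are linearly projected into the LLM's token-embedding space and prepended to the text prompt. A hedged sketch with illustrative dimensions; the names and widths are assumptions, not BAT's actual interface:

```python
import torch
import torch.nn as nn

audio_dim, llm_dim = 768, 4096          # assumed encoder / LLaMA-2 7B widths
projector = nn.Linear(audio_dim, llm_dim)

audio_feats = torch.randn(1, 64, audio_dim)   # [batch, audio frames, dim]
text_embeds = torch.randn(1, 20, llm_dim)     # embedded question tokens

# The LLM consumes the concatenated sequence as "inputs_embeds".
inputs_embeds = torch.cat([projector(audio_feats), text_embeds], dim=1)
print(inputs_embeds.shape)  # torch.Size([1, 84, 4096])
```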

Style-transfer based Speech and Audio-visual Scene Understanding for Robot Action Sequence Acquisition from Videos

no code implementations • 27 Jun 2023 • Chiori Hori, Puyuan Peng, David Harwath, Xinyu Liu, Kei Ota, Siddarth Jain, Radu Corcodel, Devesh Jha, Diego Romeres, Jonathan Le Roux

This paper introduces a method for robot action sequence generation from instruction videos using (1) an audio-visual Transformer that converts audio-visual features and instruction speech to a sequence of robot actions called dynamic movement primitives (DMPs) and (2) style-transfer-based training that employs multi-task learning with video captioning and weakly-supervised learning with a semantic classifier to exploit unpaired video-action data.

Multi-Task Learning • Scene Understanding +3
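
The training setup pairs two heads on a shared encoder, one for video captioning and one for action generation, optimized with a weighted multi-task loss. A rough sketch of that shape, with all heads, dimensions, and loss weights as illustrative assumptions rather than the paper's configuration:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

shared = nn.TransformerEncoder(
    nn.TransformerEncoderLayer(d_model=256, nhead=4, batch_first=True),
    num_layers=2,
)
caption_head = nn.Linear(256, 10000)   # per-token vocabulary logits
action_head = nn.Linear(256, 64)       # DMP parameter regression

feats = torch.randn(2, 100, 256)       # fused audio-visual features (stand-in)
h = shared(feats)

caption_loss = F.cross_entropy(caption_head(h).transpose(1, 2),
                               torch.randint(0, 10000, (2, 100)))
action_loss = F.mse_loss(action_head(h.mean(dim=1)), torch.randn(2, 64))
loss = caption_loss + 0.5 * action_loss   # multi-task weighting (assumed)
```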

Syllable Discovery and Cross-Lingual Generalization in a Visually Grounded, Self-Supervised Speech Model

2 code implementations • 19 May 2023 • Puyuan Peng, Shang-Wen Li, Okko Räsänen, Abdelrahman Mohamed, David Harwath

In this paper, we show that representations capturing syllabic units emerge when training a self-supervised speech model with a visually-grounded training objective.

Language Modelling • Masked Language Modeling +3
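
A visually grounded objective of this kind is usually an InfoNCE-style contrastive loss that pulls matched speech and image embeddings together. A minimal sketch, with shapes and the temperature chosen for illustration rather than taken from the paper:

```python
import torch
import torch.nn.functional as F

def contrastive_loss(speech_emb, image_emb, temperature=0.07):
    """Symmetric InfoNCE: matched speech/image pairs sit on the diagonal."""
    speech_emb = F.normalize(speech_emb, dim=-1)
    image_emb = F.normalize(image_emb, dim=-1)
    logits = speech_emb @ image_emb.t() / temperature   # [batch, batch]
    targets = torch.arange(len(logits))
    return (F.cross_entropy(logits, targets)
            + F.cross_entropy(logits.t(), targets)) / 2

loss = contrastive_loss(torch.randn(8, 512), torch.randn(8, 512))
```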

Prompting the Hidden Talent of Web-Scale Speech Models for Zero-Shot Task Generalization

1 code implementation • 18 May 2023 • Puyuan Peng, Brian Yan, Shinji Watanabe, David Harwath

We investigate the emergent abilities of the recently proposed web-scale speech model Whisper by adapting it to unseen tasks with prompt engineering.

Audio-Visual Speech Recognition • Prompt Engineering +2
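
For a feel of what prompt-level steering of Whisper looks like, here is a hedged sketch using only the stock knobs the openai-whisper package exposes (task and language tokens, plus a textual prompt); the paper goes further by manipulating the special-token prompt directly. The audio path is a placeholder:

```python
import whisper

model = whisper.load_model("base")

# Explicit task/language tokens steer the decoder, e.g. German -> English.
out = model.transcribe("audio.wav", task="translate", language="de")
print(out["text"])

# initial_prompt prepends text to the decoding context, biasing vocabulary.
out = model.transcribe("audio.wav", initial_prompt="Glossary: SELD, VGS, SLU.")
print(out["text"])
```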

Zero-shot Video Moment Retrieval With Off-the-Shelf Models

no code implementations • 3 Nov 2022 • Anuj Diwan, Puyuan Peng, Raymond J. Mooney

For most of the machine learning community, the expense of collecting high-quality human-annotated data and the inability to efficiently fine-tune very large state-of-the-art pretrained models on limited compute are major bottlenecks to building models for new tasks.

Moment Retrieval • Retrieval
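
The recipe the title points to can be approximated with a pretrained CLIP model: score sampled video frames against the text query and keep the best-scoring window. The windowing heuristic below is illustrative, not the paper's exact pipeline, and the frame paths are placeholders:

```python
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
proc = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

frames = [Image.open(f"frame_{i:04d}.jpg") for i in range(64)]  # sampled frames
inputs = proc(text=["a person opens the fridge"], images=frames,
              return_tensors="pt", padding=True)
with torch.no_grad():
    sims = model(**inputs).logits_per_image.squeeze(1)  # one score per frame

# Slide a fixed-size window and keep the span with the highest mean score.
win = 8
scores = sims.unfold(0, win, 1).mean(dim=1)
start = int(scores.argmax())
print(f"moment ~ frames [{start}, {start + win})")
```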

MAE-AST: Masked Autoencoding Audio Spectrogram Transformer

2 code implementations • 30 Mar 2022 • Alan Baade, Puyuan Peng, David Harwath

In this paper, we propose a simple yet powerful improvement over the recent Self-Supervised Audio Spectrogram Transformer (SSAST) model for speech and audio classification.

Audio Classification
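
The MAE recipe on spectrograms: split the input into patches, drop a large fraction, encode only the visible patches, and train a light decoder to reconstruct the masked ones. A minimal sketch; the patch size and 75% mask ratio are illustrative assumptions:

```python
import torch

spec = torch.randn(1, 128, 1024)                     # [batch, mel bins, frames]
patches = spec.unfold(1, 16, 16).unfold(2, 16, 16)   # 16x16 patches
patches = patches.reshape(1, -1, 16 * 16)            # [batch, num_patches, 256]

num = patches.shape[1]
keep = int(num * 0.25)                               # mask 75% of patches
perm = torch.randperm(num)
visible, masked = perm[:keep], perm[keep:]

encoder_in = patches[:, visible]  # only visible patches reach the encoder
# A decoder would reconstruct patches[:, masked] from encoder outputs plus
# mask tokens; the loss is MSE on the masked patches only.
print(encoder_in.shape)           # torch.Size([1, 128, 256])
```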

Fast-Slow Transformer for Visually Grounding Speech

1 code implementation • 16 Sep 2021 • Puyuan Peng, David Harwath

We present Fast-Slow Transformer for Visually Grounding Speech, or FaST-VGS.

Image Retrieval • Retrieval

A Correspondence Variational Autoencoder for Unsupervised Acoustic Word Embeddings

no code implementations • 3 Dec 2020 • Puyuan Peng, Herman Kamper, Karen Livescu

We propose a new unsupervised model for mapping a variable-duration speech segment to a fixed-dimensional representation.

Word Embeddings
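
The core mapping is easy to sketch: a recurrent encoder consumes a variable-length feature sequence, and its final hidden state serves as the fixed-dimensional embedding. The correspondence training (encode one instance of a word, reconstruct a different instance of the same word) is only summarized in the comments; the dimensions below are assumptions, not the paper's configuration:

```python
import torch
import torch.nn as nn

class AcousticWordEncoder(nn.Module):
    def __init__(self, feat_dim=39, embed_dim=130):   # e.g. MFCCs + deltas (assumed)
        super().__init__()
        self.rnn = nn.GRU(feat_dim, embed_dim, batch_first=True)

    def forward(self, feats: torch.Tensor) -> torch.Tensor:
        _, h = self.rnn(feats)   # final hidden state summarizes the segment
        return h[-1]             # [batch, embed_dim], independent of duration

enc = AcousticWordEncoder()
short = enc(torch.randn(1, 50, 39))   # ~0.5 s segment
long = enc(torch.randn(1, 120, 39))   # ~1.2 s segment
print(short.shape == long.shape)      # True: fixed-dimensional either way
```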
