Search Results for author: Guangzhi Sun

Found 24 papers, 8 papers with code

M$^3$AV: A Multimodal, Multigenre, and Multipurpose Audio-Visual Academic Lecture Dataset

no code implementations • 21 Mar 2024 • Zhe Chen, Heyang Liu, Wenyi Yu, Guangzhi Sun, Hongcheng Liu, Ji Wu, Chao Zhang, Yu Wang, Yanfeng Wang

Although multiple academic video datasets have been constructed and released, few of them support both multimodal content recognition and understanding tasks, which is partially due to the lack of high-quality human annotations.

Speech Recognition +1

Speech-based Slot Filling using Large Language Models

no code implementations • 13 Nov 2023 • Guangzhi Sun, Shutong Feng, Dongcheng Jiang, Chao Zhang, Milica Gašić, Philip C. Woodland

Recent advances in large language models (LLMs) have demonstrated unprecedented capabilities across a variety of language tasks.

In-Context Learning, Slot Filling +1

SALMONN: Towards Generic Hearing Abilities for Large Language Models

1 code implementation • 20 Oct 2023 • Changli Tang, Wenyi Yu, Guangzhi Sun, Xianzhao Chen, Tian Tan, Wei Li, Lu Lu, Zejun Ma, Chao Zhang

Hearing, the perception and understanding of general auditory information comprising at least three types of sounds (speech, audio events, and music), is arguably an essential ability for artificial intelligence (AI) agents in the physical world.

Audio Captioning, Automatic Speech Recognition +10

Fine-grained Audio-Visual Joint Representations for Multimodal Large Language Models

2 code implementations • 9 Oct 2023 • Guangzhi Sun, Wenyi Yu, Changli Tang, Xianzhao Chen, Tian Tan, Wei Li, Lu Lu, Zejun Ma, Chao Zhang

Audio-visual large language models (LLMs) have drawn significant attention, yet the fine-grained combination of both input streams remains under-explored; this combination is challenging but necessary for LLMs to understand general video inputs.

Question Answering, Video Question Answering

Conditional Diffusion Model for Target Speaker Extraction

no code implementations • 7 Oct 2023 • Theodor Nguyen, Guangzhi Sun, Xianrui Zheng, Chao Zhang, Philip C. Woodland

For the reverse-time process, a parametrised score function is conditioned on a target speaker embedding to extract the target speaker from the mixture of sources.

Target Speaker Extraction
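The snippet above describes the core mechanism: during the reverse-time diffusion process, a score network is conditioned on a target speaker embedding in addition to the noisy state and the observed mixture. Below is a minimal, illustrative sketch of such a conditional score network; the architecture, feature dimensions, and conditioning scheme are assumptions for illustration, not the paper's implementation.

```python
# Minimal sketch (illustrative, not the paper's code): a score network for
# score-based target speaker extraction, conditioned on a target speaker
# embedding. Dimensions and architecture are assumptions.
import torch
import torch.nn as nn

class ConditionalScoreNet(nn.Module):
    def __init__(self, feat_dim=80, spk_dim=256, hidden=512):
        super().__init__()
        self.time_embed = nn.Sequential(nn.Linear(1, hidden), nn.SiLU(),
                                        nn.Linear(hidden, hidden))
        self.spk_proj = nn.Linear(spk_dim, hidden)
        # Inputs: noisy target estimate x_t and the observed mixture y,
        # concatenated frame-wise with time and speaker embeddings.
        self.net = nn.Sequential(
            nn.Linear(2 * feat_dim + 2 * hidden, hidden), nn.SiLU(),
            nn.Linear(hidden, hidden), nn.SiLU(),
            nn.Linear(hidden, feat_dim),  # score has the shape of x_t
        )

    def forward(self, x_t, y, t, spk_emb):
        # x_t, y: (batch, frames, feat_dim); t: (batch, 1); spk_emb: (batch, spk_dim)
        n_frames = x_t.shape[1]
        t_emb = self.time_embed(t).unsqueeze(1).expand(-1, n_frames, -1)
        s_emb = self.spk_proj(spk_emb).unsqueeze(1).expand(-1, n_frames, -1)
        return self.net(torch.cat([x_t, y, t_emb, s_emb], dim=-1))

score_net = ConditionalScoreNet()
x_t = torch.randn(2, 100, 80); y = torch.randn(2, 100, 80)
t = torch.rand(2, 1); spk = torch.randn(2, 256)
print(score_net(x_t, y, t, spk).shape)  # torch.Size([2, 100, 80])
```

Concatenating the speaker and time embeddings frame-wise is one simple conditioning choice; FiLM-style modulation or cross-attention are common alternatives.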

Connecting Speech Encoder and Large Language Model for ASR

no code implementations • 25 Sep 2023 • Wenyi Yu, Changli Tang, Guangzhi Sun, Xianzhao Chen, Tian Tan, Wei Li, Lu Lu, Zejun Ma, Chao Zhang

Q-Former-based LLMs can generalise well to out-of-domain datasets: a 12% relative WER reduction over the Whisper baseline ASR model was achieved on the Eval2000 test set without using any in-domain training data from Switchboard.

Automatic Speech Recognition (ASR) +3
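As a rough illustration of the Q-Former-style connector this paper discusses, the sketch below uses a fixed set of learnable queries that cross-attend to the speech encoder output and are then projected into the LLM embedding space. All names, dimensions, and layer choices here are assumptions, not the paper's code.

```python
# A minimal sketch (assumptions throughout) of a Q-Former-style connector:
# learnable queries cross-attend to speech encoder outputs, producing a
# short sequence projected into the LLM's input embedding space.
import torch
import torch.nn as nn

class QFormerConnector(nn.Module):
    def __init__(self, enc_dim=1024, llm_dim=4096, n_queries=64, n_heads=8):
        super().__init__()
        self.queries = nn.Parameter(torch.randn(n_queries, enc_dim) * 0.02)
        self.cross_attn = nn.MultiheadAttention(enc_dim, n_heads, batch_first=True)
        self.ffn = nn.Sequential(nn.Linear(enc_dim, 4 * enc_dim), nn.GELU(),
                                 nn.Linear(4 * enc_dim, enc_dim))
        self.norm1 = nn.LayerNorm(enc_dim)
        self.norm2 = nn.LayerNorm(enc_dim)
        self.to_llm = nn.Linear(enc_dim, llm_dim)  # into LLM embedding space

    def forward(self, enc_out):
        # enc_out: (batch, frames, enc_dim) from a (typically frozen) speech encoder
        q = self.queries.unsqueeze(0).expand(enc_out.size(0), -1, -1)
        attn_out, _ = self.cross_attn(q, enc_out, enc_out)
        q = self.norm1(q + attn_out)
        q = self.norm2(q + self.ffn(q))
        return self.to_llm(q)  # (batch, n_queries, llm_dim): soft prompts for the LLM

connector = QFormerConnector()
speech_feats = torch.randn(2, 500, 1024)   # e.g. ~10 s of encoder frames
print(connector(speech_feats).shape)       # torch.Size([2, 64, 4096])
```

Because the number of queries is fixed, the LLM sees a short, length-independent prompt regardless of how long the input utterance is.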

Affect Recognition in Conversations Using Large Language Models

no code implementations • 22 Sep 2023 • Shutong Feng, Guangzhi Sun, Nurul Lubis, Chao Zhang, Milica Gašić

This study delves into the capacity of large language models (LLMs) to recognise human affect in conversations, with a focus on both open-domain chit-chat dialogues and task-oriented dialogues.

Automatic Speech Recognition (ASR) +2

Enhancing Quantised End-to-End ASR Models via Personalisation

1 code implementation • 17 Sep 2023 • Qiuming Zhao, Guangzhi Sun, Chao Zhang, Mingxing Xu, Thomas Fang Zheng

Recent end-to-end automatic speech recognition (ASR) models have become increasingly large, making them particularly challenging to deploy on resource-constrained devices.

Automatic Speech Recognition (ASR) +1

Cross-Utterance Conditioned VAE for Speech Generation

no code implementations • 8 Sep 2023 • Yang Li, Cheng Yu, Guangzhi Sun, Weiqin Zu, Zheng Tian, Ying Wen, Wei Pan, Chao Zhang, Jun Wang, Yang Yang, Fanglei Sun

Experimental results on the LibriTTS dataset demonstrate that our proposed models significantly enhance speech synthesis and editing, producing more natural and expressive speech.

Speech Synthesis

Knowledge-Aware Audio-Grounded Generative Slot Filling for Limited Annotated Data

no code implementations • 4 Jul 2023 • Guangzhi Sun, Chao Zhang, Ivan Vulić, Paweł Budzianowski, Philip C. Woodland

In this work, we propose a Knowledge-Aware Audio-Grounded generative slot-filling framework, termed KA2G, that focuses on few-shot and zero-shot slot filling for ToD with speech input.

Automatic Speech Recognition (ASR) +6

Can Contextual Biasing Remain Effective with Whisper and GPT-2?

1 code implementation • 2 Jun 2023 • Guangzhi Sun, Xianrui Zheng, Chao Zhang, Philip C. Woodland

End-to-end automatic speech recognition (ASR) and large language models, such as Whisper and GPT-2, have recently been scaled to use vast amounts of training data.

Automatic Speech Recognition (ASR) +1

Graph Neural Networks for Contextual ASR with the Tree-Constrained Pointer Generator

1 code implementation • 30 May 2023 • Guangzhi Sun, Chao Zhang, Phil Woodland

The incorporation of biasing words obtained through contextual knowledge is of paramount importance in automatic speech recognition (ASR) applications.

Automatic Speech Recognition (ASR) +1

End-to-end Spoken Language Understanding with Tree-constrained Pointer Generator

1 code implementation • 29 Oct 2022 • Guangzhi Sun, Chao Zhang, Philip C. Woodland

Specifically, this work studies the tree-constrained pointer generator (TCPGen), a powerful and efficient biasing model component that leverages a slot shortlist with corresponding entities to extract biasing lists.

Intent Classification +6
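As a rough sketch of the TCPGen idea described above: biasing words are organised into a prefix tree over subword units, the tree constrains a pointer distribution at each decoding step, and that distribution is interpolated with the model's own output distribution. The vocabulary, tokenisation, and fixed generation probability below are illustrative simplifications; in TCPGen, the generation probability is itself predicted by the network.

```python
# Illustrative sketch of the tree-constrained pointer generator idea:
# a prefix tree over biasing-word subword sequences constrains a pointer
# distribution, which is interpolated with the model's own distribution.
import numpy as np

def build_prefix_tree(biasing_words):
    """Build a nested-dict prefix tree from tuples of subword token ids."""
    root = {}
    for word in biasing_words:
        node = root
        for token in word:
            node = node.setdefault(token, {})
    return root

def tcpgen_step(model_probs, node, pointer_logits, p_gen):
    """One decoding step: mix the model distribution with a tree-constrained pointer."""
    valid = list(node.keys())        # tokens allowed by the prefix tree here
    if not valid:                    # out of the tree: fall back to pure generation
        return model_probs
    masked = np.full_like(model_probs, -np.inf)
    masked[valid] = pointer_logits[valid]
    exp = np.exp(masked - masked[valid].max())
    pointer = exp / exp.sum()        # softmax restricted to valid tokens
    return p_gen * model_probs + (1.0 - p_gen) * pointer

# Hypothetical 4-token subword vocabulary and one biasing word "turner".
vocab = {"tur": 0, "ner": 1, "the": 2, "cat": 3}
tree = build_prefix_tree([(vocab["tur"], vocab["ner"])])
model_probs = np.array([0.1, 0.2, 0.5, 0.2])
pointer_logits = np.random.randn(4)
print(tcpgen_step(model_probs, tree, pointer_logits, p_gen=0.7))
```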

Minimising Biasing Word Errors for Contextual ASR with the Tree-Constrained Pointer Generator

no code implementations • 18 May 2022 • Guangzhi Sun, Chao Zhang, Philip C. Woodland

MBWE and BLMD further improved the effectiveness of TCPGen and achieved more significant WER reductions on the biasing words.

Dialogue State Tracking, Language Modelling +3

Cross-Utterance Conditioned VAE for Non-Autoregressive Text-to-Speech

1 code implementation • ACL 2022 • Yang Li, Cheng Yu, Guangzhi Sun, Hua Jiang, Fanglei Sun, Weiqin Zu, Ying Wen, Yang Yang, Jun Wang

Modelling prosody variation is critical for synthesizing natural and expressive speech in end-to-end text-to-speech (TTS) systems.

Combination of Deep Speaker Embeddings for Diarisation

no code implementations • 22 Oct 2020 • Guangzhi Sun, Chao Zhang, Phil Woodland

Significant progress has recently been made in speaker diarisation after the introduction of d-vectors as speaker embeddings extracted from neural network (NN) speaker classifiers for clustering speech segments.

Action Detection, Activity Detection +2
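The snippet above refers to the standard d-vector diarisation pipeline: embed each speech segment with a neural speaker classifier, then cluster the embeddings. A minimal sketch with placeholder embeddings follows; the clustering method shown is one common choice, not necessarily the paper's.

```python
# A minimal sketch of diarisation by clustering speaker embeddings
# (d-vectors). The embeddings here are random placeholders standing in
# for the output of a neural speaker classifier.
import numpy as np
from sklearn.cluster import AgglomerativeClustering

rng = np.random.default_rng(0)
# Fake d-vectors for 20 speech segments from two speakers.
spk_a = rng.normal(0.0, 0.1, size=(10, 256)) + rng.normal(size=256)
spk_b = rng.normal(0.0, 0.1, size=(10, 256)) + rng.normal(size=256)
d_vectors = np.vstack([spk_a, spk_b])

# L2-normalise so Euclidean distance behaves like cosine distance.
d_vectors /= np.linalg.norm(d_vectors, axis=1, keepdims=True)

labels = AgglomerativeClustering(n_clusters=2).fit_predict(d_vectors)
print(labels)   # segments grouped by speaker identity
```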

Fully-hierarchical fine-grained prosody modeling for interpretable speech synthesis

no code implementations • 6 Feb 2020 • Guangzhi Sun, Yu Zhang, Ron J. Weiss, Yuan Cao, Heiga Zen, Yonghui Wu

This paper proposes a hierarchical, fine-grained and interpretable latent variable model for prosody based on the Tacotron 2 text-to-speech model.

Disentanglement, Speech Synthesis

Speaker diarisation using 2D self-attentive combination of embeddings

no code implementations • 8 Feb 2019 • Guangzhi Sun, Chao Zhang, Phil Woodland

This combination uses a 2-dimensional (2D) self-attentive structure, which extends the standard self-attentive layer by averaging not only across time but also across different types of embeddings.
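As a rough sketch of the 2D self-attentive combination described above, the module below scores every (embedding type, time step) cell and normalises the weights jointly across both axes before averaging; the shapes and scoring network are assumptions for illustration, not the paper's exact design.

```python
# Illustrative sketch of a 2D self-attentive combination: attention weights
# are computed jointly over time steps and embedding types, then used to
# average a stack of embeddings into a single vector.
import torch
import torch.nn as nn

class TwoDSelfAttentiveCombination(nn.Module):
    def __init__(self, emb_dim=128, hidden=64):
        super().__init__()
        self.score = nn.Sequential(nn.Linear(emb_dim, hidden), nn.Tanh(),
                                   nn.Linear(hidden, 1))

    def forward(self, embs):
        # embs: (batch, n_types, n_frames, emb_dim)
        b, n_types, n_frames, _ = embs.shape
        scores = self.score(embs).squeeze(-1)            # (b, n_types, n_frames)
        # softmax jointly across both the type and time axes
        weights = torch.softmax(scores.view(b, -1), dim=-1).view(b, n_types, n_frames)
        return (weights.unsqueeze(-1) * embs).sum(dim=(1, 2))  # (b, emb_dim)

comb = TwoDSelfAttentiveCombination()
embs = torch.randn(4, 2, 50, 128)   # two embedding types, 50 frames each
print(comb(embs).shape)              # torch.Size([4, 128])
```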
