Search Results for author: Guangzhi Sun

Found 24 papers, 8 papers with code

M$^3$AV: A Multimodal, Multigenre, and Multipurpose Audio-Visual Academic Lecture Dataset

no code implementations • 21 Mar 2024 • Zhe Chen, Heyang Liu, Wenyi Yu, Guangzhi Sun, Hongcheng Liu, Ji Wu, Chao Zhang, Yu Wang, Yanfeng Wang

Although multiple academic video datasets have been constructed and released, few of them support both multimodal content recognition and understanding tasks, which is partially due to the lack of high-quality human annotations.

Speech Recognition +1

Speech-based Slot Filling using Large Language Models

no code implementations • 13 Nov 2023 • Guangzhi Sun, Shutong Feng, Dongcheng Jiang, Chao Zhang, Milica Gašić, Philip C. Woodland

Recent advances in large language models (LLMs) have demonstrated unprecedented capabilities across a variety of language tasks.

In-Context Learning, Slot Filling +1

SALMONN: Towards Generic Hearing Abilities for Large Language Models

1 code implementation • 20 Oct 2023 • Changli Tang, Wenyi Yu, Guangzhi Sun, Xianzhao Chen, Tian Tan, Wei Li, Lu Lu, Zejun Ma, Chao Zhang

Hearing, the perception and understanding of general auditory information comprising at least three types of sounds (speech, audio events, and music), is arguably an essential ability for artificial intelligence (AI) agents in the physical world.

Audio Captioning, Automatic Speech Recognition +10

Fine-grained Audio-Visual Joint Representations for Multimodal Large Language Models

2 code implementations • 9 Oct 2023 • Guangzhi Sun, Wenyi Yu, Changli Tang, Xianzhao Chen, Tian Tan, Wei Li, Lu Lu, Zejun Ma, Chao Zhang

Audio-visual large language models (LLMs) have drawn significant attention, yet the fine-grained combination of both input streams remains under-explored; this combination is challenging but necessary for LLMs to understand general video inputs.

Question Answering, Video Question Answering

Conditional Diffusion Model for Target Speaker Extraction

no code implementations • 7 Oct 2023 • Theodor Nguyen, Guangzhi Sun, Xianrui Zheng, Chao Zhang, Philip C. Woodland

For the reverse-time process, a parametrised score function is conditioned on a target speaker embedding to extract the target speaker from the mixture of sources.

Target Speaker Extraction
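The snippet above describes the core mechanism: during the reverse-time diffusion process, a score network is conditioned on a target speaker embedding in addition to the noisy state and the observed mixture. Below is a minimal, illustrative sketch of such a conditional score network; the architecture, feature dimensions, and conditioning scheme are assumptions for illustration, not the paper's implementation.

```python
# Minimal sketch (illustrative, not the paper's code): a score network for
# score-based target speaker extraction, conditioned on a target speaker
# embedding. Dimensions and architecture are assumptions.
import torch
import torch.nn as nn

class ConditionalScoreNet(nn.Module):
    def __init__(self, feat_dim=80, spk_dim=256, hidden=512):
        super().__init__()
        self.time_embed = nn.Sequential(nn.Linear(1, hidden), nn.SiLU(),
                                        nn.Linear(hidden, hidden))
        self.spk_proj = nn.Linear(spk_dim, hidden)
        # Inputs: noisy target estimate x_t and the observed mixture y,
        # concatenated frame-wise with time and speaker embeddings.
        self.net = nn.Sequential(
            nn.Linear(2 * feat_dim + 2 * hidden, hidden), nn.SiLU(),
            nn.Linear(hidden, hidden), nn.SiLU(),
            nn.Linear(hidden, feat_dim),  # score has the shape of x_t
        )

    def forward(self, x_t, y, t, spk_emb):
        # x_t, y: (batch, frames, feat_dim); t: (batch, 1); spk_emb: (batch, spk_dim)
        n_frames = x_t.shape[1]
        t_emb = self.time_embed(t).unsqueeze(1).expand(-1, n_frames, -1)
        s_emb = self.spk_proj(spk_emb).unsqueeze(1).expand(-1, n_frames, -1)
        return self.net(torch.cat([x_t, y, t_emb, s_emb], dim=-1))

score_net = ConditionalScoreNet()
x_t = torch.randn(2, 100, 80); y = torch.randn(2, 100, 80)
t = torch.rand(2, 1); spk = torch.randn(2, 256)
print(score_net(x_t, y, t, spk).shape)  # torch.Size([2, 100, 80])
```

Concatenating the speaker and time embeddings frame-wise is one simple conditioning choice; FiLM-style modulation or cross-attention are common alternatives.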

Connecting Speech Encoder and Large Language Model for ASR

no code implementations • 25 Sep 2023 • Wenyi Yu, Changli Tang, Guangzhi Sun, Xianzhao Chen, Tian Tan, Wei Li, Lu Lu, Zejun Ma, Chao Zhang

Q-Former-based LLMs can generalise well to out-of-domain datasets: a 12% relative WER reduction over the Whisper baseline ASR model was achieved on the Eval2000 test set without using any in-domain training data from Switchboard.

Automatic Speech Recognition (ASR) +3
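As a rough illustration of the Q-Former-style connector this paper discusses, the sketch below uses a fixed set of learnable queries that cross-attend to the speech encoder output and are then projected into the LLM embedding space. All names, dimensions, and layer choices here are assumptions, not the paper's code.

```python
# A minimal sketch (assumptions throughout) of a Q-Former-style connector:
# learnable queries cross-attend to speech encoder outputs, producing a
# short sequence projected into the LLM's input embedding space.
import torch
import torch.nn as nn

class QFormerConnector(nn.Module):
    def __init__(self, enc_dim=1024, llm_dim=4096, n_queries=64, n_heads=8):
        super().__init__()
        self.queries = nn.Parameter(torch.randn(n_queries, enc_dim) * 0.02)
        self.cross_attn = nn.MultiheadAttention(enc_dim, n_heads, batch_first=True)
        self.ffn = nn.Sequential(nn.Linear(enc_dim, 4 * enc_dim), nn.GELU(),
                                 nn.Linear(4 * enc_dim, enc_dim))
        self.norm1 = nn.LayerNorm(enc_dim)
        self.norm2 = nn.LayerNorm(enc_dim)
        self.to_llm = nn.Linear(enc_dim, llm_dim)  # into LLM embedding space

    def forward(self, enc_out):
        # enc_out: (batch, frames, enc_dim) from a (typically frozen) speech encoder
        q = self.queries.unsqueeze(0).expand(enc_out.size(0), -1, -1)
        attn_out, _ = self.cross_attn(q, enc_out, enc_out)
        q = self.norm1(q + attn_out)
        q = self.norm2(q + self.ffn(q))
        return self.to_llm(q)  # (batch, n_queries, llm_dim): soft prompts for the LLM

connector = QFormerConnector()
speech_feats = torch.randn(2, 500, 1024)   # e.g. ~10 s of encoder frames
print(connector(speech_feats).shape)       # torch.Size([2, 64, 4096])
```

Because the number of queries is fixed, the LLM sees a short, length-independent prompt regardless of how long the input utterance is.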

Affect Recognition in Conversations Using Large Language Models

no code implementations • 22 Sep 2023 • Shutong Feng, Guangzhi Sun, Nurul Lubis, Chao Zhang, Milica Gašić

This study delves into the capacity of large language models (LLMs) to recognise human affect in conversations, with a focus on both open-domain chit-chat dialogues and task-oriented dialogues.

Automatic Speech Recognition (ASR) +2

Enhancing Quantised End-to-End ASR Models via Personalisation

1 code implementation • 17 Sep 2023 • Qiuming Zhao, Guangzhi Sun, Chao Zhang, Mingxing Xu, Thomas Fang Zheng

Recent end-to-end automatic speech recognition (ASR) models have become increasingly large, making them particularly challenging to deploy on resource-constrained devices.

Automatic Speech Recognition (ASR) +1

Cross-Utterance Conditioned VAE for Speech Generation

no code implementations • 8 Sep 2023 • Yang Li, Cheng Yu, Guangzhi Sun, Weiqin Zu, Zheng Tian, Ying Wen, Wei Pan, Chao Zhang, Jun Wang, Yang Yang, Fanglei Sun

Experimental results on the LibriTTS dataset demonstrate that our proposed models significantly enhance speech synthesis and editing, producing more natural and expressive speech.

Speech Synthesis

Knowledge-Aware Audio-Grounded Generative Slot Filling for Limited Annotated Data

no code implementations • 4 Jul 2023 • Guangzhi Sun, Chao Zhang, Ivan Vulić, Paweł Budzianowski, Philip C. Woodland

In this work, we propose a Knowledge-Aware Audio-Grounded generative slot-filling framework, termed KA2G, that focuses on few-shot and zero-shot slot filling for ToD with speech input.

Automatic Speech Recognition (ASR) +6

Can Contextual Biasing Remain Effective with Whisper and GPT-2?

1 code implementation • 2 Jun 2023 • Guangzhi Sun, Xianrui Zheng, Chao Zhang, Philip C. Woodland

End-to-end automatic speech recognition (ASR) and large language models, such as Whisper and GPT-2, have recently been scaled to use vast amounts of training data.

Automatic Speech Recognition (ASR) +1

Graph Neural Networks for Contextual ASR with the Tree-Constrained Pointer Generator

1 code implementation • 30 May 2023 • Guangzhi Sun, Chao Zhang, Phil Woodland

The incorporation of biasing words obtained through contextual knowledge is of paramount importance in automatic speech recognition (ASR) applications.

Automatic Speech Recognition (ASR) +1

End-to-end Spoken Language Understanding with Tree-constrained Pointer Generator

1 code implementation • 29 Oct 2022 • Guangzhi Sun, Chao Zhang, Philip C. Woodland

Specifically, this work studies the tree-constrained pointer generator (TCPGen), a powerful and efficient biasing model component that leverages a slot shortlist with corresponding entities to extract biasing lists.

Intent Classification +6
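As a rough sketch of the TCPGen idea described above: biasing words are organised into a prefix tree over subword units, the tree constrains a pointer distribution at each decoding step, and that distribution is interpolated with the model's own output distribution. The vocabulary, tokenisation, and fixed generation probability below are illustrative simplifications; in TCPGen, the generation probability is itself predicted by the network.

```python
# Illustrative sketch of the tree-constrained pointer generator idea:
# a prefix tree over biasing-word subword sequences constrains a pointer
# distribution, which is interpolated with the model's own distribution.
import numpy as np

def build_prefix_tree(biasing_words):
    """Build a nested-dict prefix tree from tuples of subword token ids."""
    root = {}
    for word in biasing_words:
        node = root
        for token in word:
            node = node.setdefault(token, {})
    return root

def tcpgen_step(model_probs, node, pointer_logits, p_gen):
    """One decoding step: mix the model distribution with a tree-constrained pointer."""
    valid = list(node.keys())        # tokens allowed by the prefix tree here
    if not valid:                    # out of the tree: fall back to pure generation
        return model_probs
    masked = np.full_like(model_probs, -np.inf)
    masked[valid] = pointer_logits[valid]
    exp = np.exp(masked - masked[valid].max())
    pointer = exp / exp.sum()        # softmax restricted to valid tokens
    return p_gen * model_probs + (1.0 - p_gen) * pointer

# Hypothetical 4-token subword vocabulary and one biasing word "turner".
vocab = {"tur": 0, "ner": 1, "the": 2, "cat": 3}
tree = build_prefix_tree([(vocab["tur"], vocab["ner"])])
model_probs = np.array([0.1, 0.2, 0.5, 0.2])
pointer_logits = np.random.randn(4)
print(tcpgen_step(model_probs, tree, pointer_logits, p_gen=0.7))
```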

Minimising Biasing Word Errors for Contextual ASR with the Tree-Constrained Pointer Generator

no code implementations • 18 May 2022 • Guangzhi Sun, Chao Zhang, Philip C. Woodland

MBWE and BLMD further improved the effectiveness of TCPGen and achieved more significant WER reductions on the biasing words.

Dialogue State Tracking, Language Modelling +3

Cross-Utterance Conditioned VAE for Non-Autoregressive Text-to-Speech

1 code implementation • ACL 2022 • Yang Li, Cheng Yu, Guangzhi Sun, Hua Jiang, Fanglei Sun, Weiqin Zu, Ying Wen, Yang Yang, Jun Wang

Modelling prosody variation is critical for synthesizing natural and expressive speech in end-to-end text-to-speech (TTS) systems.

Combination of Deep Speaker Embeddings for Diarisation

no code implementations • 22 Oct 2020 • Guangzhi Sun, Chao Zhang, Phil Woodland

Significant progress has recently been made in speaker diarisation after the introduction of d-vectors as speaker embeddings extracted from neural network (NN) speaker classifiers for clustering speech segments.

Action Detection, Activity Detection +2
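The snippet above refers to the standard d-vector diarisation pipeline: embed each speech segment with a neural speaker classifier, then cluster the embeddings. A minimal sketch with placeholder embeddings follows; the clustering method shown is one common choice, not necessarily the paper's.

```python
# A minimal sketch of diarisation by clustering speaker embeddings
# (d-vectors). The embeddings here are random placeholders standing in
# for the output of a neural speaker classifier.
import numpy as np
from sklearn.cluster import AgglomerativeClustering

rng = np.random.default_rng(0)
# Fake d-vectors for 20 speech segments from two speakers.
spk_a = rng.normal(0.0, 0.1, size=(10, 256)) + rng.normal(size=256)
spk_b = rng.normal(0.0, 0.1, size=(10, 256)) + rng.normal(size=256)
d_vectors = np.vstack([spk_a, spk_b])

# L2-normalise so Euclidean distance behaves like cosine distance.
d_vectors /= np.linalg.norm(d_vectors, axis=1, keepdims=True)

labels = AgglomerativeClustering(n_clusters=2).fit_predict(d_vectors)
print(labels)   # segments grouped by speaker identity
```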

Fully-hierarchical fine-grained prosody modeling for interpretable speech synthesis

no code implementations • 6 Feb 2020 • Guangzhi Sun, Yu Zhang, Ron J. Weiss, Yuan Cao, Heiga Zen, Yonghui Wu

This paper proposes a hierarchical, fine-grained and interpretable latent variable model for prosody based on the Tacotron 2 text-to-speech model.

Disentanglement, Speech Synthesis

Speaker diarisation using 2D self-attentive combination of embeddings

no code implementations • 8 Feb 2019 • Guangzhi Sun, Chao Zhang, Phil Woodland

This combination uses a 2-dimensional (2D) self-attentive structure, which extends the standard self-attentive layer by averaging not only across time but also across different types of embeddings.
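As a rough sketch of the 2D self-attentive combination described above, the module below scores every (embedding type, time step) cell and normalises the weights jointly across both axes before averaging; the shapes and scoring network are assumptions for illustration, not the paper's exact design.

```python
# Illustrative sketch of a 2D self-attentive combination: attention weights
# are computed jointly over time steps and embedding types, then used to
# average a stack of embeddings into a single vector.
import torch
import torch.nn as nn

class TwoDSelfAttentiveCombination(nn.Module):
    def __init__(self, emb_dim=128, hidden=64):
        super().__init__()
        self.score = nn.Sequential(nn.Linear(emb_dim, hidden), nn.Tanh(),
                                   nn.Linear(hidden, 1))

    def forward(self, embs):
        # embs: (batch, n_types, n_frames, emb_dim)
        b, n_types, n_frames, _ = embs.shape
        scores = self.score(embs).squeeze(-1)            # (b, n_types, n_frames)
        # softmax jointly across both the type and time axes
        weights = torch.softmax(scores.view(b, -1), dim=-1).view(b, n_types, n_frames)
        return (weights.unsqueeze(-1) * embs).sum(dim=(1, 2))  # (b, emb_dim)

comb = TwoDSelfAttentiveCombination()
embs = torch.randn(4, 2, 50, 128)   # two embedding types, 50 frames each
print(comb(embs).shape)              # torch.Size([4, 128])
```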
