no code implementations • 21 Mar 2024 • Zhe Chen, Heyang Liu, Wenyi Yu, Guangzhi Sun, Hongcheng Liu, Ji Wu, Chao Zhang, Yu Wang, Yanfeng Wang
Although multiple academic video datasets have been constructed and released, few of them support both multimodal content recognition and understanding tasks, which is partially due to the lack of high-quality human annotations.
no code implementations • 4 Mar 2024 • Xiaoliang Luo, Akilles Rechardt, Guangzhi Sun, Kevin K. Nejad, Felipe Yáñez, Bati Yilmaz, Kangjoo Lee, Alexandra O. Cohen, Valentina Borghesani, Anton Pashkov, Daniele Marinazzo, Jonathan Nicholas, Alessandro Salatiello, Ilia Sucholutsky, Pasquale Minervini, Sepehr Razavi, Roberta Rocca, Elkhan Yusifov, Tereza Okalova, Nianlong Gu, Martin Ferianc, Mikail Khona, Kaustubh R. Patil, Pui-Shee Lee, Rui Mata, Nicholas E. Myers, Jennifer K Bizley, Sebastian Musslick, Isil Poyraz Bilgin, Guiomar Niso, Justin M. Ales, Michael Gaebler, N Apurva Ratan Murty, Leyla Loued-Khenissi, Anna Behler, Chloe M. Hall, Jessica Dafflon, Sherry Dongqi Bao, Bradley C. Love
LLMs trained on the vast scientific literature could potentially integrate noisy yet interrelated findings to forecast novel results better than human experts.
no code implementations • 19 Feb 2024 • Nineli Lashkarashvili, Wen Wu, Guangzhi Sun, Philip C. Woodland
Foundation models have shown superior performance for speech emotion recognition (SER).
no code implementations • 13 Nov 2023 • Guangzhi Sun, Shutong Feng, Dongcheng Jiang, Chao Zhang, Milica Gašić, Philip C. Woodland
Recent advances in large language models (LLMs) have demonstrated unprecedented capabilities across a wide range of language tasks.
1 code implementation • 27 Oct 2023 • Jeff Hwang, Moto Hira, Caroline Chen, Xiaohui Zhang, Zhaoheng Ni, Guangzhi Sun, Pingchuan Ma, Ruizhe Huang, Vineel Pratap, Yuekai Zhang, Anurag Kumar, Chin-Yun Yu, Chuang Zhu, Chunxi Liu, Jacob Kahn, Mirco Ravanelli, Peng Sun, Shinji Watanabe, Yangyang Shi, Yumeng Tao, Robin Scheibler, Samuele Cornell, Sean Kim, Stavros Petridis
TorchAudio is an open-source audio and speech processing library built for PyTorch.
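For orientation, a minimal usage sketch of the library: load a waveform and compute mel-spectrogram features (the file path is a placeholder).

```python
import torchaudio

# Load a waveform and compute mel-spectrogram features.
waveform, sample_rate = torchaudio.load("speech.wav")  # placeholder path
mel_transform = torchaudio.transforms.MelSpectrogram(
    sample_rate=sample_rate, n_mels=80
)
features = mel_transform(waveform)  # (channels, n_mels, frames)
```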
1 code implementation • 20 Oct 2023 • Changli Tang, Wenyi Yu, Guangzhi Sun, Xianzhao Chen, Tian Tan, Wei Li, Lu Lu, Zejun Ma, Chao Zhang
Hearing, the perception and understanding of general auditory information comprising at least three types of sounds (speech, audio events, and music), is arguably an essential ability for artificial intelligence (AI) agents in the physical world.
2 code implementations • 9 Oct 2023 • Guangzhi Sun, Wenyi Yu, Changli Tang, Xianzhao Chen, Tian Tan, Wei Li, Lu Lu, Zejun Ma, Chao Zhang
Audio-visual large language models (LLMs) have drawn significant attention, yet the fine-grained combination of the two input streams remains under-explored, despite being both challenging and necessary for LLMs to understand general video inputs.
no code implementations • 7 Oct 2023 • Theodor Nguyen, Guangzhi Sun, Xianrui Zheng, Chao Zhang, Philip C Woodland
For the reverse-time process, a parametrised score function is conditioned on a target speaker embedding to extract the target speaker from the mixture of sources.
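As an illustration of this kind of conditioning, here is a minimal sketch of a score network that takes the noisy mixture, the diffusion time, and a target speaker embedding; all dimensions and layer choices are illustrative assumptions, not the paper's architecture.

```python
import torch
import torch.nn as nn

class ConditionalScoreNet(nn.Module):
    """Score model s_theta(x_t, t | e): the target speaker embedding e
    conditions the network so the reverse-time process drifts toward
    the target speaker's signal."""
    def __init__(self, feat_dim=80, emb_dim=256, hidden=512):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(feat_dim + emb_dim + 1, hidden), nn.SiLU(),
            nn.Linear(hidden, hidden), nn.SiLU(),
            nn.Linear(hidden, feat_dim),
        )

    def forward(self, x_t, t, spk_emb):  # (B, T, F), (B,), (B, E)
        B, T, _ = x_t.shape
        # Broadcast the diffusion time and speaker embedding over frames.
        t_feat = t.view(B, 1, 1).expand(B, T, 1)
        e_feat = spk_emb.unsqueeze(1).expand(B, T, -1)
        return self.net(torch.cat([x_t, t_feat, e_feat], dim=-1))
```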
no code implementations • 25 Sep 2023 • Wenyi Yu, Changli Tang, Guangzhi Sun, Xianzhao Chen, Tian Tan, Wei Li, Lu Lu, Zejun Ma, Chao Zhang
Q-Former-based LLMs can generalise well to out-of-domain datasets: a 12% relative WER reduction over the Whisper baseline ASR model was achieved on the Eval2000 test set without using any in-domain training data from Switchboard (a Q-Former-style connector is sketched below).
Automatic Speech Recognition (ASR) +3
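A minimal sketch of such a connector: a fixed set of learned queries cross-attends to speech encoder outputs, producing a short soft-prompt sequence for a frozen LLM. Dimensions and layer choices here are illustrative assumptions, not the paper's configuration.

```python
import torch
import torch.nn as nn

class QFormerBridge(nn.Module):
    """Learned queries attend over speech encoder outputs and are
    projected into the LLM's embedding space as soft prompts."""
    def __init__(self, enc_dim=1024, llm_dim=4096, n_queries=32):
        super().__init__()
        self.queries = nn.Parameter(torch.randn(n_queries, enc_dim))
        self.attn = nn.MultiheadAttention(enc_dim, num_heads=8,
                                          batch_first=True)
        self.proj = nn.Linear(enc_dim, llm_dim)

    def forward(self, enc_out):  # enc_out: (B, T, enc_dim)
        q = self.queries.unsqueeze(0).expand(enc_out.size(0), -1, -1)
        fused, _ = self.attn(q, enc_out, enc_out)  # cross-attention
        return self.proj(fused)  # (B, n_queries, llm_dim) soft prompts
```

The key property is that the output length is fixed at n_queries regardless of the input duration, which keeps the LLM's prompt short.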
no code implementations • 22 Sep 2023 • Shutong Feng, Guangzhi Sun, Nurul Lubis, Chao Zhang, Milica Gašić
This study delves into the capacity of large language models (LLMs) to recognise human affect in conversations, with a focus on both open-domain chit-chat dialogues and task-oriented dialogues.
Automatic Speech Recognition (ASR) +2
1 code implementation • 17 Sep 2023 • Qiuming Zhao, Guangzhi Sun, Chao Zhang, Mingxing Xu, Thomas Fang Zheng
Recent end-to-end automatic speech recognition (ASR) models have become increasingly large, making them particularly challenging to deploy on resource-constrained devices.
Automatic Speech Recognition (ASR) +1
no code implementations • 8 Sep 2023 • Yang Li, Cheng Yu, Guangzhi Sun, Weiqin Zu, Zheng Tian, Ying Wen, Wei Pan, Chao Zhang, Jun Wang, Yang Yang, Fanglei Sun
Experimental results on the LibriTTS dataset demonstrate that our proposed models significantly enhance speech synthesis and editing, producing more natural and expressive speech.
no code implementations • 4 Jul 2023 • Guangzhi Sun, Chao Zhang, Ivan Vulić, Paweł Budzianowski, Philip C. Woodland
In this work, we propose a Knowledge-Aware Audio-Grounded generative slot-filling framework, termed KA2G, that focuses on few-shot and zero-shot slot filling for ToD with speech input.
Automatic Speech Recognition (ASR) +6
1 code implementation • 2 Jun 2023 • Guangzhi Sun, Xianrui Zheng, Chao Zhang, Philip C. Woodland
End-to-end automatic speech recognition (ASR) and large language models, such as Whisper and GPT-2, have recently been scaled to use vast amounts of training data.
Automatic Speech Recognition (ASR) +1
1 code implementation • 30 May 2023 • Guangzhi Sun, Chao Zhang, Phil Woodland
The incorporation of biasing words obtained through contextual knowledge is of paramount importance in automatic speech recognition (ASR) applications.
Automatic Speech Recognition (ASR) +1
1 code implementation • 29 Oct 2022 • Guangzhi Sun, Chao Zhang, Philip C. Woodland
Specifically, a tree-constrained pointer generator (TCPGen), a powerful and efficient biasing model component, is studied, which leverages a slot shortlist with corresponding entities to extract biasing lists.
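A toy sketch of the prefix-tree side of this idea: biasing words are tokenised into subwords and arranged in a tree, so that at each decoding step only tokens reachable from the current node can receive pointer probability mass. Here `tokenize` is a placeholder for any subword tokenizer, and the example uses character-level splitting for simplicity.

```python
def build_prefix_tree(biasing_words, tokenize):
    """Build a subword prefix tree over the biasing list."""
    root = {}
    for word in biasing_words:
        node = root
        for token in tokenize(word):
            node = node.setdefault(token, {})
        node[None] = True  # end-of-word marker
    return root

def valid_next_tokens(node):
    """Tokens reachable from the current tree node; only these can
    receive pointer probability mass at this decoding step."""
    return [token for token in node if token is not None]

# Toy example with character-level "tokenisation":
tree = build_prefix_tree(["Turing", "Turin"], tokenize=list)
print(valid_next_tokens(tree))  # ['T']
```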
no code implementations • 2 Jul 2022 • Guangzhi Sun, Chao Zhang, Philip C. Woodland
Incorporating biasing words obtained as contextual knowledge is critical for many automatic speech recognition (ASR) applications.
Automatic Speech Recognition (ASR) +1
no code implementations • 18 May 2022 • Guangzhi Sun, Chao Zhang, Philip C Woodland
MBWE and BLMD further improved the effectiveness of TCPGen and achieved more significant WER reductions on the biasing words.
1 code implementation • ACL 2022 • Yang Li, Cheng Yu, Guangzhi Sun, Hua Jiang, Fanglei Sun, Weiqin Zu, Ying Wen, Yang Yang, Jun Wang
Modelling prosody variation is critical for synthesizing natural and expressive speech in end-to-end text-to-speech (TTS) systems.
no code implementations • 1 Sep 2021 • Guangzhi Sun, Chao Zhang, Philip C. Woodland
Contextual knowledge is important for real-world automatic speech recognition (ASR) applications.
Automatic Speech Recognition (ASR) +1
no code implementations • 22 Oct 2020 • Guangzhi Sun, Chao Zhang, Phil Woodland
Significant progress has recently been made in speaker diarisation after the introduction of d-vectors as speaker embeddings extracted from neural network (NN) speaker classifiers for clustering speech segments.
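A minimal sketch of the clustering back-end such systems build on: per-segment d-vectors are length-normalised and grouped by cosine distance. The actual diarisation pipeline is considerably more involved, and `n_speakers` is assumed known here for simplicity.

```python
import numpy as np
from sklearn.cluster import AgglomerativeClustering

def cluster_segments(d_vectors, n_speakers):
    """Assign a speaker label to each segment's d-vector by
    average-linkage clustering on cosine distance."""
    d_vectors = d_vectors / np.linalg.norm(d_vectors, axis=1, keepdims=True)
    clusterer = AgglomerativeClustering(
        n_clusters=n_speakers,
        metric="cosine",  # scikit-learn >= 1.2 (older versions: affinity=)
        linkage="average",
    )
    return clusterer.fit_predict(d_vectors)  # one label per segment
```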
no code implementations • 6 Feb 2020 • Guangzhi Sun, Yu Zhang, Ron J. Weiss, Yuan Cao, Heiga Zen, Andrew Rosenberg, Bhuvana Ramabhadran, Yonghui Wu
Recent neural text-to-speech (TTS) models with fine-grained latent features enable precise control of the prosody of synthesized speech.
Automatic Speech Recognition (ASR) +3
no code implementations • 6 Feb 2020 • Guangzhi Sun, Yu Zhang, Ron J. Weiss, Yuan Cao, Heiga Zen, Yonghui Wu
This paper proposes a hierarchical, fine-grained and interpretable latent variable model for prosody based on the Tacotron 2 text-to-speech model.
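A minimal sketch of the general idea, assuming a two-level scheme in which an utterance-level latent conditions per-phoneme latents via the usual Gaussian reparameterisation; dimensions and layer choices are illustrative, not the paper's.

```python
import torch
import torch.nn as nn

class HierarchicalProsodyLatent(nn.Module):
    """Two-level prosody latent: an utterance-level latent conditions
    per-phoneme latents, each sampled by reparameterisation."""
    def __init__(self, enc_dim=256, z_dim=16):
        super().__init__()
        self.utt_head = nn.Linear(enc_dim, 2 * z_dim)
        self.phn_head = nn.Linear(enc_dim + z_dim, 2 * z_dim)

    @staticmethod
    def sample(stats):
        mu, logvar = stats.chunk(2, dim=-1)
        return mu + torch.randn_like(mu) * (0.5 * logvar).exp()

    def forward(self, utt_enc, phn_enc):  # (B, E), (B, L, E)
        z_utt = self.sample(self.utt_head(utt_enc))  # (B, z)
        z_rep = z_utt.unsqueeze(1).expand(-1, phn_enc.size(1), -1)
        z_phn = self.sample(self.phn_head(torch.cat([phn_enc, z_rep], -1)))
        return z_utt, z_phn  # condition the TTS decoder on these
```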
no code implementations • 8 Feb 2019 • Guangzhi Sun, Chao Zhang, Phil Woodland
This combination uses a 2-dimensional (2D) self-attentive structure, which extends the standard self-attentive layer by averaging not only across time but also across different types of embeddings.
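A minimal sketch of such 2D self-attentive pooling, assuming the input stacks K embedding types over T frames so that attention weights are normalised jointly across both axes; shapes and layer sizes are illustrative.

```python
import torch
import torch.nn as nn

class TwoDSelfAttentivePool(nn.Module):
    """Attention weights are computed jointly over time frames and
    embedding types, then used for a weighted average over both axes."""
    def __init__(self, dim, hidden=128):
        super().__init__()
        self.score = nn.Sequential(
            nn.Linear(dim, hidden), nn.Tanh(), nn.Linear(hidden, 1)
        )

    def forward(self, x):  # x: (B, T, K, D)
        B, T, K, D = x.shape
        flat = x.reshape(B, T * K, D)  # flatten the time x type grid
        w = torch.softmax(self.score(flat), dim=1)  # joint normalisation
        return (w * flat).sum(dim=1)  # (B, D) combined embedding
```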