1 code implementation • 18 Jun 2023 • Yuchen Hu, Chen Chen, Ruizhe Li, Heqing Zou, Eng Siong Chng
In this paper, we aim to learn the shared representations across modalities to bridge their gap.
1 code implementation • 14 Jun 2023 • Meng Shen, Yizheng Huang, Jianxiong Yin, Heqing Zou, Deepu Rajan, Simon See
Our studies demonstrate that the proposed method achieves more balanced multimodal learning by avoiding greedy sample selection from the dominant modality.
1 code implementation • 16 May 2023 • Heqing Zou, Meng Shen, Chen Chen, Yuchen Hu, Deepu Rajan, Eng Siong Chng
Multimodal learning aims to imitate humans in acquiring complementary information from multiple modalities for various downstream tasks.
1 code implementation • 16 May 2023 • Yuchen Hu, Ruizhe Li, Chen Chen, Heqing Zou, Qiushi Zhu, Eng Siong Chng
However, most existing AVSR approaches simply fuse the audio and visual features by concatenation, without explicit interactions to capture the deep correlations between them, which results in sub-optimal multimodal representations for the downstream speech recognition task.
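The snippet above contrasts plain concatenation with explicit cross-modal interaction. Below is a minimal, hypothetical sketch of one such interaction, cross-attention, where each audio frame attends over the visual frames before fusion; it is an illustration of the general idea, not the paper's exact architecture.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def cross_attention_fuse(audio, visual):
    """Fuse audio (T_a x d) and visual (T_v x d) features.

    Unlike plain concatenation of pooled features, each audio frame
    computes attention weights over all visual frames and gathers an
    aligned visual context before the two are joined.
    """
    d = audio.shape[-1]
    scores = audio @ visual.T / np.sqrt(d)    # (T_a, T_v) affinities
    attn = softmax(scores, axis=-1)           # per-audio-frame weights over visual frames
    context = attn @ visual                   # (T_a, d) visual context aligned to audio
    return np.concatenate([audio, context], axis=-1)  # (T_a, 2d) fused features

audio = np.random.randn(50, 64)    # 50 audio frames, 64-dim
visual = np.random.randn(25, 64)   # 25 video frames, 64-dim
fused = cross_attention_fuse(audio, visual)
print(fused.shape)  # (50, 128)
```

The fused sequence keeps the audio time axis, so it can feed a standard speech-recognition decoder unchanged.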
no code implementations • 23 Feb 2023 • Chen Chen, Yuchen Hu, Heqing Zou, Linhui Sun, Eng Siong Chng
Deep neural network based speech enhancement approaches aim to learn a noisy-to-clean transformation using a supervised learning paradigm.
1 code implementation • 22 Feb 2023 • Yuchen Hu, Chen Chen, Heqing Zou, Xionghu Zhong, Eng Siong Chng
To alleviate this problem, we propose a novel network to unify speech enhancement and separation with gradient modulation to improve noise-robustness.
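"Gradient modulation" here refers to adjusting the two tasks' gradients so that one does not dominate or conflict with the other. A common form (PCGrad-style conflict projection) is sketched below as an assumed illustration; the paper's specific modulation rule may differ.

```python
import numpy as np

def modulate_gradients(g_main, g_aux):
    """Combine two task gradients, removing conflict.

    If the auxiliary gradient points against the main one (negative dot
    product), project out its conflicting component before summing, so
    the auxiliary task cannot undo progress on the main task.
    """
    dot = np.dot(g_main, g_aux)
    if dot < 0:
        g_aux = g_aux - dot / np.dot(g_main, g_main) * g_main
    return g_main + g_aux

g_enh = np.array([1.0, 0.0])    # toy enhancement gradient
g_sep = np.array([-1.0, 1.0])   # toy separation gradient, conflicting on axis 0
g = modulate_gradients(g_enh, g_sep)
print(g)  # [1. 1.] -- the conflicting component is removed
```

When the two gradients already agree (non-negative dot product), the rule reduces to a plain sum.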
no code implementations • 10 Dec 2022 • Chen Chen, Yuchen Hu, Qiang Zhang, Heqing Zou, Beier Zhu, Eng Siong Chng
Audio-visual speech recognition (AVSR) has gained remarkable success for ameliorating the noise-robustness of speech recognition.
no code implementations • 13 Apr 2022 • Chen Chen, Yuchen Hu, Nana Hou, Xiaofeng Qi, Heqing Zou, Eng Siong Chng
Although the automatic speech recognition (ASR) task has achieved remarkable success with sequence-to-sequence models, there are two main mismatches between training and testing that can degrade performance: 1) the commonly used cross-entropy criterion maximizes the log-likelihood of the training data, while performance is evaluated by word error rate (WER), not log-likelihood; 2) teacher forcing makes the model depend on the ground truth during training, so it is never exposed to its own predictions before testing.
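The second mismatch (exposure bias from teacher forcing) is often illustrated with scheduled sampling, where the decoder is sometimes fed its own prediction instead of the ground-truth token. The sketch below assumes a toy predictor and is only a stand-in for the paper's actual training scheme.

```python
import random

def scheduled_sampling_step(model_predict, ground_truth, teacher_forcing_prob):
    """Decode one sequence, mixing teacher forcing with self-feeding.

    At each step the next input is the ground-truth token with probability
    `teacher_forcing_prob`, otherwise the model's own prediction, exposing
    the model to its own outputs during training.
    """
    prev, output = "<sos>", []
    for t, gold in enumerate(ground_truth):
        pred = model_predict(prev, t)
        output.append(pred)
        prev = gold if random.random() < teacher_forcing_prob else pred
    return output

# Hypothetical toy predictor: just upper-cases the previous token.
toy_model = lambda prev, t: prev.upper()
out = scheduled_sampling_step(toy_model, ["a", "b", "c"], teacher_forcing_prob=1.0)
print(out)  # fully teacher-forced: ['<SOS>', 'A', 'B']
```

With `teacher_forcing_prob=1.0` this reduces to standard teacher forcing; annealing the probability toward 0 over training gradually exposes the model to its own predictions.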
1 code implementation • 29 Mar 2022 • Heqing Zou, Yuke Si, Chen Chen, Deepu Rajan, Eng Siong Chng
In this paper, we propose an end-to-end speech emotion recognition system using multi-level acoustic information with a newly designed co-attention module.
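A minimal way to picture attention-based fusion of multi-level acoustic information is to weight each feature branch (e.g. MFCC, spectrogram, and pretrained-model embeddings) by its affinity to a shared query and sum. This is an assumed simplification for illustration, not the paper's co-attention module itself.

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def attention_fuse(branches, query):
    """Fuse several same-dimensional branch embeddings into one vector.

    Each branch gets a scalar attention weight from its dot product with
    a shared query vector; the fused embedding is the weighted sum.
    """
    feats = np.stack(branches)        # (n_branches, d)
    weights = softmax(feats @ query)  # one weight per acoustic level
    return weights @ feats            # (d,) fused utterance embedding

# Toy branch embeddings standing in for multi-level acoustic features.
mfcc_emb = np.ones(4)
spec_emb = np.zeros(4)
fused = attention_fuse([mfcc_emb, spec_emb], query=np.ones(4))
print(fused.shape)  # (4,)
```

A learned co-attention module would additionally let branches condition on each other rather than on a single shared query, but the weighting-and-summing structure is the same.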
no code implementations • 29 Mar 2022 • Chen Chen, Nana Hou, Yuchen Hu, Heqing Zou, Xiaofeng Qi, Eng Siong Chng
Automated audio captioning (AAC) is a cross-modal task that generates natural language to describe the content of input audio.