Search Results for author: Zhou Zhao

Found 164 papers, 60 papers with code

Prior Knowledge and Memory Enriched Transformer for Sign Language Translation

no code implementations Findings (ACL) 2022 Tao Jin, Zhou Zhao, Meng Zhang, Xingshan Zeng

This paper attacks the challenging problem of sign language translation (SLT), which involves not only visual and textual understanding but also additional prior knowledge learning (i.e., performing style, syntax).

POS Sentence +2

WIA-LD2ND: Wavelet-based Image Alignment for Self-supervised Low-Dose CT Denoising

no code implementations 18 Mar 2024 Haoyu Zhao, Yuliang Gu, Zhou Zhao, Bo Du, Yongchao Xu, Rui Yu

Second, to better capture high-frequency components and detailed information, Frequency-Aware Multi-scale Loss (FAM) is proposed by effectively utilizing multi-scale feature space.

Image Denoising

MoreStyle: Relax Low-frequency Constraint of Fourier-based Image Reconstruction in Generalizable Medical Image Segmentation

no code implementations 18 Mar 2024 Haoyu Zhao, Wenhui Dong, Rui Yu, Zhou Zhao, Du Bo, Yongchao Xu

The task of single-source domain generalization (SDG) in medical image segmentation is crucial due to frequent domain shifts in clinical image datasets.

Data Augmentation Domain Generalization +4

Prompt-Singer: Controllable Singing-Voice-Synthesis with Natural Language Prompt

no code implementations 18 Mar 2024 Yongqi Wang, Ruofan Hu, Rongjie Huang, Zhiqing Hong, RuiQi Li, Wenrui Liu, Fuming You, Tao Jin, Zhou Zhao

Recent singing-voice-synthesis (SVS) methods have achieved remarkable audio quality and naturalness, yet they lack the capability to control the style attributes of the synthesized singing explicitly.

Attribute Singing Voice Synthesis

Unlocking the Potential of Multimodal Unified Discrete Representation through Training-Free Codebook Optimization and Hierarchical Alignment

1 code implementation 8 Mar 2024 Hai Huang, Yan Xia, Shengpeng Ji, Shulei Wang, Hanting Wang, Jieming Zhu, Zhenhua Dong, Zhou Zhao

The Dual Cross-modal Information Disentanglement (DCID) model, utilizing a unified codebook, shows promising results in achieving fine-grained representation and cross-modal generalization.

Disentanglement

A General and Flexible Multi-concept Parsing Framework for Multilingual Semantic Matching

no code implementations 5 Mar 2024 Dong Yao, Asaad Alghamdi, Qingrong Xia, Xiaoye Qu, Xinyu Duan, Zhefeng Wang, Yi Zheng, Baoxing Huai, Peilun Cheng, Zhou Zhao

Although DC-Match is a simple yet effective method for semantic matching, it highly depends on the external NER techniques to identify the keywords of sentences, which limits the performance of semantic matching for minor languages since satisfactory NER tools are usually hard to obtain.

Chatbot Community Question Answering +4

MobileSpeech: A Fast and High-Fidelity Framework for Mobile Zero-Shot Text-to-Speech

no code implementations 14 Feb 2024 Shengpeng Ji, Ziyue Jiang, Hanting Wang, Jialong Zuo, Zhou Zhao

Moreover, to bridge the gap between text and speech, we introduce a high-level probabilistic mask that simulates the progression of information flow from less to more during speech generation.

Voice Cloning

Real3D-Portrait: One-shot Realistic 3D Talking Portrait Synthesis

no code implementations 16 Jan 2024 Zhenhui Ye, Tianyun Zhong, Yi Ren, Jiaqi Yang, Weichuang Li, Jiawei Huang, Ziyue Jiang, Jinzheng He, Rongjie Huang, Jinglin Liu, Chen Zhang, Xiang Yin, Zejun Ma, Zhou Zhao

One-shot 3D talking portrait generation aims to reconstruct a 3D avatar from an unseen image, and then animate it with a reference video or audio to generate a talking portrait video.

3D Reconstruction Super-Resolution +1

TransFace: Unit-Based Audio-Visual Speech Synthesizer for Talking Head Translation

no code implementations 23 Dec 2023 Xize Cheng, Rongjie Huang, Linjun Li, Tao Jin, Zehan Wang, Aoxiong Yin, Minglei Li, Xinyu Duan, Changpeng Yang, Zhou Zhao

However, talking head translation, converting audio-visual speech (i.e., talking head video) from one language into another, still confronts several challenges compared to audio speech: (1) Existing methods invariably rely on cascading, synthesizing via both audio and text, resulting in delays and cascading errors.

Self-Supervised Learning Speech-to-Speech Translation +1

Multi-Modal Domain Adaptation Across Video Scenes for Temporal Video Grounding

no code implementations 21 Dec 2023 Haifeng Huang, Yang Zhao, Zehan Wang, Yan Xia, Zhou Zhao

Thus, to address this issue and enhance model performance on new scenes, we explore the TVG task in an unsupervised domain adaptation (UDA) setting across scenes for the first time, where the video-query pairs in the source scene (domain) are labeled with temporal boundaries, while those in the target scene are not.

Unsupervised Domain Adaptation Video Grounding

StyleSinger: Style Transfer for Out-of-Domain Singing Voice Synthesis

no code implementations 17 Dec 2023 Yu Zhang, Rongjie Huang, RuiQi Li, Jinzheng He, Yan Xia, Feiyang Chen, Xinyu Duan, Baoxing Huai, Zhou Zhao

Moreover, existing SVS methods encounter a decline in the quality of synthesized singing voices in OOD scenarios, as they rest upon the assumption that the target vocal attributes are discernible during the training phase.

Quantization Singing Voice Synthesis +1

Chat-3D v2: Bridging 3D Scene and Large Language Models with Object Identifiers

2 code implementations 13 Dec 2023 Haifeng Huang, Zehan Wang, Rongjie Huang, Luping Liu, Xize Cheng, Yang Zhao, Tao Jin, Zhou Zhao

These tokens capture the object's attributes and spatial relationships with surrounding objects in the 3D scene.

Attribute Object +1

Beyond Two-Tower Matching: Learning Sparse Retrievable Cross-Interactions for Recommendation

no code implementations 30 Nov 2023 Liangcai Su, Fan Yan, Jieming Zhu, Xi Xiao, Haoyi Duan, Zhou Zhao, Zhenhua Dong, Ruiming Tang

Two-tower models are a prevalent matching framework for recommendation, which have been widely deployed in industrial applications.

Retrieval

Cross-modal Prompts: Adapting Large Pre-trained Models for Audio-Visual Downstream Tasks

1 code implementation NeurIPS 2023 Haoyi Duan, Yan Xia, Mingze Zhou, Li Tang, Jieming Zhu, Zhou Zhao

This mechanism leverages audio and visual modalities as soft prompts to dynamically adjust the parameters of pre-trained models based on the current multi-modal input features.

Unsupervised Discovery of Interpretable Directions in h-space of Pre-trained Diffusion Models

no code implementations 15 Oct 2023 Zijian Zhang, Luping Liu, Zhijie Lin, Yichen Zhu, Zhou Zhao

We propose the first unsupervised and learning-based method to identify interpretable directions in h-space of pre-trained diffusion models.

Extending Multi-modal Contrastive Representations

1 code implementation 13 Oct 2023 Zehan Wang, Ziang Zhang, Luping Liu, Yang Zhao, Haifeng Huang, Tao Jin, Zhou Zhao

Inspired by recent C-MCR, this paper proposes Extending Multimodal Contrastive Representation (Ex-MCR), a training-efficient and paired-data-free method to flexibly learn unified contrastive representation space for more than three modalities by integrating the knowledge of existing MCR spaces.

3D Object Classification Representation Learning +1

Speech-to-Speech Translation with Discrete-Unit-Based Style Transfer

no code implementations 14 Sep 2023 Yongqi Wang, Jionghao Bai, Rongjie Huang, RuiQi Li, Zhiqing Hong, Zhou Zhao

Direct speech-to-speech translation (S2ST) with discrete self-supervised representations has achieved remarkable accuracy, but is unable to preserve the speaker timbre of the source speech during translation.

In-Context Learning Language Modelling +3

TextrolSpeech: A Text Style Control Speech Corpus With Codec Language Text-to-Speech Models

no code implementations 28 Aug 2023 Shengpeng Ji, Jialong Zuo, Minghui Fang, Ziyue Jiang, Feiyang Chen, Xinyu Duan, Baoxing Huai, Zhou Zhao

The dataset comprises 236,220 pairs of style prompts in natural text descriptions with five style factors and corresponding speech samples.

Language Modelling

Chat-3D: Data-efficiently Tuning Large Language Model for Universal Dialogue of 3D Scenes

1 code implementation 17 Aug 2023 Zehan Wang, Haifeng Huang, Yang Zhao, Ziang Zhang, Zhou Zhao

This paper presents Chat-3D, which combines the 3D visual perceptual ability of pre-trained 3D representations and the impressive reasoning and conversation capabilities of advanced LLMs to achieve the first universal dialogue systems for 3D scenes.

Language Modelling Large Language Model +1

3DRP-Net: 3D Relative Position-aware Network for 3D Visual Grounding

no code implementations 25 Jul 2023 Zehan Wang, Haifeng Huang, Yang Zhao, Linjun Li, Xize Cheng, Yichen Zhu, Aoxiong Yin, Zhou Zhao

3D visual grounding aims to localize the target object in a 3D point cloud by a free-form language description.

Object Position +3

DisCover: Disentangled Music Representation Learning for Cover Song Identification

no code implementations 19 Jul 2023 Jiahao Xun, Shengyu Zhang, Yanting Yang, Jieming Zhu, Liqun Deng, Zhou Zhao, Zhenhua Dong, RuiQi Li, Lichao Zhang, Fei Wu

We analyze the CSI task in a disentanglement view with the causal graph technique, and identify the intra-version and inter-version effects biasing the invariant learning.

Blocking Cover song identification +3

Distilling Coarse-to-Fine Semantic Matching Knowledge for Weakly Supervised 3D Visual Grounding

1 code implementation ICCV 2023 Zehan Wang, Haifeng Huang, Yang Zhao, Linjun Li, Xize Cheng, Yichen Zhu, Aoxiong Yin, Zhou Zhao

To accomplish this, we design a novel semantic matching model that analyzes the semantic similarity between object proposals and sentences in a coarse-to-fine manner.

Object Semantic Similarity +3

Mega-TTS 2: Boosting Prompting Mechanisms for Zero-Shot Speech Synthesis

no code implementations 14 Jul 2023 Ziyue Jiang, Jinglin Liu, Yi Ren, Jinzheng He, Zhenhui Ye, Shengpeng Ji, Qian Yang, Chen Zhang, Pengfei Wei, Chunfeng Wang, Xiang Yin, Zejun Ma, Zhou Zhao

However, the prompting mechanisms of zero-shot TTS still face challenges in the following aspects: 1) previous works of zero-shot TTS are typically trained with single-sentence prompts, which significantly restricts their performance when the data is relatively sufficient during the inference stage.

In-Context Learning Language Modelling +3

Gloss Attention for Gloss-free Sign Language Translation

1 code implementation CVPR 2023 Aoxiong Yin, Tianyun Zhong, Li Tang, Weike Jin, Tao Jin, Zhou Zhao

We find that it can provide two aspects of information for the model: 1) it can help the model implicitly learn the location of semantic boundaries in continuous sign language videos; 2) it can help the model understand the sign language video globally.

Gloss-free Sign Language Translation Language Modelling +4

MSSRNet: Manipulating Sequential Style Representation for Unsupervised Text Style Transfer

1 code implementation 12 Jun 2023 Yazheng Yang, Zhou Zhao, Qi Liu

Our proposed method addresses this issue by assigning individual style vector to each token in a text, allowing for fine-grained control and manipulation of the style strength.

Style Transfer Text Style Transfer +1

OpenSR: Open-Modality Speech Recognition via Maintaining Multi-Modality Alignment

1 code implementation 10 Jun 2023 Xize Cheng, Tao Jin, Linjun Li, Wang Lin, Xinyu Duan, Zhou Zhao

We demonstrate that OpenSR enables modality transfer from one to any in three different settings (zero-, few- and full-shot), and achieves highly competitive zero-shot performance compared to the existing few-shot and full-shot lip-reading methods.

Audio-Visual Speech Recognition Lip Reading +2

Mega-TTS: Zero-Shot Text-to-Speech at Scale with Intrinsic Inductive Bias

no code implementations 6 Jun 2023 Ziyue Jiang, Yi Ren, Zhenhui Ye, Jinglin Liu, Chen Zhang, Qian Yang, Shengpeng Ji, Rongjie Huang, Chunfeng Wang, Xiang Yin, Zejun Ma, Zhou Zhao

3) We further use a VQGAN-based acoustic model to generate the spectrogram and a latent code language model to fit the distribution of prosody, since prosody changes quickly over time in a sentence, and language models can capture both local and long-range dependencies.

Attribute Inductive Bias +3

Detector Guidance for Multi-Object Text-to-Image Generation

1 code implementation 4 Jun 2023 Luping Liu, Zijian Zhang, Yi Ren, Rongjie Huang, Xiang Yin, Zhou Zhao

Previous works identify the problem of information mixing in the CLIP text encoder and introduce the T5 text encoder or incorporate strong prior knowledge to assist with the alignment.

Object object-detection +2

Make-A-Voice: Unified Voice Synthesis With Discrete Representation

no code implementations 30 May 2023 Rongjie Huang, Chunlei Zhang, Yongqi Wang, Dongchao Yang, Luping Liu, Zhenhui Ye, Ziyue Jiang, Chao Weng, Zhou Zhao, Dong Yu

Various applications of voice synthesis have been developed independently despite the fact that they generate "voice" as output in common.

Singing Voice Synthesis Voice Conversion

Make-An-Audio 2: Temporal-Enhanced Text-to-Audio Generation

no code implementations 29 May 2023 Jiawei Huang, Yi Ren, Rongjie Huang, Dongchao Yang, Zhenhui Ye, Chen Zhang, Jinglin Liu, Xiang Yin, Zejun Ma, Zhou Zhao

Finally, we use LLMs to augment and transform a large amount of audio-label data into audio-text datasets to alleviate the problem of scarcity of temporal data.

Audio Generation Denoising +2

AV-TranSpeech: Audio-Visual Robust Speech-to-Speech Translation

no code implementations 24 May 2023 Rongjie Huang, Huadai Liu, Xize Cheng, Yi Ren, Linjun Li, Zhenhui Ye, Jinzheng He, Lichao Zhang, Jinglin Liu, Xiang Yin, Zhou Zhao

Direct speech-to-speech translation (S2ST) aims to convert speech from one language into another, and has demonstrated significant progress to date.

Speech-to-Speech Translation Translation

ViT-TTS: Visual Text-to-Speech with Scalable Diffusion Transformer

no code implementations 22 May 2023 Huadai Liu, Rongjie Huang, Xuan Lin, Wenqiang Xu, Maozong Zheng, Hong Chen, Jinzheng He, Zhou Zhao

To mitigate the data scarcity in learning visual acoustic information, we 1) introduce a self-supervised learning framework to enhance both the visual-text encoder and denoiser decoder; 2) leverage the diffusion transformer scalable in terms of parameters and capacity to learn visual scene information.

Denoising Self-Supervised Learning

Connecting Multi-modal Contrastive Representations

no code implementations NeurIPS 2023 Zehan Wang, Yang Zhao, Xize Cheng, Haifeng Huang, Jiageng Liu, Li Tang, Linjun Li, Yongqi Wang, Aoxiong Yin, Ziang Zhang, Zhou Zhao

This paper proposes a novel training-efficient method for learning MCR without paired data called Connecting Multi-modal Contrastive Representations (C-MCR).

3D Point Cloud Classification counterfactual +4

Wav2SQL: Direct Generalizable Speech-To-SQL Parsing

no code implementations 21 May 2023 Huadai Liu, Rongjie Huang, Jinzheng He, Gang Sun, Ran Shen, Xize Cheng, Zhou Zhao

Speech-to-SQL (S2SQL) aims to convert spoken questions into SQL queries given relational databases, which has been traditionally implemented in a cascaded manner while facing the following challenges: 1) model training is faced with the major issue of data scarcity, where limited parallel data is available; and 2) the systems should be robust enough to handle diverse out-of-domain speech samples that differ from the source data.

SQL Parsing

RMSSinger: Realistic-Music-Score based Singing Voice Synthesis

no code implementations 18 May 2023 Jinzheng He, Jinglin Liu, Zhenhui Ye, Rongjie Huang, Chenye Cui, Huadai Liu, Zhou Zhao

To tackle these challenges, we propose RMSSinger, the first RMS-SVS method, which takes realistic music scores as input, eliminating most of the tedious manual annotation and avoiding the aforementioned inconvenience.

Singing Voice Synthesis

AlignSTS: Speech-to-Singing Conversion via Cross-Modal Alignment

no code implementations 8 May 2023 RuiQi Li, Rongjie Huang, Lichao Zhang, Jinglin Liu, Zhou Zhao

The speech-to-singing (STS) voice conversion task aims to generate singing samples corresponding to speech recordings while facing a major challenge: the alignment between the target (singing) pitch contour and the source (speech) content is difficult to learn in a text-free situation.

STS Voice Conversion

Structure-CLIP: Towards Scene Graph Knowledge to Enhance Multi-modal Structured Representations

2 code implementations 6 May 2023 Yufeng Huang, Jiji Tang, Zhuo Chen, Rongsheng Zhang, Xinfeng Zhang, WeiJie Chen, Zeng Zhao, Zhou Zhao, Tangjie Lv, Zhipeng Hu, Wen Zhang

In this paper, we present an end-to-end framework Structure-CLIP, which integrates Scene Graph Knowledge (SGK) to enhance multi-modal structured representations.

Image-text matching Text Matching

ANetQA: A Large-scale Benchmark for Fine-grained Compositional Reasoning over Untrimmed Videos

1 code implementation CVPR 2023 Zhou Yu, Lixiang Zheng, Zhou Zhao, Fei Wu, Jianping Fan, Kui Ren, Jun Yu

A recent benchmark AGQA poses a promising paradigm to generate QA pairs automatically from pre-annotated scene graphs, enabling it to measure diverse reasoning abilities with granular control.

Question Answering Spatio-temporal Scene Graphs +1

Denoising Multi-modal Sequential Recommenders with Contrastive Learning

no code implementations 3 May 2023 Dong Yao, Shengyu Zhang, Zhou Zhao, Jieming Zhu, Wenqiao Zhang, Rui Zhang, Xiaofei He, Fei Wu

In contrast, modalities that do not cause users' behaviors are potential noises and might mislead the learning of a recommendation model.

Contrastive Learning Denoising +2

GeneFace++: Generalized and Stable Real-Time Audio-Driven 3D Talking Face Generation

no code implementations 1 May 2023 Zhenhui Ye, Jinzheng He, Ziyue Jiang, Rongjie Huang, Jiawei Huang, Jinglin Liu, Yi Ren, Xiang Yin, Zejun Ma, Zhou Zhao

Recently, neural radiance field (NeRF) has become a popular rendering technique in this field since it could achieve high-fidelity and 3D-consistent talking face generation with a few-minute-long training video.

motion prediction Talking Face Generation

AudioGPT: Understanding and Generating Speech, Music, Sound, and Talking Head

1 code implementation 25 Apr 2023 Rongjie Huang, Mingze Li, Dongchao Yang, Jiatong Shi, Xuankai Chang, Zhenhui Ye, Yuning Wu, Zhiqing Hong, Jiawei Huang, Jinglin Liu, Yi Ren, Zhou Zhao, Shinji Watanabe

In this work, we propose a multi-modal AI system named AudioGPT, which complements LLMs (i.e., ChatGPT) with 1) foundation models to process complex audio information and solve numerous understanding and generation tasks; and 2) the input/output interface (ASR, TTS) to support spoken dialogue.

Set-Based Face Recognition Beyond Disentanglement: Burstiness Suppression With Variance Vocabulary

no code implementations 13 Apr 2023 Jiong Wang, Zhou Zhao, Fei Wu

Thus we propose to separate the identity features from the variance features in a lightweight set-based disentanglement framework.

Disentanglement Face Recognition

MUG: A General Meeting Understanding and Generation Benchmark

1 code implementation 24 Mar 2023 Qinglin Zhang, Chong Deng, Jiaqing Liu, Hai Yu, Qian Chen, Wen Wang, Zhijie Yan, Jinglin Liu, Yi Ren, Zhou Zhao

To prompt SLP advancement, we establish a large-scale general Meeting Understanding and Generation Benchmark (MUG) to benchmark the performance of a wide range of SLP tasks, including topic segmentation, topic-level and session-level extractive summarization and topic title generation, keyphrase extraction, and action item detection.

Extractive Summarization Keyphrase Extraction +1

Overview of the ICASSP 2023 General Meeting Understanding and Generation Challenge (MUG)

no code implementations 24 Mar 2023 Qinglin Zhang, Chong Deng, Jiaqing Liu, Hai Yu, Qian Chen, Wen Wang, Zhijie Yan, Jinglin Liu, Yi Ren, Zhou Zhao

ICASSP2023 General Meeting Understanding and Generation Challenge (MUG) focuses on prompting a wide range of spoken language processing (SLP) research on meeting transcripts, as SLP applications are critical to improve users' efficiency in grasping important information in meetings.

Extractive Summarization Keyphrase Extraction

MixSpeech: Cross-Modality Self-Learning with Audio-Visual Stream Mixup for Visual Speech Translation and Recognition

2 code implementations ICCV 2023 Xize Cheng, Linjun Li, Tao Jin, Rongjie Huang, Wang Lin, Zehan Wang, Huangdai Liu, Ye Wang, Aoxiong Yin, Zhou Zhao

However, despite researchers exploring cross-lingual translation techniques such as machine translation and audio speech translation to overcome language barriers, there is still a shortage of cross-lingual studies on visual speech.

Lip Reading Machine Translation +4

ShiftDDPMs: Exploring Conditional Diffusion Models by Shifting Diffusion Trajectories

no code implementations 5 Feb 2023 Zijian Zhang, Zhou Zhao, Jun Yu, Qi Tian

In this paper, we propose a novel and flexible conditional diffusion model by introducing conditions into the forward process.

Denoising Image Generation

GeneFace: Generalized and High-Fidelity Audio-Driven 3D Talking Face Synthesis

1 code implementation 31 Jan 2023 Zhenhui Ye, Ziyue Jiang, Yi Ren, Jinglin Liu, Jinzheng He, Zhou Zhao

Generating photo-realistic video portrait with arbitrary speech audio is a crucial problem in film-making and virtual reality.

Lip Reading Talking Face Generation +1

Make-An-Audio: Text-To-Audio Generation with Prompt-Enhanced Diffusion Models

1 code implementation 30 Jan 2023 Rongjie Huang, Jiawei Huang, Dongchao Yang, Yi Ren, Luping Liu, Mingze Li, Zhenhui Ye, Jinglin Liu, Xiang Yin, Zhou Zhao

Its application to audio still lags behind for two main reasons: the lack of large-scale datasets with high-quality text-audio pairs, and the complexity of modeling long continuous audio data.

Audio Generation Text-to-Video Generation +1

Exploring Group Video Captioning with Efficient Relational Approximation

no code implementations ICCV 2023 Wang Lin, Tao Jin, Ye Wang, Wenwen Pan, Linjun Li, Xize Cheng, Zhou Zhao

In this study, we propose a new task, group video captioning, which aims to infer the desired content among a group of target videos and describe it with another group of related reference videos.

Video Captioning

WINNER: Weakly-Supervised hIerarchical decompositioN and aligNment for Spatio-tEmporal Video gRounding

no code implementations CVPR 2023 Mengze Li, Han Wang, Wenqiao Zhang, Jiaxu Miao, Zhou Zhao, Shengyu Zhang, Wei Ji, Fei Wu

WINNER first builds the language decomposition tree in a bottom-up manner, upon which the structural attention mechanism and top-down feature backtracking jointly build a multi-modal decomposition tree, permitting a hierarchical understanding of unstructured videos.

Contrastive Learning Spatio-Temporal Video Grounding +1

Open-Vocabulary Object Detection With an Open Corpus

no code implementations ICCV 2023 Jiong Wang, Huiming Zhang, Haiwen Hong, Xuan Jin, Yuan He, Hui Xue, Zhou Zhao

For the classification task, we introduce an open corpus classifier by reconstructing the original classifier with similar words in the text space.

Object object-detection +1

Unsupervised Representation Learning from Pre-trained Diffusion Probabilistic Models

2 code implementations 26 Dec 2022 Zijian Zhang, Zhou Zhao, Zhijie Lin

These imply that the gap corresponds to the lost information of the image, and we can reconstruct the image by filling the gap.

Image Reconstruction Representation Learning

DopplerBAS: Binaural Audio Synthesis Addressing Doppler Effect

no code implementations 14 Dec 2022 Jinglin Liu, Zhenhui Ye, Qian Chen, Siqi Zheng, Wen Wang, Qinglin Zhang, Zhou Zhao

Recently, binaural audio synthesis (BAS) has emerged as a promising research field for its applications in augmented and virtual realities.

Audio Synthesis

Diffusion Denoising Process for Perceptron Bias in Out-of-distribution Detection

1 code implementation 21 Nov 2022 Luping Liu, Yi Ren, Xize Cheng, Rongjie Huang, Chongxuan Li, Zhou Zhao

In this paper, we introduce a new perceptron bias assumption that suggests discriminator models are more sensitive to certain features of the input, leading to the overconfidence problem.

Denoising Out-of-Distribution Detection +1

VarietySound: Timbre-Controllable Video to Sound Generation via Unsupervised Information Disentanglement

no code implementations 19 Nov 2022 Chenye Cui, Yi Ren, Jinglin Liu, Rongjie Huang, Zhou Zhao

In this paper, we pose the task of generating sound with a specific timbre given a video input and a reference audio sample.

Disentanglement

Frame-Subtitle Self-Supervision for Multi-Modal Video Question Answering

no code implementations 8 Sep 2022 Jiong Wang, Zhou Zhao, Weike Jin

Multi-modal video question answering aims to predict the correct answer and localize the temporal boundary relevant to the question.

Question Answering Video Question Answering

Video-Guided Curriculum Learning for Spoken Video Grounding

1 code implementation 1 Sep 2022 Yan Xia, Zhou Zhao, Shangwei Ye, Yang Zhao, Haoyuan Li, Yi Ren

To rectify the discriminative phonemes and extract video-related information from noisy audio, we develop a novel video-guided curriculum learning (VGCL) during the audio pre-training process, which can make use of the vital visual perceptions to help understand the spoken language and suppress the external noise.

Video Grounding

AntCritic: Argument Mining for Free-Form and Visually-Rich Financial Comments

no code implementations 20 Aug 2022 Yang Zhao, Wenqiang Xu, Xuan Lin, Jingjing Huo, Hong Chen, Zhou Zhao

The task of argument mining aims to detect all possible argumentative components and identify their relationships automatically.

Argument Mining

Re4: Learning to Re-contrast, Re-attend, Re-construct for Multi-interest Recommendation

1 code implementation 17 Aug 2022 Shengyu Zhang, Lingxiao Yang, Dong Yao, Yujie Lu, Fuli Feng, Zhou Zhao, Tat-Seng Chua, Fei Wu

Specifically, Re4 encapsulates three backward flows, i.e., 1) Re-contrast, which drives each interest embedding to be distinct from other interests using contrastive learning; 2) Re-attend, which ensures the interest-item correlation estimation in the forward flow to be consistent with the criterion used in final recommendation; and 3) Re-construct, which ensures that each interest embedding can semantically reflect the information of representative items that relate to the corresponding interest.

Contrastive Learning Recommendation Systems

CCL4Rec: Contrast over Contrastive Learning for Micro-video Recommendation

no code implementations 17 Aug 2022 Shengyu Zhang, Bofang Li, Dong Yao, Fuli Feng, Jieming Zhu, Wenyan Fan, Zhou Zhao, Xiaofei He, Tat-Seng Chua, Fei Wu

Micro-video recommender systems suffer from the ubiquitous noises in users' behaviors, which might render the learned user representation indiscriminating, and lead to trivial recommendations (e.g., popular items) or even weird ones that are far beyond users' interests.

Contrastive Learning Recommendation Systems

ProDiff: Progressive Fast Diffusion Model For High-Quality Text-to-Speech

3 code implementations 13 Jul 2022 Rongjie Huang, Zhou Zhao, Huadai Liu, Jinglin Liu, Chenye Cui, Yi Ren

Through the preliminary study on diffusion model parameterization, we find that previous gradient-based TTS models require hundreds or thousands of iterations to guarantee high sample quality, which poses a challenge for accelerating sampling.

Denoising Knowledge Distillation +3

FastLTS: Non-Autoregressive End-to-End Unconstrained Lip-to-Speech Synthesis

no code implementations 8 Jul 2022 Yongqi Wang, Zhou Zhao

To tackle these problems, we propose FastLTS, a non-autoregressive end-to-end model which can directly synthesize high-quality speech audios from unconstrained talking videos with low latency, and has a relatively small model size.

Lip to Speech Synthesis Speech Synthesis

AntPivot: Livestream Highlight Detection via Hierarchical Attention Mechanism

no code implementations 10 Jun 2022 Yang Zhao, Xuan Lin, Wenqiang Xu, Maozong Zheng, Zhengyong Liu, Zhou Zhao

In recent years, streaming technology has greatly promoted the development of livestreaming.

Highlight Detection

Dict-TTS: Learning to Pronounce with Prior Dictionary Knowledge for Text-to-Speech

1 code implementation 5 Jun 2022 Ziyue Jiang, Zhe Su, Zhou Zhao, Qian Yang, Yi Ren, Jinglin Liu, Zhenhui Ye

This paper tackles the polyphone disambiguation problem from a concise and novel perspective: we propose Dict-TTS, a semantic-aware generative text-to-speech model with an online website dictionary (the existing prior information in the natural language).

Polyphone disambiguation

TranSpeech: Speech-to-Speech Translation With Bilateral Perturbation

1 code implementation 25 May 2022 Rongjie Huang, Jinglin Liu, Huadai Liu, Yi Ren, Lichao Zhang, Jinzheng He, Zhou Zhao

Specifically, a sequence of discrete representations derived in a self-supervised manner is predicted from the model and passed to a vocoder for speech reconstruction, while still facing the following challenges: 1) Acoustic multimodality: the discrete units derived from speech with the same content could be indeterministic due to the acoustic properties (e.g., rhythm, pitch, and energy), which causes deterioration of translation accuracy; 2) high latency: current S2ST systems utilize autoregressive models which predict each unit conditioned on the sequence previously generated, failing to take full advantage of parallelism.

Representation Learning Speech Synthesis +2

GenerSpeech: Towards Style Transfer for Generalizable Out-Of-Domain Text-to-Speech

2 code implementations 15 May 2022 Rongjie Huang, Yi Ren, Jinglin Liu, Chenye Cui, Zhou Zhao

Style transfer for out-of-domain (OOD) speech synthesis aims to generate speech samples with unseen style (e.g., speaker identity, emotion, and prosody) derived from an acoustic reference, while facing the following challenges: 1) The highly dynamic style features in expressive voice are difficult to model and transfer; and 2) the TTS models should be robust enough to handle diverse OOD conditions that differ from the source data.

Speech Synthesis Style Transfer +1

FastDiff: A Fast Conditional Diffusion Model for High-Quality Speech Synthesis

2 code implementations 21 Apr 2022 Rongjie Huang, Max W. Y. Lam, Jun Wang, Dan Su, Dong Yu, Yi Ren, Zhou Zhao

Also, FastDiff enables a sampling speed of 58x faster than real-time on a V100 GPU, making diffusion models practically applicable to speech synthesis deployment for the first time.

Ranked #7 on Text-To-Speech Synthesis on LJSpeech (using extra training data)

Denoising Speech Synthesis +2

Fine-Grained Predicates Learning for Scene Graph Generation

1 code implementation CVPR 2022 Xinyu Lyu, Lianli Gao, Yuyu Guo, Zhou Zhao, Hao Huang, Heng Tao Shen, Jingkuan Song

The performance of current Scene Graph Generation models is severely hampered by some hard-to-distinguish predicates, e.g., "woman-on/standing on/walking on-beach" or "woman-near/looking at/in front of-child".

Fine-Grained Image Classification Graph Generation +2

Contrastive Learning with Positive-Negative Frame Mask for Music Representation

no code implementations 17 Mar 2022 Dong Yao, Zhou Zhao, Shengyu Zhang, Jieming Zhu, Yudong Zhu, Rui Zhang, Xiuqiang He

We devise a novel contrastive learning objective to accommodate both self-augmented positives/negatives sampled from the same music.

Contrastive Learning Cover song identification +2

End-to-End Modeling via Information Tree for One-Shot Natural Language Spatial Video Grounding

no code implementations ACL 2022 Mengze Li, Tianbao Wang, Haoyu Zhang, Shengyu Zhang, Zhou Zhao, Jiaxu Miao, Wenqiao Zhang, Wenming Tan, Jin Wang, Peng Wang, ShiLiang Pu, Fei Wu

To achieve effective grounding under a limited annotation budget, we investigate one-shot video grounding, and learn to ground natural language in all video frames with solely one frame labeled, in an end-to-end manner.

Descriptive Representation Learning +1

Learning the Beauty in Songs: Neural Singing Voice Beautifier

3 code implementations ACL 2022 Jinglin Liu, Chengxi Li, Yi Ren, Zhiying Zhu, Zhou Zhao

Furthermore, we propose a latent-mapping algorithm in the latent space to convert the amateur vocal tone to the professional one.

Dynamic Time Warping

Revisiting Over-Smoothness in Text to Speech

no code implementations ACL 2022 Yi Ren, Xu Tan, Tao Qin, Zhou Zhao, Tie-Yan Liu

Then we conduct a comprehensive study on NAR-TTS models that use some advanced modeling methods.

VLAD-VSA: Cross-Domain Face Presentation Attack Detection with Vocabulary Separation and Adaptation

1 code implementation 21 Feb 2022 Jiong Wang, Zhou Zhao, Weike Jin, Xinyu Duan, Zhen Lei, Baoxing Huai, Yiling Wu, Xiaofei He

In this paper, the VLAD aggregation method is adopted to quantize local features with visual vocabulary locally partitioning the feature space, and hence preserve the local discriminability.

Face Presentation Attack Detection

ProsoSpeech: Enhancing Prosody With Quantized Vector Pre-training in Text-to-Speech

no code implementations 16 Feb 2022 Yi Ren, Ming Lei, Zhiying Huang, Shiliang Zhang, Qian Chen, Zhijie Yan, Zhou Zhao

Specifically, we first introduce a word-level prosody encoder, which quantizes the low-frequency band of the speech and compresses prosody attributes in the latent prosody vector (LPV).

MR-SVS: Singing Voice Synthesis with Multi-Reference Encoder

no code implementations 11 Jan 2022 Shoutong Wang, Jinglin Liu, Yi Ren, Zhen Wang, Changliang Xu, Zhou Zhao

However, they face several challenges: 1) the fixed-size speaker embedding is not powerful enough to capture full details of the target timbre; 2) single reference audio does not contain sufficient timbre information of the target speaker; 3) the pitch inconsistency between different speakers also leads to a degradation in the generated voice.

Singing Voice Synthesis

Wnet: Audio-Guided Video Object Segmentation via Wavelet-Based Cross-Modal Denoising Networks

1 code implementation CVPR 2022 Wenwen Pan, Haonan Shi, Zhou Zhao, Jieming Zhu, Xiuqiang He, Zhigeng Pan, Lianli Gao, Jun Yu, Fei Wu, Qi Tian

Audio-Guided video semantic segmentation is a challenging problem in visual analysis and editing, which automatically separates foreground objects from background in a video sequence according to the referring audio expressions.

Denoising Segmentation +3

MLSLT: Towards Multilingual Sign Language Translation

no code implementations CVPR 2022 Aoxiong Yin, Zhou Zhao, Weike Jin, Meng Zhang, Xingshan Zeng, Xiaofei He

In addition, we also explore zero-shot translation in sign language and find that our model can achieve comparable performance to the supervised BSLT model on some language pairs.

Sign Language Translation Translation

Cross-Modal Background Suppression for Audio-Visual Event Localization

1 code implementation CVPR 2022 Yan Xia, Zhou Zhao

Audiovisual Event (AVE) localization requires the model to jointly localize an event by observing audio and visual information.

audio-visual event localization

Multi-Singer: Fast Multi-Singer Singing Voice Vocoder With A Large-Scale Corpus

1 code implementation MM '21: Proceedings of the 29th ACM International Conference on Multimedia 2021 Rongjie Huang, Feiyang Chen, Yi Ren, Jinglin Liu, Chenye Cui, Zhou Zhao

High-fidelity multi-singer singing voice synthesis is challenging for neural vocoder due to the singing voice data shortage, limited singer generalization, and large computational cost.

Audio Generation Singing Voice Synthesis +1

SimulSLT: End-to-End Simultaneous Sign Language Translation

no code implementations 8 Dec 2021 Aoxiong Yin, Zhou Zhao, Jinglin Liu, Weike Jin, Meng Zhang, Xingshan Zeng, Xiaofei He

Sign language translation as a kind of technology with profound social significance has attracted growing researchers' interest in recent years.

Sign Language Translation Translation

Generalizable Multi-linear Attention Network

no code implementations NeurIPS 2021 Tao Jin, Zhou Zhao

The majority of existing multimodal sequential learning methods focus on how to obtain effective representations and ignore the importance of multimodal fusion.

Multimodal Sentiment Analysis Retrieval +1

FedSpeech: Federated Text-to-Speech with Continual Learning

no code implementations 14 Oct 2021 Ziyue Jiang, Yi Ren, Ming Lei, Zhou Zhao

Federated learning enables collaborative training of machine learning models under strict privacy restrictions and federated text-to-speech aims to synthesize natural speech of multiple users with a few audio training samples stored in their devices locally.

Continual Learning Federated Learning

Multi-trends Enhanced Dynamic Micro-video Recommendation

no code implementations 8 Oct 2021 Yujie Lu, Yingxuan Huang, Shengyu Zhang, Wei Han, Hui Chen, Zhou Zhao, Fei Wu

In this paper, we propose the DMR framework to explicitly model dynamic multi-trends of users' current preference and make predictions based on both the history and future potential trends.

Recommendation Systems

Stable Prediction on Graphs with Agnostic Distribution Shift

no code implementations 8 Oct 2021 Shengyu Zhang, Kun Kuang, Jiezhong Qiu, Jin Yu, Zhou Zhao, Hongxia Yang, Zhongfei Zhang, Fei Wu

The results demonstrate that our method outperforms various SOTA GNNs for stable prediction on graphs with agnostic distribution shift, including shift caused by node labels and attributes.

Graph Learning Recommendation Systems

Test-time Batch Statistics Calibration for Covariate Shift

no code implementations 6 Oct 2021 Fuming You, Jingjing Li, Zhou Zhao

A previous solution is test-time normalization, which substitutes the source statistics in BN layers with the target batch statistics.

Domain Generalization Image Classification +3

PortaSpeech: Portable and High-Quality Generative Text-to-Speech

3 code implementations NeurIPS 2021 Yi Ren, Jinglin Liu, Zhou Zhao

Non-autoregressive text-to-speech (NAR-TTS) models such as FastSpeech 2 and Glow-TTS can synthesize high-quality speech from the given text in parallel.

Text-To-Speech Synthesis Vocal Bursts Intensity Prediction +1

SynCLR: A Synthesis Framework for Contrastive Learning of out-of-domain Speech Representations

no code implementations 29 Sep 2021 Rongjie Huang, Max W. Y. Lam, Jun Wang, Dan Su, Dong Yu, Zhou Zhao, Yi Ren

Learning generalizable speech representations for unseen samples in different domains has been a challenge with ever increasing importance to date.

Contrastive Learning Data Augmentation +4

ST-DDPM: Explore Class Clustering for Conditional Diffusion Probabilistic Models

no code implementations 29 Sep 2021 Zhijie Lin, Zijian Zhang, Zhou Zhao

Score-based generative models involve sequentially corrupting the data distribution with noise and then learning to recover the data distribution based on score matching.

Clustering Conditional Image Generation

PhaseFool: Phase-oriented Audio Adversarial Examples via Energy Dissipation

no code implementations 29 Sep 2021 Ziyue Jiang, Yi Ren, Zhou Zhao

In this work, we propose a novel phase-oriented algorithm named PhaseFool that can efficiently construct imperceptible audio adversarial examples with energy dissipation.

Automatic Speech Recognition Automatic Speech Recognition (ASR) +2

Why Do We Click: Visual Impression-aware News Recommendation

1 code implementation 26 Sep 2021 Jiahao Xun, Shengyu Zhang, Zhou Zhao, Jieming Zhu, Qi Zhang, Jingjie Li, Xiuqiang He, Xiaofei He, Tat-Seng Chua, Fei Wu

In this work, inspired by the fact that users make their click decisions mostly based on the visual impression they perceive when browsing news, we propose to capture such visual impression information with visual-semantic modeling for news recommendation.

Decision Making News Recommendation

CauseRec: Counterfactual User Sequence Synthesis for Sequential Recommendation

no code implementations 11 Sep 2021 Shengyu Zhang, Dong Yao, Zhou Zhao, Tat-Seng Chua, Fei Wu

In this paper, we propose to learn accurate and robust user representations, which are required to be less sensitive to (attack on) noisy behaviors and trust more on the indispensable ones, by modeling counterfactual data distribution.

counterfactual Representation Learning +1

SimulLR: Simultaneous Lip Reading Transducer with Attention-Guided Adaptive Memory

no code implementations 31 Aug 2021 Zhijie Lin, Zhou Zhao, Haoyuan Li, Jinglin Liu, Meng Zhang, Xingshan Zeng, Xiaofei He

Lip reading, aiming to recognize spoken sentences according to the given video of lip movements without relying on the audio stream, has attracted great interest due to its application in many scenarios.

Lip Reading

Parallel and High-Fidelity Text-to-Lip Generation

1 code implementation 14 Jul 2021 Jinglin Liu, Zhiying Zhu, Yi Ren, Wencan Huang, Baoxing Huai, Nicholas Yuan, Zhou Zhao

However, the AR decoding manner generates current lip frame conditioned on frames generated previously, which inherently hinders the inference speed, and also has a detrimental effect on the quality of generated lip frames due to error propagation.

Talking Face Generation Text-to-Face Generation +1

Cascaded Prediction Network via Segment Tree for Temporal Video Grounding

no code implementations CVPR 2021 Yang Zhao, Zhou Zhao, Zhu Zhang, Zhijie Lin

Temporal video grounding aims to localize the target segment which is semantically aligned with the given sentence in an untrimmed video.

Sentence Video Grounding

Multi-Modal Relational Graph for Cross-Modal Video Moment Retrieval

no code implementations CVPR 2021 Yawen Zeng, Da Cao, Xiaochi Wei, Meng Liu, Zhou Zhao, Zheng Qin

Toward this end, we contribute a multi-modal relational graph to capture the interactions among objects from the visual and textual content to identify the differences among similar video moment candidates.

Cross-Modal Retrieval Graph Matching +4

EMOVIE: A Mandarin Emotion Speech Dataset with a Simple Emotional Text-to-Speech Model

no code implementations 17 Jun 2021 Chenye Cui, Yi Ren, Jinglin Liu, Feiyang Chen, Rongjie Huang, Ming Lei, Zhou Zhao

Finally, by showing comparable performance on the emotional speech synthesis task, we successfully demonstrate the ability of the proposed model.

Emotional Speech Synthesis Emotion Classification

Learning to Rehearse in Long Sequence Memorization

no code implementations 2 Jun 2021 Zhu Zhang, Chang Zhou, Jianxin Ma, Zhijie Lin, Jingren Zhou, Hongxia Yang, Zhou Zhao

Further, we design a history sampler to select informative fragments for rehearsal training, making the memory focus on the crucial information.

Memorization Question Answering +1

DiffSinger: Singing Voice Synthesis via Shallow Diffusion Mechanism

7 code implementations 6 May 2021 Jinglin Liu, Chengxi Li, Yi Ren, Feiyang Chen, Zhou Zhao

Singing voice synthesis (SVS) systems are built to synthesize high-quality and expressive singing voice, in which the acoustic model generates the acoustic features (e.g., mel-spectrogram) given a music score.

Generative Adversarial Network Singing Voice Synthesis +1

Cortical Surface Shape Analysis Based on Alexandrov Polyhedra

no code implementations ICCV 2021 Min Zhang, Yang Guo, Na lei, Zhou Zhao, Jianfeng Wu, Xiaoyin Xu, Yalin Wang, Xianfeng GU

Shape analysis has been playing an important role in early diagnosis and prognosis of neurodegenerative diseases such as Alzheimer's diseases (AD).

To Learn Effective Features: Understanding the Task-Specific Adaptation of MAML

no code implementations 1 Jan 2021 Zhijie Lin, Zhou Zhao, Zhu Zhang, Huai Baoxing, Jing Yuan

Model-Agnostic Meta-Learning (MAML) (Finn et al., 2017) is one of the most well-known gradient-based meta-learning algorithms, which learns the meta-initialization through the inner and outer optimization loops.

Contrastive Learning Meta-Learning

Continual Memory: Can We Reason After Long-Term Memorization?

no code implementations 1 Jan 2021 Zhu Zhang, Chang Zhou, Zhou Zhao, Zhijie Lin, Jingren Zhou, Hongxia Yang

Existing reasoning tasks often follow the setting of "reasoning while experiencing", which has an important assumption that the raw contents can be always accessed while reasoning.

Memorization

Counterfactual Contrastive Learning for Weakly-Supervised Vision-Language Grounding

no code implementations NeurIPS 2020 Zhu Zhang, Zhou Zhao, Zhijie Lin, Jieming Zhu, Xiuqiang He

Weakly-supervised vision-language grounding aims to localize a target moment in a video or a specific region in an image according to the given sentence query, where only video-level or image-level sentence annotations are provided during training.

Contrastive Learning counterfactual +2

Future-Aware Diverse Trends Framework for Recommendation

1 code implementation 1 Nov 2020 Yujie Lu, Shengyu Zhang, Yingxuan Huang, Luyao Wang, Xinyao Yu, Zhou Zhao, Fei Wu

By diverse trends, supposing the future preferences can be diversified, we propose the diverse trends extractor and the time-aware mechanism to represent the possible trends of preferences for a given user with multiple vectors.

Representation Learning Sequential Recommendation

MGD-GAN: Text-to-Pedestrian generation through Multi-Grained Discrimination

no code implementations 2 Oct 2020 Shengyu Zhang, Donghui Wang, Zhou Zhao, Siliang Tang, Di Xie, Fei Wu

In this paper, we investigate the problem of text-to-pedestrian synthesis, which has many potential applications in art, design, and video surveillance.

Generative Adversarial Network Image Generation

PopMAG: Pop Music Accompaniment Generation

1 code implementation 18 Aug 2020 Yi Ren, Jinzheng He, Xu Tan, Tao Qin, Zhou Zhao, Tie-Yan Liu

To improve harmony, in this paper, we propose a novel MUlti-track MIDI representation (MuMIDI), which enables simultaneous multi-track generation in a single sequence and explicitly models the dependency of the notes from different tracks.

Music Modeling

Poet: Product-oriented Video Captioner for E-commerce

1 code implementation 16 Aug 2020 Shengyu Zhang, Ziqi Tan, Jin Yu, Zhou Zhao, Kun Kuang, Jie Liu, Jingren Zhou, Hongxia Yang, Fei Wu

Then, based on the aspects of the video-associated product, we perform knowledge-enhanced spatial-temporal inference on those graphs for capturing the dynamic change of fine-grained product-part characteristics.

Video Captioning

Object-Aware Multi-Branch Relation Networks for Spatio-Temporal Video Grounding

no code implementations 16 Aug 2020 Zhu Zhang, Zhou Zhao, Zhijie Lin, Baoxing Huai, Nicholas Jing Yuan

Spatio-temporal video grounding aims to retrieve the spatio-temporal tube of a queried object according to the given sentence.

Object Relation +4

DeVLBert: Learning Deconfounded Visio-Linguistic Representations

1 code implementation 16 Aug 2020 Shengyu Zhang, Tan Jiang, Tan Wang, Kun Kuang, Zhou Zhao, Jianke Zhu, Jin Yu, Hongxia Yang, Fei Wu

In this paper, we propose to investigate the problem of out-of-domain visio-linguistic pretraining, where the pretraining data distribution differs from that of downstream data on which the pretrained model will be fine-tuned.

Image Retrieval Question Answering +2

FastLR: Non-Autoregressive Lipreading Model with Integrate-and-Fire

no code implementations 6 Aug 2020 Jinglin Liu, Yi Ren, Zhou Zhao, Chen Zhang, Baoxing Huai, Nicholas Jing Yuan

NAR lipreading is a challenging task with many difficulties: 1) the discrepancy of sequence lengths between source and target makes it difficult to estimate the length of the output sequence; 2) the conditionally independent behavior of NAR generation lacks the correlation across time, which leads to a poor approximation of the target distribution; 3) the feature representation ability of the encoder can be weak due to the lack of an effective alignment mechanism; and 4) the removal of the AR language model exacerbates the inherent ambiguity problem of lipreading.

Language Modelling Lipreading

DeepSinger: Singing Voice Synthesis with Data Mined From the Web

no code implementations 9 Jul 2020 Yi Ren, Xu Tan, Tao Qin, Jian Luan, Zhou Zhao, Tie-Yan Liu

DeepSinger has several advantages over previous SVS systems: 1) to the best of our knowledge, it is the first SVS system that directly mines training data from music websites, 2) the lyrics-to-singing alignment model further avoids any human efforts for alignment labeling and greatly reduces labeling cost, 3) the singing model based on a feed-forward Transformer is simple and efficient, by removing the complicated acoustic feature modeling in parametric synthesis and leveraging a reference encoder to capture the timbre of a singer from noisy singing data, and 4) it can synthesize singing voices in multiple languages and multiple singers.

Sentence Singing Voice Synthesis

SimulSpeech: End-to-End Simultaneous Speech to Text Translation

no code implementations ACL 2020 Yi Ren, Jinglin Liu, Xu Tan, Chen Zhang, Tao Qin, Zhou Zhao, Tie-Yan Liu

In this work, we develop SimulSpeech, an end-to-end simultaneous speech to text translation system which translates speech in source language to text in target language concurrently.

Automatic Speech Recognition Automatic Speech Recognition (ASR) +7

Comprehensive Information Integration Modeling Framework for Video Titling

1 code implementation 24 Jun 2020 Shengyu Zhang, Ziqi Tan, Jin Yu, Zhou Zhao, Kun Kuang, Tan Jiang, Jingren Zhou, Hongxia Yang, Fei Wu

In e-commerce, consumer-generated videos, which in general deliver consumers' individual preferences for the different aspects of certain products, are massive in volume.

Descriptive Video Captioning

FastSpeech 2: Fast and High-Quality End-to-End Text to Speech

32 code implementations ICLR 2021 Yi Ren, Chenxu Hu, Xu Tan, Tao Qin, Sheng Zhao, Zhou Zhao, Tie-Yan Liu

In this paper, we propose FastSpeech 2, which addresses the issues in FastSpeech and better solves the one-to-many mapping problem in TTS by 1) directly training the model with the ground-truth target instead of the simplified output from the teacher, and 2) introducing more variation information of speech (e.g., pitch, energy and more accurate duration) as conditional inputs.

Ranked #6 on Text-To-Speech Synthesis on LJSpeech (using extra training data)

Knowledge Distillation Speech Synthesis +1

A Study of Non-autoregressive Model for Sequence Generation

no code implementations ACL 2020 Yi Ren, Jinglin Liu, Xu Tan, Zhou Zhao, Sheng Zhao, Tie-Yan Liu

In this work, we conduct a study to understand the difficulty of NAR sequence generation and try to answer: (1) Why NAR models can catch up with AR models in some tasks but not all?

Automatic Speech Recognition Automatic Speech Recognition (ASR) +4

A Generic Network Compression Framework for Sequential Recommender Systems

1 code implementation 21 Apr 2020 Yang Sun, Fajie Yuan, Min Yang, Guoao Wei, Zhou Zhao, Duo Liu

Current state-of-the-art sequential recommender models are typically based on a sandwich-structured deep neural network, where one or more middle (hidden) layers are placed between the input embedding layer and output softmax layer.

Sequential Recommendation

Grounded and Controllable Image Completion by Incorporating Lexical Semantics

no code implementations 29 Feb 2020 Shengyu Zhang, Tan Jiang, Qinghao Huang, Ziqi Tan, Zhou Zhao, Siliang Tang, Jin Yu, Hongxia Yang, Yi Yang, Fei Wu

The existing image completion procedure is highly subjective, considering only visual context, which may trigger unpredictable results that are plausible but not faithful to grounded knowledge.

Convolutional Hierarchical Attention Network for Query-Focused Video Summarization

2 code implementations 31 Jan 2020 Shuwen Xiao, Zhou Zhao, Zijian Zhang, Xiaohui Yan, Min Yang

This paper addresses the task of query-focused video summarization, which takes user's query and a long video as inputs and aims to generate a query-focused video summary.

Video Summarization

Bi-Decoder Augmented Network for Neural Machine Translation

no code implementations 14 Jan 2020 Boyuan Pan, Yazheng Yang, Zhou Zhao, Yueting Zhuang, Deng Cai

Neural Machine Translation (NMT) has become a popular technology in recent years, and the encoder-decoder framework is the mainstream among all the methods.

Machine Translation NMT +1

Weakly-Supervised Video Moment Retrieval via Semantic Completion Network

no code implementations 19 Nov 2019 Zhijie Lin, Zhou Zhao, Zhu Zhang, Qi. Wang, Huasheng Liu

Video moment retrieval is to search the moment that is most relevant to the given natural language query.

Moment Retrieval Retrieval +2

Video Dialog via Progressive Inference and Cross-Transformer

no code implementations IJCNLP 2019 Weike Jin, Zhou Zhao, Mao Gu, Jun Xiao, Furu Wei, Yueting Zhuang

Video dialog is a new and challenging task, which requires the agent to answer questions combining video information with dialog history.

Answer Generation Question Answering +4

Personalized Hashtag Recommendation for Micro-videos

1 code implementation 27 Aug 2019 Yinwei Wei, Zhiyong Cheng, Xuzheng Yu, Zhou Zhao, Lei Zhu, Liqiang Nie

The hashtags that a user provides to a post (e.g., a micro-video) are the ones which, in her mind, best describe the post content she is interested in.

Weak Supervision Enhanced Generative Network for Question Generation

no code implementations 1 Jul 2019 Yutong Wang, Jiyuan Zheng, Qijiong Liu, Zhou Zhao, Jun Xiao, Yueting Zhuang

More specifically, we devise a discriminator, Relation Guider, to capture the relations between the whole passage and the associated answer and then the Multi-Interaction mechanism is deployed to transfer the knowledge dynamically for our question generation system.

Question Answering Question Generation +1

Localizing Unseen Activities in Video via Image Query

no code implementations 28 Jun 2019 Zhu Zhang, Zhou Zhao, Zhijie Lin, Jingkuan Song, Deng Cai

Thus, we consider a new task to localize unseen activities in videos via image queries, named Image-Based Activity Localization.

Action Localization Video Understanding

Open-Ended Long-Form Video Question Answering via Hierarchical Convolutional Self-Attention Networks

no code implementations 28 Jun 2019 Zhu Zhang, Zhou Zhao, Zhijie Lin, Jingkuan Song, Xiaofei He

Concretely, we first develop a hierarchical convolutional self-attention encoder to efficiently model long-form video contents, which builds the hierarchical structure for video sequences and captures question-aware long-range dependencies from video context.

Answer Generation Question Answering +1

Beyond Product Quantization: Deep Progressive Quantization for Image Retrieval

1 code implementation 16 Jun 2019 Lianli Gao, Xiaosu Zhu, Jingkuan Song, Zhou Zhao, Heng Tao Shen

In this work, we propose a deep progressive quantization (DPQ) model, as an alternative to PQ, for large scale image retrieval.

Image Retrieval Quantization +1

Cross-Modal Interaction Networks for Query-Based Moment Retrieval in Videos

1 code implementation 6 Jun 2019 Zhu Zhang, Zhijie Lin, Zhou Zhao, Zhenxin Xiao

Query-based moment retrieval aims to localize the most relevant moment in an untrimmed video according to the given natural language query.

Moment Retrieval Natural Language Queries +2

FastSpeech: Fast, Robust and Controllable Text-to-Speech

11 code implementations 22 May 2019 Yi Ren, Yangjun Ruan, Xu Tan, Tao Qin, Sheng Zhao, Zhou Zhao, Tie-Yan Liu

Compared with traditional concatenative and statistical parametric approaches, neural network-based end-to-end models suffer from slow inference speed, and the synthesized speech is usually not robust (i.e., some words are skipped or repeated) and lacks controllability (voice speed or prosody control).

Text-To-Speech Synthesis

FastSpeech: Fast, Robust and Controllable Text to Speech

21 code implementations NeurIPS 2019 Yi Ren, Yangjun Ruan, Xu Tan, Tao Qin, Sheng Zhao, Zhou Zhao, Tie-Yan Liu

In this work, we propose a novel feed-forward network based on Transformer to generate mel-spectrogram in parallel for TTS.

Ranked #10 on Text-To-Speech Synthesis on LJSpeech (using extra training data)

Speech Synthesis Text-To-Speech Synthesis

Almost Unsupervised Text to Speech and Automatic Speech Recognition

no code implementations 13 May 2019 Yi Ren, Xu Tan, Tao Qin, Sheng Zhao, Zhou Zhao, Tie-Yan Liu

Text to speech (TTS) and automatic speech recognition (ASR) are two dual tasks in speech processing, and both achieve impressive performance thanks to recent advances in deep learning and large amounts of aligned speech and text data.

Automatic Speech Recognition Automatic Speech Recognition (ASR) +3

Multilingual Neural Machine Translation with Knowledge Distillation

1 code implementation ICLR 2019 Xu Tan, Yi Ren, Di He, Tao Qin, Zhou Zhao, Tie-Yan Liu

Multilingual machine translation, which translates multiple languages with a single model, has attracted much attention due to its efficiency of offline training and online serving.

Knowledge Distillation Machine Translation +1

Improving Automatic Source Code Summarization via Deep Reinforcement Learning

2 code implementations 17 Nov 2018 Yao Wan, Zhou Zhao, Min Yang, Guandong Xu, Haochao Ying, Jian Wu, Philip S. Yu

To the best of our knowledge, most state-of-the-art approaches follow an encoder-decoder framework which encodes the code into a hidden space and then decodes it into natural language space, suffering from two major drawbacks: a) their encoders only consider the sequential content of code, ignoring the tree structure which is also critical for the task of code summarization; b) their decoders are typically trained to predict the next word by maximizing the likelihood of the next ground-truth word given the previous ground-truth words.

Code Summarization reinforcement-learning +3

Improved Dynamic Memory Network for Dialogue Act Classification with Adversarial Training

no code implementations 12 Nov 2018 Yao Wan, Wenqiang Yan, Jianwei Gao, Zhou Zhao, Jian Wu, Philip S. Yu

Dialogue Act (DA) classification is a challenging problem in dialogue interpretation, which aims to attach semantic labels to utterances and characterize the speaker's intention.

Classification Dialogue Act Classification +3

Textually Guided Ranking Network for Attentional Image Retweet Modeling

no code implementations 24 Oct 2018 Zhou Zhao, Hanbing Zhan, Lingtao Meng, Jun Xiao, Jun Yu, Min Yang, Fei Wu, Deng Cai

In this paper, we study the problem of image retweet prediction in social media, which predicts the image-sharing behavior in which users repost the image tweets from their followees.

Dialogue Act Recognition via CRF-Attentive Structured Network

no code implementations SIGIR 2018 Zheqian Chen, Rongqin Yang, Zhou Zhao, Deng Cai, Xiaofei He

Dialogue Act Recognition (DAR) is a challenging problem in dialogue interpretation, which aims to attach semantic labels to utterances and characterize the speaker's intention.

Dialogue Act Classification Dialogue Interpretation +1

Keyword-based Query Comprehending via Multiple Optimized-Demand Augmentation

no code implementations 1 Nov 2017 Boyuan Pan, Hao Li, Zhou Zhao, Deng Cai, Xiaofei He

In this paper, we propose a novel neural network system that consists of a Demand Optimization Model based on passage-attention neural machine translation and a Reader Model that can find the answer given the optimized question.

Machine Reading Comprehension Machine Translation +2

Smarnet: Teaching Machines to Read and Comprehend Like Human

no code implementations 8 Oct 2017 Zheqian Chen, Rongqin Yang, Bin Cao, Zhou Zhao, Deng Cai, Xiaofei He

Machine Comprehension (MC) is a challenging task in the field of Natural Language Processing, which aims to guide the machine to comprehend a passage and answer the given question.

Question Answering Reading Comprehension +1

Identifying and Tracking Sentiments and Topics from Social Media Texts during Natural Disasters

no code implementations EMNLP 2017 Min Yang, Jincheng Mei, Heng Ji, Wei Zhao, Zhou Zhao, Xiaojun Chen

We study the problem of identifying the topics and sentiments and tracking their shifts from social media texts in different geographical regions during emergencies and disasters.

Topic Models

Video Question Answering via Attribute-Augmented Attention Network Learning

no code implementations 20 Jul 2017 Yunan Ye, Zhou Zhao, Yimeng Li, Long Chen, Jun Xiao, Yueting Zhuang

Video Question Answering is a challenging problem in visual information retrieval, which provides the answer to the referenced video content according to the question.

Attribute Information Retrieval +6

The Forgettable-Watcher Model for Video Question Answering

no code implementations 3 May 2017 Hongyang Xue, Zhou Zhao, Deng Cai

Then we propose a TGIF-QA dataset for video question answering with the help of automatic question generation.

Question Answering Question Generation +3

Question Retrieval for Community-based Question Answering via Heterogeneous Network Integration Learning

no code implementations 24 Nov 2016 Zheqian Chen, Chi Zhang, Zhou Zhao, Deng Cai

The challenges in this task are the lexical gaps between questions caused by word ambiguity and word mismatch.

Question Answering Retrieval
