Search Results for author: Sefik Emre Eskimez

Found 20 papers, 2 papers with code

Making Flow-Matching-Based Zero-Shot Text-to-Speech Laugh as You Like

no code implementations12 Feb 2024 Naoyuki Kanda, Xiaofei Wang, Sefik Emre Eskimez, Manthan Thakker, Hemin Yang, Zirun Zhu, Min Tang, Canrun Li, Chung-Hsien Tsai, Zhen Xiao, Yufei Xia, Jinzhu Li, Yanqing Liu, Sheng Zhao, Michael Zeng

In this work, we propose ELaTE, a zero-shot TTS that can generate natural laughing speech of any speaker based on a short audio prompt with precise control of laughter timing and expression.
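
As a rough illustration of what "precise control of laughter timing" could mean at the input level, here is a hypothetical frame-level conditioning signal; the function and shapes are our own illustration, not ELaTE's actual interface:

```python
# Hypothetical sketch (not the paper's code): laughter timing control could
# be represented as a per-frame indicator passed alongside the audio prompt.
import numpy as np

def build_laughter_condition(num_frames, laugh_start, laugh_end):
    """Binary per-frame indicator marking where laughter should occur."""
    cond = np.zeros(num_frames, dtype=np.float32)
    cond[laugh_start:laugh_end] = 1.0  # laughter requested in this span
    return cond

# e.g. 200 frames total, laughter requested between frames 120 and 160
laugh_cond = build_laughter_condition(200, 120, 160)
```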

SpeechX: Neural Codec Language Model as a Versatile Speech Transformer

no code implementations14 Aug 2023 Xiaofei Wang, Manthan Thakker, Zhuo Chen, Naoyuki Kanda, Sefik Emre Eskimez, Sanyuan Chen, Min Tang, Shujie Liu, Jinyu Li, Takuya Yoshioka

Recent advancements in generative speech models based on audio-text prompts have enabled remarkable innovations like high-quality zero-shot text-to-speech.

Language Modelling · Multi-Task Learning +2
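
The "versatile" aspect suggests one model multiplexing several tasks. Below is a hypothetical sketch of task-token prompting; the token names and function are invented for illustration and are not SpeechX's actual vocabulary or API:

```python
# Illustrative only: a single codec language model can multiplex tasks by
# prepending a task token to the text/audio prompt. Token names are made up.
TASK_TOKENS = {"zero_shot_tts": "<tts>", "noise_suppression": "<ns>",
               "speech_editing": "<edit>"}

def build_prompt(task, text_tokens, enrollment_codec_tokens):
    """Concatenate task token, text tokens, and acoustic prompt tokens."""
    return [TASK_TOKENS[task]] + list(text_tokens) + list(enrollment_codec_tokens)

prompt = build_prompt("zero_shot_tts", ["<t1>", "<t2>"], ["<a1>", "<a2>"])
```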

Real-Time Audio-Visual End-to-End Speech Enhancement

no code implementations13 Mar 2023 Zirun Zhu, Hemin Yang, Min Tang, ZiYi Yang, Sefik Emre Eskimez, Huaming Wang

In this paper, we propose a low-latency real-time audio-visual end-to-end enhancement (AV-E3Net) model based on the recently proposed end-to-end enhancement network (E3Net).

Speech Enhancement · Task 2
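
A minimal sketch of audio-visual fusion for a streaming enhancer, assuming per-frame concatenation of audio and lip-video embeddings ahead of a causal recurrent backbone; this mirrors common AV enhancement designs rather than AV-E3Net's exact architecture:

```python
# Minimal sketch, assuming frame-synchronous audio and video embeddings
# fused by concatenation before a causal (low-latency) recurrent layer.
import torch
import torch.nn as nn

class AVFusion(nn.Module):
    def __init__(self, audio_dim=256, video_dim=128, hidden=256):
        super().__init__()
        self.fuse = nn.Linear(audio_dim + video_dim, hidden)
        self.rnn = nn.LSTM(hidden, hidden, batch_first=True)  # causal backbone

    def forward(self, audio_feats, video_feats):
        # audio_feats: (B, T, audio_dim); video_feats: (B, T, video_dim)
        x = torch.cat([audio_feats, video_feats], dim=-1)
        x = torch.relu(self.fuse(x))
        out, _ = self.rnn(x)
        return out  # a full model would feed this to a mask predictor

fused = AVFusion()(torch.randn(1, 50, 256), torch.randn(1, 50, 128))
```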

Speech separation with large-scale self-supervised learning

no code implementations9 Nov 2022 Zhuo Chen, Naoyuki Kanda, Jian Wu, Yu Wu, Xiaofei Wang, Takuya Yoshioka, Jinyu Li, Sunit Sivasankaran, Sefik Emre Eskimez

Compared with a supervised baseline and a WavLM-based SS model using feature embeddings from the previously released WavLM trained on 94K hours, our proposed model obtains relative word error rate (WER) reductions of 15.9% and 11.2%, respectively, on a simulated far-field speech mixture test set.

Self-Supervised Learning · Speech Separation
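
For reference, the relative WER reduction quoted above is computed as (baseline − new) / baseline; a worked example with made-up numbers:

```python
# Worked example of the metric quoted in the abstract: relative WER reduction.
def relative_wer_reduction(baseline_wer, new_wer):
    return (baseline_wer - new_wer) / baseline_wer * 100.0

# e.g. a drop from 20.0% to 16.8% absolute WER is a 16.0% *relative* reduction
print(relative_wer_reduction(20.0, 16.8))  # 16.0
```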

Breaking the trade-off in personalized speech enhancement with cross-task knowledge distillation

no code implementations5 Nov 2022 Hassan Taherian, Sefik Emre Eskimez, Takuya Yoshioka

This prevents the PSE model from being too aggressive while still allowing the model to learn to suppress the input speech when it is likely to be spoken by interfering speakers.

Knowledge Distillation · Speech Enhancement
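
A hedged sketch of how cross-task distillation could avoid over-aggressive suppression: distill toward an unconditional enhancement teacher only on frames attributed to the target speaker. The gating variable and loss form are our illustration, not the paper's exact objective:

```python
# Illustrative gated distillation loss (not the paper's exact formulation):
# the PSE student is pulled toward a general enhancement teacher on frames
# dominated by the target speaker, while frames likely spoken by interfering
# speakers are left free to be suppressed.
import torch

def gated_distillation_loss(student_out, teacher_out, target_speaker_prob):
    # student_out/teacher_out: (B, T, F) spectra;
    # target_speaker_prob: (B, T) in [0, 1], e.g. from a speaker-verification head
    gate = target_speaker_prob.unsqueeze(-1)        # distill only where target is active
    per_frame = (student_out - teacher_out).pow(2)  # (B, T, F) squared error
    return (gate * per_frame).mean()
```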

Real-Time Joint Personalized Speech Enhancement and Acoustic Echo Cancellation

no code implementations4 Nov 2022 Sefik Emre Eskimez, Takuya Yoshioka, Alex Ju, Min Tang, Tanel Parnamaa, Huaming Wang

Personalized speech enhancement (PSE) is a real-time SE approach utilizing a speaker embedding of a target person to remove background noise, reverberation, and interfering voices.

Acoustic echo cancellation · Multi-Task Learning +1
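
A minimal sketch of the PSE recipe the abstract describes: condition an enhancement network on a fixed speaker embedding of the target talker, here with the far-end reference added as an extra input for joint AEC. Shapes and layer choices are assumptions, not the paper's architecture:

```python
# Minimal sketch: enhancement conditioned on a target-speaker embedding
# (d-vector), with the far-end signal as a second input for joint AEC.
import torch
import torch.nn as nn

class PersonalizedEnhancer(nn.Module):
    def __init__(self, feat_dim=257, spk_dim=128, hidden=256):
        super().__init__()
        self.rnn = nn.GRU(feat_dim * 2 + spk_dim, hidden, batch_first=True)
        self.mask = nn.Sequential(nn.Linear(hidden, feat_dim), nn.Sigmoid())

    def forward(self, mic_mag, farend_mag, spk_emb):
        # mic_mag/farend_mag: (B, T, F); spk_emb: (B, spk_dim), tiled per frame
        spk = spk_emb.unsqueeze(1).expand(-1, mic_mag.size(1), -1)
        h, _ = self.rnn(torch.cat([mic_mag, farend_mag, spk], dim=-1))
        return self.mask(h) * mic_mag  # masked magnitude estimate

est = PersonalizedEnhancer()(torch.randn(1, 50, 257),
                             torch.randn(1, 50, 257),
                             torch.randn(1, 128))
```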

Leveraging Real Conversational Data for Multi-Channel Continuous Speech Separation

no code implementations7 Apr 2022 Xiaofei Wang, Dongmei Wang, Naoyuki Kanda, Sefik Emre Eskimez, Takuya Yoshioka

In this paper, we propose a three-stage training scheme for the CSS model that can leverage both supervised data and extra large-scale unsupervised real-world conversational data.

Speech Separation
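
The paper specifies the three stages; the schematic below only illustrates the general pattern of alternating supervised simulated data with unlabeled real conversational data, with placeholder stage objectives:

```python
# Schematic only: stage names and ordering are placeholders, not the paper's
# actual three-stage recipe.
def train_css(model, simulated_supervised, real_unsupervised, step_fn):
    stages = [
        ("1: supervised pre-training on simulated mixtures", simulated_supervised),
        ("2: training on large-scale real conversations",    real_unsupervised),
        ("3: supervised fine-tuning",                        simulated_supervised),
    ]
    for stage_name, data in stages:
        for batch in data:
            step_fn(model, batch, stage_name)  # one optimization step per batch
    return model
```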

ICASSP 2022 Deep Noise Suppression Challenge

1 code implementation27 Feb 2022 Harishchandra Dubey, Vishak Gopal, Ross Cutler, Ashkan Aazami, Sergiy Matusevych, Sebastian Braun, Sefik Emre Eskimez, Manthan Thakker, Takuya Yoshioka, Hannes Gamper, Robert Aichner

We open-source datasets and test sets for researchers to train their deep noise suppression models, as well as a subjective evaluation framework based on ITU-T P.835 to rate and rank-order the challenge entries.
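
ITU-T P.835 collects three listener ratings per clip, each on a 1-5 scale: speech signal quality (SIG), background noise intrusiveness (BAK), and overall quality (OVRL). A minimal aggregation of per-clip votes into mean opinion scores might look like this:

```python
# Minimal P.835-style aggregation: average listener votes per rating category.
import statistics

ratings = {  # illustrative votes from five listeners for one processed clip
    "SIG":  [4, 4, 5, 3, 4],
    "BAK":  [5, 4, 4, 4, 5],
    "OVRL": [4, 4, 4, 3, 4],
}
mos = {category: statistics.mean(votes) for category, votes in ratings.items()}
print(mos)  # {'SIG': 4.0, 'BAK': 4.4, 'OVRL': 3.8}
```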

One model to enhance them all: array geometry agnostic multi-channel personalized speech enhancement

no code implementations20 Oct 2021 Hassan Taherian, Sefik Emre Eskimez, Takuya Yoshioka, Huaming Wang, Zhuo Chen, Xuedong Huang

Experimental results show that the proposed geometry agnostic model outperforms the model trained on a specific microphone array geometry in both speech quality and automatic speech recognition accuracy.

Automatic Speech Recognition (ASR) +2
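
One hedged reading of "geometry agnostic" training, in the spirit of the abstract rather than necessarily the paper's method, is to randomize the microphone count and layout for each training example:

```python
# Illustrative geometry augmentation: sample a random circular array layout
# per training example so the model cannot overfit one fixed geometry.
import numpy as np

def sample_array_geometry(rng, max_mics=8, radius_m=0.10):
    n = rng.integers(2, max_mics + 1)            # random channel count
    angles = rng.uniform(0, 2 * np.pi, size=n)   # random placement on a circle
    return np.stack([radius_m * np.cos(angles),
                     radius_m * np.sin(angles)], axis=1)  # (n, 2) xy positions

geom = sample_array_geometry(np.random.default_rng(0))
```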

Personalized Speech Enhancement: New Models and Comprehensive Evaluation

no code implementations18 Oct 2021 Sefik Emre Eskimez, Takuya Yoshioka, Huaming Wang, Xiaofei Wang, Zhuo Chen, Xuedong Huang

Our results show that the proposed models can yield better speech recognition accuracy, speech intelligibility, and perceptual quality than the baseline models, and the multi-task training can alleviate the TSOS issue in addition to improving the speech recognition accuracy.

Speech Enhancement · Speech Recognition +1
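
An illustrative multi-task objective of the kind the abstract alludes to: pairing a signal reconstruction loss with an ASR-based loss penalizes target-speaker over-suppression (TSOS), because deleted target speech also raises the ASR term. Names and weighting are ours, not the paper's:

```python
# Illustrative multi-task loss: reconstruction plus an ASR-derived penalty.
import torch

def multitask_loss(enhanced, clean, asr_loss_fn, transcript, alpha=0.3):
    recon = torch.nn.functional.l1_loss(enhanced, clean)
    asr = asr_loss_fn(enhanced, transcript)  # e.g. CTC loss of a frozen ASR model
    return recon + alpha * asr               # alpha balances the two tasks
```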

All-neural beamformer for continuous speech separation

no code implementations13 Oct 2021 Zhuohuang Zhang, Takuya Yoshioka, Naoyuki Kanda, Zhuo Chen, Xiaofei Wang, Dongmei Wang, Sefik Emre Eskimez

Recently, the all deep learning MVDR (ADL-MVDR) model was proposed for neural beamforming and demonstrated superior performance in a target speech extraction task using pre-segmented input.

Automatic Speech Recognition (ASR) +2
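
For context, the classical MVDR solution that ADL-MVDR replaces with learned recurrent layers is w = Φₙ⁻¹d / (dᴴΦₙ⁻¹d), where Φₙ is the noise spatial covariance and d the steering vector:

```python
# Classical MVDR beamformer weights from the noise spatial covariance and
# steering vector (toy values for illustration).
import numpy as np

def mvdr_weights(noise_cov, steering):
    inv = np.linalg.inv(noise_cov)          # (M, M)
    num = inv @ steering                    # (M,)
    return num / (steering.conj() @ num)    # (M,) distortionless weights

M = 4
phi_n = (np.eye(M) + 0.1 * np.ones((M, M))).astype(complex)  # toy noise covariance
d = np.exp(1j * np.linspace(0, 1, M))                        # toy steering vector
w = mvdr_weights(phi_n, d)
```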

Dynamic Gradient Aggregation for Federated Domain Adaptation

no code implementations14 Jun 2021 Dimitrios Dimitriadis, Kenichi Kumatani, Robert Gmyr, Yashesh Gaur, Sefik Emre Eskimez

The proposed scheme is based on a weighted gradient aggregation using two-step optimization to offer a flexible training pipeline.

Domain Adaptation · Federated Learning +3
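
A sketch of weighted gradient aggregation in the spirit of the abstract; the weighting rule here is a plain normalized average and stands in for, rather than reproduces, the paper's DGA scheme:

```python
# Weighted gradient aggregation: clients' gradients are combined with
# per-client weights instead of a uniform FedAvg mean.
import numpy as np

def aggregate(client_grads, client_weights):
    w = np.asarray(client_weights, dtype=float)
    w = w / w.sum()  # normalize so weights sum to one
    return sum(wi * g for wi, g in zip(w, client_grads))

grads = [np.array([0.2, -0.1]), np.array([0.4, 0.0]), np.array([0.0, 0.3])]
global_grad = aggregate(grads, client_weights=[1.0, 2.0, 1.0])  # [0.25, 0.05]
```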

Improving Readability for Automatic Speech Recognition Transcription

no code implementations9 Apr 2020 Junwei Liao, Sefik Emre Eskimez, Liyang Lu, Yu Shi, Ming Gong, Linjun Shou, Hong Qu, Michael Zeng

In this work, we propose a novel NLP task called ASR post-processing for readability (APR) that aims to transform noisy ASR output into readable text for humans and downstream tasks while maintaining the semantic meaning of the speaker.

Automatic Speech Recognition (ASR) +2
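
A toy example of the APR task's input/output contract; a real APR system learns this mapping with a sequence-to-sequence model rather than the naive rules below:

```python
# Toy illustration of APR: raw ASR output lacks casing and punctuation and
# may contain disfluencies; APR maps it to readable text, meaning preserved.
asr_output = "um i think the uh meeting is at three pm on friday"

FILLERS = {"um", "uh"}

def naive_apr(text):
    words = [w for w in text.split() if w not in FILLERS]  # drop filler words
    s = " ".join(words)
    return s[:1].upper() + s[1:] + "."                     # casing + final period

print(naive_apr(asr_output))  # "I think the meeting is at three pm on friday."
```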

Generating Talking Face Landmarks from Speech

no code implementations26 Mar 2018 Sefik Emre Eskimez, Ross K. Maddox, Chenliang Xu, Zhiyao Duan

In this paper, we present a system that can generate landmark points of a talking face from acoustic speech in real time.
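
A minimal sketch of the speech-to-landmark mapping, assuming the common 68-point face-alignment convention and invented feature dimensions; it shows the input/output shapes, not the paper's trained model:

```python
# Minimal sketch: a recurrent network maps per-frame acoustic features to
# 68 two-dimensional facial landmark points per frame.
import torch
import torch.nn as nn

class Speech2Landmarks(nn.Module):
    def __init__(self, audio_dim=40, hidden=256, n_landmarks=68):
        super().__init__()
        self.rnn = nn.LSTM(audio_dim, hidden, batch_first=True)
        self.head = nn.Linear(hidden, n_landmarks * 2)  # (x, y) per landmark

    def forward(self, audio_feats):                     # (B, T, audio_dim)
        h, _ = self.rnn(audio_feats)
        return self.head(h).view(h.size(0), h.size(1), 68, 2)

pts = Speech2Landmarks()(torch.randn(1, 100, 40))       # (1, 100, 68, 2)
```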
