no code implementations • 18 Jan 2024 • Minsu Kim, Jeong Hun Yeo, Jeongsoo Choi, Se Jin Park, Yong Man Ro
By using visual speech units as the inputs to our system, we pre-train the model to predict the corresponding text outputs on massive multilingual data constructed by merging several VSR databases.
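A minimal sketch of this idea, assuming the visual speech units are discrete IDs fed to a standard encoder-decoder that predicts text tokens; the class name, vocabulary sizes, and architecture below are illustrative placeholders, not the paper's actual model:

```python
# Hypothetical sketch: discrete visual speech units -> text tokens.
import torch
import torch.nn as nn

NUM_UNITS, NUM_TEXT_TOKENS, DIM = 200, 1000, 256  # illustrative vocabulary sizes

class UnitToTextModel(nn.Module):
    def __init__(self):
        super().__init__()
        self.unit_emb = nn.Embedding(NUM_UNITS, DIM)         # embed discrete visual speech units
        self.text_emb = nn.Embedding(NUM_TEXT_TOKENS, DIM)   # embed target text tokens
        self.seq2seq = nn.Transformer(d_model=DIM, batch_first=True)
        self.out = nn.Linear(DIM, NUM_TEXT_TOKENS)

    def forward(self, units, text_in):
        # units: (B, T_units) unit IDs, text_in: (B, T_text) shifted target tokens
        return self.out(self.seq2seq(self.unit_emb(units), self.text_emb(text_in)))

model = UnitToTextModel()
units = torch.randint(0, NUM_UNITS, (2, 50))
text_in = torch.randint(0, NUM_TEXT_TOKENS, (2, 20))
logits = model(units, text_in)                               # (2, 20, NUM_TEXT_TOKENS)
loss = nn.functional.cross_entropy(logits.reshape(-1, NUM_TEXT_TOKENS),
                                   torch.randint(0, NUM_TEXT_TOKENS, (2 * 20,)))
```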
1 code implementation • 5 Dec 2023 • Jeongsoo Choi, Se Jin Park, Minsu Kim, Yong Man Ro
To mitigate the absence of a parallel AV2AV translation dataset, we propose to train our spoken language translation system with an audio-only A2A dataset.
no code implementations • 15 Sep 2023 • Minsu Kim, Jeongsoo Choi, Soumi Maiti, Jeong Hun Yeo, Shinji Watanabe, Yong Man Ro
To this end, we begin by importing rich image-comprehension and language-modeling knowledge from a large-scale pre-trained vision-language model into Im2Sp.
no code implementations • ICCV 2023 • Minsu Kim, Jeong Hun Yeo, Jeongsoo Choi, Yong Man Ro
To mitigate this challenge, we learn general speech knowledge, the ability to model lip movements, from a high-resource language through the prediction of speech units.
1 code implementation • ICCV 2023 • Jeongsoo Choi, Joanna Hong, Yong Man Ro
In doing so, rich speaker embedding information can be produced solely from the input visual information, so no extra audio information is needed at inference time.
no code implementations • 15 Aug 2023 • Jeong Hun Yeo, Minsu Kim, Jeongsoo Choi, Dae Hoe Kim, Yong Man Ro
Visual Speech Recognition (VSR) is the task of predicting spoken words from silent lip movements.
1 code implementation • 3 Aug 2023 • Minsu Kim, Jeongsoo Choi, Dahun Kim, Yong Man Ro
A single pre-trained model with UTUT can be employed for diverse multilingual speech- and text-related tasks, such as Speech-to-Speech Translation (STS), multilingual Text-to-Speech Synthesis (TTS), and Text-to-Speech Translation (TTST).
no code implementations • 28 Jun 2023 • Jeongsoo Choi, Minsu Kim, Se Jin Park, Yong Man Ro
The visual speaker embedding is derived from a single target face image and enables improved mapping of input text to the learned audio latent space by incorporating the speaker characteristics inherent in the audio.
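A minimal sketch of this kind of face-conditioned synthesis, assuming the speaker embedding from one face image is simply broadcast over the text features before decoding to mel frames; all module names and dimensions are illustrative placeholders, not the paper's architecture:

```python
# Hypothetical sketch: a face-derived speaker embedding conditions text-to-speech decoding.
import torch
import torch.nn as nn

DIM = 256

face_encoder = nn.Sequential(          # stand-in for a real face encoder (e.g. a CNN)
    nn.Flatten(), nn.Linear(3 * 64 * 64, DIM), nn.ReLU(), nn.Linear(DIM, DIM))
text_encoder = nn.Embedding(100, DIM)  # toy text/phoneme embedding
decoder = nn.GRU(2 * DIM, DIM, batch_first=True)
to_mel = nn.Linear(DIM, 80)            # project to an 80-bin mel-spectrogram frame

face = torch.randn(1, 3, 64, 64)                 # single target face image
phonemes = torch.randint(0, 100, (1, 30))        # toy phoneme sequence
spk = face_encoder(face)                         # (1, DIM) visual speaker embedding
txt = text_encoder(phonemes)                     # (1, 30, DIM)
dec_in = torch.cat([txt, spk.unsqueeze(1).expand(-1, txt.size(1), -1)], dim=-1)
mel, _ = decoder(dec_in)
mel = to_mel(mel)                                # (1, 30, 80) predicted mel frames
```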
1 code implementation • 31 May 2023 • Jeongsoo Choi, Minsu Kim, Yong Man Ro
Therefore, the proposed L2S model is trained to generate multiple targets: a mel-spectrogram and speech units.
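A minimal sketch of such a multi-target objective, assuming the model emits both a mel-spectrogram and speech-unit logits; the tensor shapes and loss weighting are illustrative, not the paper's settings:

```python
# Hypothetical joint loss over two targets: mel-spectrogram regression + unit classification.
import torch
import torch.nn.functional as F

B, T, N_MEL, N_UNITS = 2, 100, 80, 200
pred_mel = torch.randn(B, T, N_MEL, requires_grad=True)       # predicted mel-spectrogram
pred_unit_logits = torch.randn(B, T, N_UNITS, requires_grad=True)
gt_mel = torch.randn(B, T, N_MEL)                             # ground-truth mel-spectrogram
gt_units = torch.randint(0, N_UNITS, (B, T))                  # ground-truth speech units

mel_loss = F.l1_loss(pred_mel, gt_mel)                        # regression target
unit_loss = F.cross_entropy(pred_unit_logits.reshape(-1, N_UNITS), gt_units.reshape(-1))
loss = mel_loss + 1.0 * unit_loss                             # joint multi-target loss
loss.backward()
```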
no code implementations • 31 May 2023 • Se Jin Park, Minsu Kim, Jeongsoo Choi, Yong Man Ro
The contextualized lip motion unit then guides the latter in synthesizing a target identity with context-aware lip motion.
1 code implementation • CVPR 2023 • Joanna Hong, Minsu Kim, Jeongsoo Choi, Yong Man Ro
Thus, we first show that previous AVSR models are in fact not robust to corruption of the multimodal input streams, the audio and the visual inputs, compared to uni-modal models.
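A toy version of this kind of robustness check, comparing a model's outputs on clean versus corrupted audio/visual streams; the corruption functions and the stand-in model below are assumptions for illustration, not the paper's exact protocol:

```python
# Hypothetical robustness probe: corrupt audio (noise) and video (frame dropping), then compare.
import torch

def corrupt_audio(audio, noise_std=0.5):
    return audio + noise_std * torch.randn_like(audio)          # additive noise

def corrupt_video(video, drop_prob=0.5):
    mask = (torch.rand(video.shape[0], video.shape[1], 1, 1, 1) > drop_prob).float()
    return video * mask                                          # randomly drop frames

def evaluate(model, audio, video):
    with torch.no_grad():
        clean = model(audio, video)
        noisy = model(corrupt_audio(audio), corrupt_video(video))
    return clean, noisy  # compare e.g. word error rates of the two outputs

dummy_model = lambda a, v: a.mean() + v.mean()                   # stand-in for an AVSR model
audio = torch.randn(2, 16000)                                    # toy waveforms
video = torch.randn(2, 25, 3, 96, 96)                            # toy lip-region frames
clean_out, noisy_out = evaluate(dummy_model, audio, video)
```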
no code implementations • 2 Nov 2022 • Se Jin Park, Minsu Kim, Joanna Hong, Jeongsoo Choi, Yong Man Ro
It stores lip motion features from sequential ground truth images in the value memory and aligns them with corresponding audio features so that they can be retrieved using audio input at inference time.
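A minimal sketch of such audio-keyed retrieval, assuming the memory is a fixed set of slots read by scaled dot-product attention; the slot count and feature size are illustrative:

```python
# Hypothetical key-value memory: audio-feature keys index stored lip-motion values.
import torch
import torch.nn.functional as F

M, DIM = 64, 256                                   # memory slots and feature size (illustrative)
key_memory = torch.randn(M, DIM)                   # audio-feature keys aligned with the values
value_memory = torch.randn(M, DIM)                 # stored lip-motion features

def retrieve(audio_feat):
    # audio_feat: (T, DIM) audio features from the input speech
    attn = F.softmax(audio_feat @ key_memory.t() / DIM ** 0.5, dim=-1)   # (T, M)
    return attn @ value_memory                     # (T, DIM) retrieved lip-motion features

lip_motion = retrieve(torch.randn(10, DIM))
```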