Unified speech and gesture synthesis using flow matching

As text-to-speech technologies achieve remarkable naturalness in read-aloud tasks, there is growing interest in multimodal synthesis of verbal and non-verbal communicative behaviour, such as spontaneous speech and associated body gestures. This paper presents a novel, unified architecture for jointly synthesising speech acoustics and skeleton-based 3D gesture motion from text, trained using optimal-transport conditional flow matching (OT-CFM). The proposed architecture is simpler than the previous state of the art, has a smaller memory footprint, and can capture the joint distribution of speech and gestures, generating both modalities together in a single process. The new training regime, meanwhile, enables better synthesis quality in far fewer steps (network evaluations) than before. Uni- and multimodal subjective tests demonstrate improved speech naturalness, gesture human-likeness, and cross-modal appropriateness compared to existing benchmarks. Please see https://shivammehta25.github.io/Match-TTSG/ for video examples and code.
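To make the OT-CFM training objective concrete, here is a minimal sketch of one training step (an illustration only, not the authors' implementation; the decoder `v_theta`, the joint feature layout, and the conditioning interface are assumptions):

```python
# Minimal OT-CFM loss sketch (PyTorch), assuming a decoder `v_theta` that
# predicts a vector field over concatenated speech (mel) and gesture (joint)
# features, conditioned on text-derived features `cond`.
import torch
import torch.nn.functional as F

def ot_cfm_loss(v_theta, x1, cond, sigma_min=1e-4):
    """One OT-CFM training step for a joint speech+gesture target x1.

    x1:   (batch, time, mel_dims + motion_dims) ground-truth features
    cond: conditioning passed to the decoder (hypothetical interface)
    """
    b = x1.size(0)
    t = torch.rand(b, 1, 1, device=x1.device)   # flow time ~ U(0, 1)
    x0 = torch.randn_like(x1)                   # noise sample
    # Optimal-transport conditional path: straight line from x0 to x1
    xt = (1 - (1 - sigma_min) * t) * x0 + t * x1
    # The target velocity along that path is constant in t
    ut = x1 - (1 - sigma_min) * x0
    # Regress the network's predicted vector field onto the target velocity
    return F.mse_loss(v_theta(xt, t.view(b), cond), ut)
```

At inference time, one samples noise and integrates the learned vector field from t = 0 to t = 1 with an ODE solver; because the OT paths are straight, a handful of Euler steps typically suffices, which is what allows good synthesis quality in few network evaluations.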

| Task | Dataset | Model | Metric | Value | Global Rank |
|---|---|---|---|---|---|
| Motion Synthesis | Trinity Speech-Gesture Dataset | Match-TTSG | Mean Opinion Score | 3.44 | #1 |
| Text-To-Speech Synthesis | Trinity Speech-Gesture Dataset | Match-TTSG | MOS | 3.7 | #1 |
| Audio Synthesis | Trinity Speech-Gesture Dataset | Match-TTSG | WER | 8.85 | #1 |
