This work presents a new database of high-sampling-rate recordings of a single male speaker reading sentences in Brazilian Portuguese in a neutral voice, along with the corresponding text corpus. The dataset is intended for speech synthesis and other speech-oriented applications. Its text scripts were extracted from a popular Brazilian TV news program and read aloud by a trained speaker in a controlled environment, yielding roughly 20 h of audio. The text was normalized during recording, and special textual occurrences (e.g., acronyms, numbers, and foreign names) were replaced by phonetic transcriptions written as readable Portuguese text. The audio samples contain no noticeable accidental sounds, and background noise was kept to a minimum throughout. To illustrate the benefits of making this data available, we conducted text-to-speech experiments with state-of-the-art speech synthesis models (Tacotron 2 and WaveGlow). After training on our data, we obtained intelligible, natural-sounding voices for an unseen target speaker from as little as 8 min of that speaker's audio. Increasing the target recording time to 75 min noticeably improved pronunciation accuracy.

Dataset introduced in the paper: GneutralSpeech Male