
High-diversity but High-fidelity Speech Synthesis


Authors:

(1) Sang-Hoon Lee, Fellow, IEEE with the Department of Artificial Intelligence, Korea University, Seoul 02841, South Korea;

(2) Ha-Yeong Choi, Fellow, IEEE with the Department of Artificial Intelligence, Korea University, Seoul 02841, South Korea;

(3) Seung-Bin Kim, Fellow, IEEE with the Department of Artificial Intelligence, Korea University, Seoul 02841, South Korea;

(4) Seong-Whan Lee, Fellow, IEEE with the Department of Artificial Intelligence, Korea University, Seoul 02841, South Korea (Corresponding author).

Table of Links

Abstract and 1 Introduction

2 Related Work

2.1 Neural Codec Language Models and 2.2 Non-autoregressive Models

2.3 Diffusion Models and 2.4 Zero-shot Voice Cloning

3 Hierspeech++ and 3.1 Speech Representations

3.2 Hierarchical Speech Synthesizer

3.3 Text-to-Vec

3.4 Speech Super-resolution

3.5 Model Architecture

4 Speech Synthesis Tasks

4.1 Voice Conversion and 4.2 Text-to-Speech

4.3 Style Prompt Replication

5 Experiment and Result, and 5.1 Dataset

5.2 Preprocessing and 5.3 Training

5.4 Evaluation Metrics

5.5 Ablation Study

5.6 Zero-shot Voice Conversion

5.7 High-diversity but High-fidelity Speech Synthesis

5.8 Zero-shot Text-to-Speech

5.9 Zero-shot Text-to-Speech with 1s Prompt

5.10 Speech Super-resolution

5.11 Additional Experiments with Other Baselines

6 Limitation and Quick Fix

7 Conclusion, Acknowledgement and References

5.7 High-diversity but High-fidelity Speech Synthesis

Following Glow-TTS [34], speech with different styles can be synthesized by controlling the temperature parameters in the TTV module and the hierarchical speech synthesizer. TABLE 6 shows that lower temperatures ensure the robustness of the synthesized speech in terms of pronunciation, while diversity and speaker similarity can be increased by raising the temperature. Specifically, we found that increasing T_ttv improved the similarity of prosody, such as intonation and pronunciation, to the target prosody prompt, and that increasing T_h improved the similarity of voice style in terms of SECS. However, the CER and WER increase as T_ttv approaches 1; therefore, we utilized a value under 1 for robust speech synthesis. In addition, we can synthesize speech differently by sampling different Gaussian noise, and we can control the speaking rate by multiplying the predicted duration by a specific value.
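To make the temperature mechanism concrete, the sketch below shows how a Glow-TTS-style prior can be sampled with a temperature-scaled noise term and how durations can be rescaled. This is a minimal illustration, not the authors' implementation: the names sample_with_temperature, ttv_mean, ttv_log_std, h_mean, h_log_std, and predicted_durations are hypothetical placeholders.

```python
# Minimal sketch (assumed names, not the authors' API): temperature-controlled
# sampling from a Gaussian prior, as used in Glow-TTS-style flow models.
import torch

def sample_with_temperature(mean, log_std, temperature):
    """Draw a latent from N(mean, std^2) with the noise scaled by `temperature`.

    temperature < 1 keeps samples close to the mean (robust pronunciation);
    temperature near 1 yields more diverse prosody/voice style at the cost of CER/WER.
    """
    noise = torch.randn_like(mean) * temperature
    return mean + torch.exp(log_std) * noise

# T_ttv controls prosody diversity in the text-to-vec module;
# T_h controls voice-style diversity in the hierarchical speech synthesizer.
T_ttv, T_h = 0.333, 0.333      # values under 1 for robust synthesis
duration_scale = 1.2           # multiply predicted durations to slow speech down

# Hypothetical usage with prior statistics produced by the two modules:
# z_ttv     = sample_with_temperature(ttv_mean, ttv_log_std, T_ttv)
# z_h       = sample_with_temperature(h_mean, h_log_std, T_h)
# durations = torch.ceil(predicted_durations * duration_scale)
```

Different random seeds for the Gaussian noise yield different renditions of the same text, which is how the paragraph above obtains diverse yet high-fidelity samples.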


TABLE 7: Zero-shot TTS results with a noisy prompt on unseen speakers from the test-clean subset of LibriTTS. We synthesize all sentences of the subset (4,837 samples). For HierSpeech++, we only utilize the text sequences from LibriTTS train-960.


TABLE 8: Zero-shot TTS results with a very noisy prompt on unseen speakers from the test-other subset of LibriTTS. We synthesize all sentences of the test-other subset (5,120 samples).


This paper is available on arXiv under a CC BY-NC-SA 4.0 DEED license.



