High-diversity but High-fidelity Speech Synthesis

Authors:

(1) Sang-Hoon Lee, Fellow, IEEE with the Department of Artificial Intelligence, Korea University, Seoul 02841, South Korea;

(2) Ha-Yeong Choi, Fellow, IEEE with the Department of Artificial Intelligence, Korea University, Seoul 02841, South Korea;

(3) Seung-Bin Kim, Fellow, IEEE with the Department of Artificial Intelligence, Korea University, Seoul 02841, South Korea;

(4) Seong-Whan Lee, Fellow, IEEE with the Department of Artificial Intelligence, Korea University, Seoul 02841, South Korea (corresponding author).

5.7 High-diversity but High-fidelity Speech Synthesis

Following Glow-TTS [34], speech with different styles can be synthesized by controlling the temperature parameters of the TTV (T_ttv) and the hierarchical speech synthesizer (T_h). TABLE 6 shows that lower temperatures make the synthetic speech more robust in terms of pronunciation, whereas diversity and speaker similarity can be increased by raising the temperature. Specifically, we found that increasing T_ttv improved the similarity of prosody, such as intonation and pronunciation, to the target prosody prompts, and that increasing T_h improved the similarity of voice style in terms of SECS. However, when T_ttv is close to 1, the CER and WER degrade (that is, the error rates rise); therefore, we used a value below 1 for robust speech synthesis. In addition, we can synthesize different speech from the same input by sampling different Gaussian noise, and we can control the duration by multiplying the duration by a specific value.
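The two controls described above can be illustrated with a minimal sketch. It is not the HierSpeech++ implementation: it assumes a Glow-TTS-style sampler in which standard Gaussian noise is scaled by a temperature before being added to the prior mean, and it rounds scaled durations up to whole frames. The names `sample_latent`, `scale_durations`, and `spread`, as well as the toy values, are hypothetical.

```python
import math
import random


def sample_latent(mean, log_std, temperature, rng):
    """Draw z = mean + exp(log_std) * temperature * eps, with eps ~ N(0, 1).

    Assumption: a diagonal Gaussian prior, as in Glow-TTS-style samplers.
    A temperature of 0 returns the mean; larger values add more variation.
    """
    return [
        m + math.exp(s) * temperature * rng.gauss(0.0, 1.0)
        for m, s in zip(mean, log_std)
    ]


def scale_durations(durations, rate):
    """Multiply per-token frame counts by `rate`, keeping at least one frame."""
    return [max(1, math.ceil(d * rate)) for d in durations]


def spread(values):
    """Population standard deviation, used here only to compare samples."""
    mu = sum(values) / len(values)
    return math.sqrt(sum((v - mu) ** 2 for v in values) / len(values))


# Toy prior: zero mean and unit standard deviation over 1000 dimensions.
mean = [0.0] * 1000
log_std = [0.0] * 1000

rng = random.Random(0)
low = sample_latent(mean, log_std, 0.333, rng)   # conservative setting
high = sample_latent(mean, log_std, 1.0, rng)    # more diverse setting

# Different noise seeds give different samples at the same temperature.
sample_a = sample_latent(mean, log_std, 0.333, random.Random(1))
sample_b = sample_latent(mean, log_std, 0.333, random.Random(2))

# Slow the utterance down by a factor of 1.5.
slower = scale_durations([2, 3, 1], 1.5)
```

In this sketch, the lower temperature keeps samples closer to the prior mean, which corresponds to the more robust but less diverse regime, while the duration multiplier stretches or compresses each token independently of the sampled noise.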