Authors:
(1) Sang-Hoon Lee, Fellow, IEEE with the Department of Artificial Intelligence, Korea University, Seoul 02841, South Korea;
(2) Ha-Yeong Choi, Fellow, IEEE with the Department of Artificial Intelligence, Korea University, Seoul 02841, South Korea;
(3) Seung-Bin Kim, Fellow, IEEE with the Department of Artificial Intelligence, Korea University, Seoul 02841, South Korea;
(4) Seong-Whan Lee, Fellow, IEEE with the Department of Artificial Intelligence, Korea University, Seoul 02841, South Korea (corresponding author).
2 Related Work
2.1 Neural Codec Language Models and 2.2 Non-autoregressive Models
2.3 Diffusion Models and 2.4 Zero-shot Voice Cloning
3 Hierspeech++ and 3.1 Speech Representations
3.2 Hierarchical Speech Synthesizer
4 Speech Synthesis Tasks
4.1 Voice Conversion and 4.2 Text-to-Speech
5 Experiment and Result, and Dataset
5.2 Preprocessing and 5.3 Training
5.6 Zero-shot Voice Conversion
5.7 High-diversity but High-fidelity Speech Synthesis
5.9 Zero-shot Text-to-Speech with 1s Prompt
5.11 Additional Experiments with Other Baselines
7 Conclusion, Acknowledgement and References
Following Glow-TTS [34], speech with different styles can be synthesized by controlling the temperature parameters of the TTV (T_ttv) and the hierarchical speech synthesizer (T_h). TABLE 6 shows that lower temperatures make the synthetic speech more robust in terms of pronunciation, whereas higher temperatures can increase diversity and speaker similarity. Specifically, we found that increasing T_ttv improved the similarity of prosody, such as intonation and pronunciation, to the target prosody prompt, and that increasing T_h improved the similarity of voice style in terms of SECS. However, when T_ttv is close to 1, the CER and WER worsen; therefore, we used a value below 1 for robust speech synthesis. In addition, we can synthesize different speech by sampling different Gaussian noise, and control the overall duration by multiplying the durations by a constant factor.
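The two controls described above, temperature-scaled Gaussian sampling and duration scaling, can be illustrated with a minimal sketch. It is not the HierSpeech++ implementation: the function names `sample_latent` and `scale_durations`, the per-dimension `mean`/`log_std` parameterization, and the ceiling rounding of durations are all assumptions made for illustration.

```python
import math
import random


def sample_latent(mean, log_std, temperature, rng):
    """Draw z = mean + temperature * exp(log_std) * eps with eps ~ N(0, 1).

    Hypothetical helper: a temperature below 1 shrinks the noise term, so
    samples stay closer to the mean (less diverse, more conservative).
    """
    return [
        m + temperature * math.exp(s) * rng.gauss(0.0, 1.0)
        for m, s in zip(mean, log_std)
    ]


def scale_durations(durations, rate):
    """Multiply per-token durations (in frames) by `rate`.

    Hypothetical helper: results are rounded up to whole frames, which is
    an assumption of this sketch. rate > 1 lengthens the utterance and
    rate < 1 shortens it.
    """
    return [math.ceil(d * rate) for d in durations]


if __name__ == "__main__":
    mean, log_std = [0.5, -1.0, 0.0], [0.0, -0.5, 0.2]

    # Different seeds stand in for "different Gaussian noises".
    for seed in (0, 1):
        z = sample_latent(mean, log_std, temperature=0.333, rng=random.Random(seed))
        print(f"seed={seed}: {[round(v, 3) for v in z]}")

    print(scale_durations([3, 5, 2], rate=1.5))
```

With the temperature set to 0 the sample collapses to the mean, and with the same seed a larger temperature moves every dimension proportionally further from the mean. This mirrors the trade-off reported above between robustness at low temperature and diversity at high temperature.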
This paper is available on arXiv under the CC BY-NC-SA 4.0 DEED license.