2 Related Work
2.1 Neural Codec Language Models and 2.2 Non-autoregressive Models
2.3 Diffusion Models and 2.4 Zero-shot Voice Cloning
3 Hierspeech++ and 3.1 Speech Representations
3.2 Hierarchical Speech Synthesizer
4 Speech Synthesis Tasks
4.1 Voice Conversion and 4.2 Text-to-Speech
5 Experiment and Result, and Dataset
5.2 Preprocessing and 5.3 Training
5.6 Zero-shot Voice Conversion
5.7 High-diversity but High-fidelity Speech Synthesis
5.9 Zero-shot Text-to-Speech with 1s Prompt
5.11 Additional Experiments with Other Baselines
7 Conclusion, Acknowledgement and References
We compared the zero-shot TTS performance of our model with Vall-E, NaturalSpeech 2, and StyleTTS 2. Because no official implementations of these models are available, we used the demo samples provided with NaturalSpeech 2 and StyleTTS 2, which limited this experiment to four samples. We also added the audio samples to our demo page. For naturalness, we used UTMOS, and our model scores significantly higher than the others. We also compared speaker similarity with both the prompt and the ground truth (GT). TABLE 11 shows that our model has much higher similarity with the prompts than the other models, but lower similarity with GT. We found that the SECS between the prompt and GT is also low for these samples, which suggests that the prompt and GT differ slightly in style even though they come from the same speaker.
In addition, this comparison rests on only four samples, so it is limited in scope. We also consider the similarity between the prompt and the generated speech to be the more important measure for zero-shot speech synthesis.
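The three similarity comparisons discussed above (generated vs. prompt, generated vs. GT, and prompt vs. GT) can be illustrated with a minimal sketch. It assumes SECS is the cosine similarity between speaker embeddings, which is the usual meaning of the term. The function names `cosine_similarity` and `secs_report` and the three-dimensional toy vectors are hypothetical and stand in for embeddings that a real speaker encoder would produce. This is not the authors' evaluation code.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity between two equal-length embedding vectors."""
    if len(a) != len(b):
        raise ValueError("embeddings must have the same dimension")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        raise ValueError("embeddings must be non-zero")
    return dot / (norm_a * norm_b)


def secs_report(prompt_emb, generated_emb, gt_emb):
    """Return the three pairwise similarities compared in the text.

    The inputs are assumed to be speaker embeddings of the prompt,
    the generated speech, and the ground-truth recording.
    """
    return {
        "generated_vs_prompt": cosine_similarity(generated_emb, prompt_emb),
        "generated_vs_gt": cosine_similarity(generated_emb, gt_emb),
        "prompt_vs_gt": cosine_similarity(prompt_emb, gt_emb),
    }


if __name__ == "__main__":
    # Hypothetical toy embeddings, chosen only to illustrate the pattern:
    # the generated speech stays close to the prompt, while the prompt
    # and GT themselves are not very similar.
    prompt = [1.0, 0.0, 0.0]
    generated = [0.9, 0.1, 0.0]
    ground_truth = [0.6, 0.8, 0.0]

    for name, value in secs_report(prompt, generated, ground_truth).items():
        print(f"{name}: {value:.3f}")
```

When the prompt-vs-GT similarity is itself low, a model that follows the prompt closely will tend to score lower against GT. That is the pattern the paragraph above describes.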
This paper is available on arXiv under the CC BY-NC-SA 4.0 DEED license.
Authors:
(1) Sang-Hoon Lee, Fellow, IEEE with the Department of Artificial Intelligence, Korea University, Seoul 02841, South Korea;
(2) Ha-Yeong Choi, Fellow, IEEE with the Department of Artificial Intelligence, Korea University, Seoul 02841, South Korea;
(3) Seung-Bin Kim, Fellow, IEEE with the Department of Artificial Intelligence, Korea University, Seoul 02841, South Korea;
(4) Seong-Whan Lee, Fellow, IEEE with the Department of Artificial Intelligence, Korea University, Seoul 02841, South Korea (corresponding author).