2 Related Work
2.1 Neural Codec Language Models and 2.2 Non-autoregressive Models
2.3 Diffusion Models and 2.4 Zero-shot Voice Cloning
3 HierSpeech++ and 3.1 Speech Representations
3.2 Hierarchical Speech Synthesizer
4 Speech Synthesis Tasks
4.1 Voice Conversion and 4.2 Text-to-Speech
5 Experiment and Result, and 5.1 Dataset
5.2 Preprocessing and 5.3 Training
5.6 Zero-shot Voice Conversion
5.7 High-diversity but High-fidelity Speech Synthesis
5.9 Zero-shot Text-to-Speech with 1s Prompt
5.11 Additional Experiments with Other Baselines
7 Conclusion, Acknowledgement and References
We compared the zero-shot TTS performance of HierSpeech++ with other baselines: 1) YourTTS, a VITS-based end-to-end TTS model; 2) HierSpeech, an end-to-end TTS model using a hierarchical VAE; 3) VALL-E-X, a neural codec language model-based multi-lingual zero-shot TTS model, for which we used an unofficial implementation with improved audio quality from a Vocos decoder; and 4) XTTS [12], the XTTS v1 TTS product from Coqui, built on the open-source TTS model TorToise [5], which was the first TTS model trained on an unprecedentedly large-scale speech dataset. For zero-shot TTS, we used noisy speech prompts from the test-clean and test-other subsets of LibriTTS. HierSpeech++ synthesizes speech with temperatures T_ttv = 0.333 and T_h = 0.333 in TABLE 7 and 8.
The results demonstrate that our model is a strong zero-shot TTS model on all subjective and objective metrics. We conducted three MOS experiments, covering naturalness, prosody, and similarity. Our model significantly outperforms all baselines, and it even surpasses the ground truth in terms of naturalness. However, XTTS performs better in pMOS, which suggests that learning prosody requires more data to improve expressiveness. Although the other models show limitations when synthesizing speech from noisy prompts, our model synthesizes speech robustly. Furthermore, our model achieves a lower CER and WER than the ground truth, which further demonstrates its robustness. In summary, all results demonstrate the superiority of our model in naturalness, expressiveness, and robustness for zero-shot TTS.
In addition, we could further improve zero-shot TTS performance by introducing style prompt replication (SPR), described in the following subsection. Note that we do not apply SPR in TABLE 2-8. The audio can also be upsampled to 48 kHz. Lastly, we can synthesize noise-free speech even from a noisy prompt. The details are described in Section 6.
This paper is available on arxiv under CC BY-NC-SA 4.0 DEED license.
[12] https://github.com/coqui-ai/TTS
Authors:
(1) Sang-Hoon Lee, with the Department of Artificial Intelligence, Korea University, Seoul 02841, South Korea;
(2) Ha-Yeong Choi, with the Department of Artificial Intelligence, Korea University, Seoul 02841, South Korea;
(3) Seung-Bin Kim, with the Department of Artificial Intelligence, Korea University, Seoul 02841, South Korea;
(4) Seong-Whan Lee, Fellow, IEEE, with the Department of Artificial Intelligence, Korea University, Seoul 02841, South Korea (corresponding author).