2 Related Work
2.1 Neural Codec Language Models and 2.2 Non-autoregressive Models
2.3 Diffusion Models and 2.4 Zero-shot Voice Cloning
3 HierSpeech++ and 3.1 Speech Representations
3.2 Hierarchical Speech Synthesizer
4 Speech Synthesis Tasks
4.1 Voice Conversion and 4.2 Text-to-Speech
5 Experiment and Result, and 5.1 Dataset
5.2 Preprocessing and 5.3 Training
5.6 Zero-shot Voice Conversion
5.7 High-diversity but High-fidelity Speech Synthesis
5.9 Zero-shot Text-to-Speech with 1s Prompt
5.11 Additional Experiments with Other Baselines
7 Conclusion, Acknowledgement and References
Fig. 6 illustrates the entire inference pipeline. For voice conversion, we first extract the semantic representation from the 16 kHz audio using MMS, and the F0 using the YAAPT algorithm. Before feeding F0 into the hierarchical speech synthesizer, we normalize it using the mean and standard deviation of the source speech, and then denormalize it using the mean and standard deviation of the target speech. The speech synthesizer generates 16 kHz speech in the target voice style from the target voice prompt, and SpeechSR can upsample the synthesized speech to high-resolution 48 kHz speech. For a fair comparison, we do not use SpeechSR when evaluating VC performance.
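The F0 transfer step above amounts to speaker-level statistics matching. Below is a minimal sketch of that normalize-then-denormalize step, assuming F0 contours come from a pitch tracker such as YAAPT with zeros marking unvoiced frames; the helper name and the voiced-frames-only convention are our assumptions, not details specified in the paper.

```python
import numpy as np

def transfer_f0(src_f0: np.ndarray, tgt_f0: np.ndarray) -> np.ndarray:
    """Normalize source F0 with source stats, denormalize with target stats."""
    voiced = src_f0 > 0                      # unvoiced frames stay at 0
    src_voiced = src_f0[voiced]
    tgt_voiced = tgt_f0[tgt_f0 > 0]
    out = np.zeros_like(src_f0)
    # z-score with the source speaker's statistics ...
    z = (src_voiced - src_voiced.mean()) / (src_voiced.std() + 1e-8)
    # ... then rescale to the target speaker's statistics
    out[voiced] = z * tgt_voiced.std() + tgt_voiced.mean()
    return out
```

This shifts the source contour to the target speaker's pitch range while preserving its relative shape, which is the point of normalizing before synthesis.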
For text-to-speech, we extract the semantic representation from text instead of speech. The TTV generates a semantic representation with the target prosody from the prosody prompt, the hierarchical speech synthesizer generates speech from that representation, and SpeechSR can upsample the output from 16 kHz to 48 kHz. For a fair comparison, SpeechSR was not used during the TTS evaluation. Note that prosody and voice style can be transferred from two different target prompts independently, as sketched below.
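To make the data flow concrete, here is a minimal sketch of the TTS inference path described above. All interfaces here (`ttv`, `synthesizer`, `speech_sr`) are hypothetical stand-ins for the TTV model, the hierarchical speech synthesizer, and SpeechSR; they are not the actual HierSpeech++ API.

```python
from typing import Callable, Optional
import numpy as np

def synthesize(
    text: str,
    prosody_prompt: np.ndarray,   # 16 kHz waveform carrying the target prosody
    voice_prompt: np.ndarray,     # 16 kHz waveform carrying the target voice style
    ttv: Callable[[str, np.ndarray], np.ndarray],
    synthesizer: Callable[[np.ndarray, np.ndarray], np.ndarray],
    speech_sr: Optional[Callable[[np.ndarray], np.ndarray]] = None,
) -> np.ndarray:
    """Text -> semantic representation -> 16 kHz speech -> optional 48 kHz."""
    # TTV maps text to a semantic representation with the prompt's prosody.
    semantic = ttv(text, prosody_prompt)
    # The hierarchical speech synthesizer renders 16 kHz speech in the
    # voice style of the (possibly different) voice prompt.
    wav_16k = synthesizer(semantic, voice_prompt)
    # SpeechSR optionally upsamples 16 kHz -> 48 kHz (skipped for evaluation).
    return speech_sr(wav_16k) if speech_sr is not None else wav_16k
```

Because the two prompts enter at different stages, a speaker's prosody can be borrowed from one reference while the voice identity comes from another.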
This paper is available on arxiv under CC BY-NC-SA 4.0 DEED license.
Authors:
(1) Sang-Hoon Lee, Department of Artificial Intelligence, Korea University, Seoul 02841, South Korea;
(2) Ha-Yeong Choi, Department of Artificial Intelligence, Korea University, Seoul 02841, South Korea;
(3) Seung-Bin Kim, Department of Artificial Intelligence, Korea University, Seoul 02841, South Korea;
(4) Seong-Whan Lee, Fellow, IEEE, Department of Artificial Intelligence, Korea University, Seoul 02841, South Korea (corresponding author).