2 Related Work
2.1 Neural Codec Language Models and 2.2 Non-autoregressive Models
2.3 Diffusion Models and 2.4 Zero-shot Voice Cloning
3 HierSpeech++ and 3.1 Speech Representations
3.2 Hierarchical Speech Synthesizer
4 Speech Synthesis Tasks
4.1 Voice Conversion and 4.2 Text-to-Speech
5 Experiment and Result, and 5.1 Dataset
5.2 Preprocessing and 5.3 Training
5.6 Zero-shot Voice Conversion
5.7 High-diversity but High-fidelity Speech Synthesis
5.9 Zero-shot Text-to-Speech with 1s Prompt
5.11 Additional Experiments with Other Baselines
7 Conclusion, Acknowledgement and References
For TTS, we introduce a text-to-vec (TTV) model that generates a semantic representation and an F0 contour from a text sequence. Following VITS [35], we utilize a variational autoencoder (VAE) and monotonic alignment search (MAS) to align text and speech internally, as shown in Fig. 4. We replace the linear spectrogram with a self-supervised speech representation as the input to the posterior encoder, and TTV reconstructs the same self-supervised speech representation as its output. Furthermore, we predict F0 at four times the resolution of the self-supervised speech representation. We use a text sequence and a prosody prompt as conditional information to generate the self-supervised speech representation of the data, and we utilize a prosody-conditioned text representation as the prior. A prosody style representation is extracted from the full-length input speech as a global style embedding.

Owing to the semantic nature of the self-supervised speech representation, the TTV framework can transfer a prosody style that is almost independent of the voice style. To increase the linguistic capacity of the semantic representation, the latent representation is fed to the phoneme encoder and a connectionist temporal classification (CTC) loss is minimized; we found that this improves text-speech alignment, significantly decreasing the CER and WER of synthetic speech. Furthermore, we use a Transformer-based normalizing flow with AdaLN-Zero for better prosody adaptation.
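To make the two auxiliary mechanisms concrete, the sketch below shows (i) a CTC objective applied to the VAE latent through a phoneme encoder, and (ii) an AdaLN-Zero Transformer block of the kind that could condition the normalizing flow on a global prosody style embedding. This is a minimal PyTorch illustration under stated assumptions, not the authors' implementation: `phoneme_encoder`, the module names, and all dimensions are hypothetical.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

def ctc_aux_loss(latent, phoneme_encoder, targets, input_lengths, target_lengths):
    """Auxiliary CTC loss on the VAE latent via a phoneme encoder (assumed module)."""
    # latent: (batch, time, dim) -> per-frame phoneme log-probs: (time, batch, n_phonemes)
    log_probs = phoneme_encoder(latent).log_softmax(-1).transpose(0, 1)
    return F.ctc_loss(log_probs, targets, input_lengths, target_lengths, blank=0)

class AdaLNZeroBlock(nn.Module):
    """Transformer block conditioned on a global prosody style embedding via AdaLN-Zero."""
    def __init__(self, dim: int, style_dim: int, n_heads: int = 4):
        super().__init__()
        self.norm1 = nn.LayerNorm(dim, elementwise_affine=False)
        self.attn = nn.MultiheadAttention(dim, n_heads, batch_first=True)
        self.norm2 = nn.LayerNorm(dim, elementwise_affine=False)
        self.mlp = nn.Sequential(
            nn.Linear(dim, 4 * dim), nn.GELU(), nn.Linear(4 * dim, dim)
        )
        # One projection emits shift/scale/gate for both sub-layers; it is
        # zero-initialized so each residual branch starts as an identity
        # mapping, which is the "-Zero" part of AdaLN-Zero.
        self.ada = nn.Linear(style_dim, 6 * dim)
        nn.init.zeros_(self.ada.weight)
        nn.init.zeros_(self.ada.bias)

    def forward(self, x: torch.Tensor, style: torch.Tensor) -> torch.Tensor:
        # x: (batch, time, dim); style: (batch, style_dim)
        s1, b1, g1, s2, b2, g2 = self.ada(style).unsqueeze(1).chunk(6, dim=-1)
        h = self.norm1(x) * (1 + s1) + b1
        x = x + g1 * self.attn(h, h, h, need_weights=False)[0]
        h = self.norm2(x) * (1 + s2) + b2
        return x + g2 * self.mlp(h)

# Example: a 100-frame latent sequence with a 256-dim style embedding.
block = AdaLNZeroBlock(dim=192, style_dim=256)
out = block(torch.randn(2, 100, 192), torch.randn(2, 256))  # -> (2, 100, 192)
```

Zero-initializing the conditioning projection lets the network begin training as an unconditional mapping and fold the prosody style in gradually, which is the usual motivation for preferring AdaLN-Zero over plain AdaLN.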
This paper is available on arXiv under the CC BY-NC-SA 4.0 DEED license.
Authors:
(1) Sang-Hoon Lee, Department of Artificial Intelligence, Korea University, Seoul 02841, South Korea;

(2) Ha-Yeong Choi, Department of Artificial Intelligence, Korea University, Seoul 02841, South Korea;

(3) Seung-Bin Kim, Department of Artificial Intelligence, Korea University, Seoul 02841, South Korea;

(4) Seong-Whan Lee, Fellow, IEEE, Department of Artificial Intelligence, Korea University, Seoul 02841, South Korea, and corresponding author.