2 Related Work
2.1 Neural Codec Language Models and 2.2 Non-autoregressive Models
2.3 Diffusion Models and 2.4 Zero-shot Voice Cloning
3 HierSpeech++ and 3.1 Speech Representations
3.2 Hierarchical Speech Synthesizer
4 Speech Synthesis Tasks
4.1 Voice Conversion and 4.2 Text-to-Speech
5 Experiment and Result, and Dataset
5.2 Preprocessing and 5.3 Training
5.6 Zero-shot Voice Conversion
5.7 High-diversity but High-fidelity Speech Synthesis
5.9 Zero-shot Text-to-Speech with 1s Prompt
5.11 Additional Experiments with Other Baselines
7 Conclusion, Acknowledgement and References
5.2 Preprocessing

We utilized MMS (0.3B) [65], a Wav2Vec 2.0 model pre-trained on a massively large-scale cross-lingual speech dataset containing speech from 1,000 languages. To map the semantic and acoustic representations, we used a hop size of 320 to extract a linear spectrogram. For the style encoder, we utilized a Mel-spectrogram with 80 bins. We extracted F0 using the YAAPT algorithm [30] with a hop size of 80. For phoneme transformation, we utilized an International Phonetic Alphabet (IPA) sequence obtained with the open-source Phonemizer [4]. Following [48], we did not utilize blank tokens for the target phoneme sequences of the CTC decoder; however, we used blank tokens for the input phoneme sequences.
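As a rough illustration of this preprocessing pipeline, the sketch below extracts the linear spectrogram, the 80-bin Mel-spectrogram, F0, and an IPA phoneme sequence with open-source tools. It is not the authors' code: the 16 kHz mono input, the FFT/window size of 1024, the espeak backend for Phonemizer, and amfm_decompy's pYAAPT as the YAAPT implementation are assumptions; only the hop sizes (320 and 80) and the 80 Mel bins come from the text.

```python
# Minimal preprocessing sketch (assumptions noted above); semantic features from the
# pre-trained MMS/Wav2Vec 2.0 model are not shown here.
import torchaudio
from amfm_decompy import basic_tools, pYAAPT
from phonemizer import phonemize

def preprocess(wav_path: str, text: str):
    wav, sr = torchaudio.load(wav_path)          # [channels, samples]
    wav = wav.mean(dim=0, keepdim=True)          # assume 16 kHz input, force mono

    # Linear spectrogram with hop size 320, used to map semantic and acoustic representations.
    linear_spec = torchaudio.transforms.Spectrogram(
        n_fft=1024, win_length=1024, hop_length=320, power=1
    )(wav)

    # 80-bin Mel-spectrogram for the style encoder.
    mel_spec = torchaudio.transforms.MelSpectrogram(
        sample_rate=sr, n_fft=1024, win_length=1024, hop_length=320, n_mels=80
    )(wav)

    # F0 with a hop of 80 samples (5 ms at 16 kHz) via a YAAPT implementation.
    signal = basic_tools.SignalObj(wav_path)
    f0 = pYAAPT.yaapt(signal, frame_space=80 / sr * 1000).samp_values

    # IPA phoneme sequence via the open-source Phonemizer (espeak backend assumed).
    ipa = phonemize(text, language="en-us", backend="espeak", strip=True)
    return linear_spec, mel_spec, f0, ipa
```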
5.3 Training

For reproducibility, we will release the source code, and the details of all hyperparameters will be provided at https://github.com/sh-lee-prml/HierSpeechpp. We trained HierSpeech++ using the AdamW optimizer [57] with β1 = 0.8, β2 = 0.99, and weight decay λ = 0.01, and applied a learning rate schedule with a decay of 0.999^(1/8) and an initial learning rate of 1 × 10⁻⁴ for the HierSpeech++ model trained on the LibriTTS dataset, using a batch size of 80 for 1,200k steps on four NVIDIA A6000 GPUs. The final model, trained on the entire dataset, continued training from the LibriTTS-trained HierSpeech++ with the decay reduced to 0.999; it was trained with a batch size of 160 for 1,000k steps on eight NVIDIA A6000 GPUs. For the ablation study, the models were trained with a batch size of 80 for 300k steps. For efficient training, we sliced the audio into segments of 61,440 frames and used windowed generator training with windows of 9,600 frames. HierSpeech++ consists of 63M parameters for inference and an additional 34M parameters used only during training.

For TTV, we trained the model using the AdamW optimizer [57] with β1 = 0.8, β2 = 0.99, and weight decay λ = 0.01, and applied a learning rate schedule with a decay of 0.999 and an initial learning rate of 2 × 10⁻⁴, using a batch size of 128 for 950k steps on four NVIDIA A100 GPUs. TTV consists of 107M parameters. For SpeechSR, we utilized the same configuration as BigVGAN and trained the model with a batch size of 128 for 100k steps on four NVIDIA A6000 GPUs.
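The reported optimizer and schedule map directly onto standard PyTorch components; a minimal sketch is shown below. The stand-in module and the per-step application of the exponential decay are assumptions, since the section does not state whether the 0.999^(1/8) decay is applied per step or per epoch.

```python
# Optimizer/LR-schedule sketch of the reported hyperparameters; this is not the
# released training code. The stand-in module and per-step scheduler stepping are
# assumptions (per-epoch stepping is also plausible in VITS-style recipes).
import torch

model = torch.nn.Linear(16, 16)   # placeholder; replace with the hierarchical speech synthesizer

optimizer = torch.optim.AdamW(
    model.parameters(),
    lr=1e-4,                      # 2e-4 is reported for the TTV model
    betas=(0.8, 0.99),
    weight_decay=0.01,
)
scheduler = torch.optim.lr_scheduler.ExponentialLR(
    optimizer,
    gamma=0.999 ** (1 / 8),       # 0.999 for the full-dataset fine-tuning and for TTV
)

# Tiny dummy loop showing where the schedule advances (real training runs for 1,200k steps).
for step in range(3):
    loss = model(torch.randn(8, 16)).pow(2).mean()
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    scheduler.step()
```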
This paper is available on arXiv under the CC BY-NC-SA 4.0 DEED license.
Authors:
(1) Sang-Hoon Lee, with the Department of Artificial Intelligence, Korea University, Seoul 02841, South Korea;
(2) Ha-Yeong Choi, with the Department of Artificial Intelligence, Korea University, Seoul 02841, South Korea;
(3) Seung-Bin Kim, with the Department of Artificial Intelligence, Korea University, Seoul 02841, South Korea;
(4) Seong-Whan Lee, Fellow, IEEE, with the Department of Artificial Intelligence, Korea University, Seoul 02841, South Korea (corresponding author).