2 Related Work
2.1 Neural Codec Language Models and 2.2 Non-autoregressive Models
2.3 Diffusion Models and 2.4 Zero-shot Voice Cloning
3 HierSpeech++ and 3.1 Speech Representations
3.2 Hierarchical Speech Synthesizer
4 Speech Synthesis Tasks
4.1 Voice Conversion and 4.2 Text-to-Speech
5 Experiment and Result, and Dataset
5.2 Preprocessing and 5.3 Training
5.6 Zero-shot Voice Conversion
5.7 High-diversity but High-fidelity Speech Synthesis
5.9 Zero-shot Text-to-Speech with 1s Prompt
5.11 Additional Experiments with Other Baselines
7 Conclusion, Acknowledgement and References
For the reconstruction and resynthesis tasks, we evaluated the models with the following objective metrics: a log-scale Mel error distance (Mel), perceptual evaluation of speech quality (PESQ)[5], pitch error (Pitch), periodicity error (Period.), voiced/unvoiced (V/UV) F1 score, and log-scale F0 consistency (F0c). We used the official implementation of CARGAN [60] for pitch, periodicity, and V/UV F1[6]. For F0c, we calculated the L1 distance between the log-scale ground-truth F0 and the F0 predicted by the HAG.
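As a reference point, the sketch below illustrates how the Mel distance and F0c metrics can be computed. The tensor shapes, voiced-frame masking, and epsilon value are our own assumptions, not the paper's exact implementation.

```python
# Minimal sketch of the Mel distance and log-F0 consistency (F0c) metrics.
# Shapes, voiced-frame masking, and the epsilon value are assumptions,
# not the paper's exact implementation.
import torch

def mel_l1_distance(mel_gt: torch.Tensor, mel_pred: torch.Tensor) -> float:
    """L1 distance between log-scale mel spectrograms of shape [n_mels, frames]."""
    return torch.mean(torch.abs(mel_gt - mel_pred)).item()

def f0_consistency(f0_gt: torch.Tensor, f0_pred: torch.Tensor, eps: float = 1e-5) -> float:
    """L1 distance between log-scale ground-truth and predicted F0, on voiced frames."""
    voiced = (f0_gt > 0) & (f0_pred > 0)
    log_gt = torch.log(f0_gt[voiced] + eps)
    log_pred = torch.log(f0_pred[voiced] + eps)
    return torch.mean(torch.abs(log_gt - log_pred)).item()
```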
For VC, we used two subjective metrics, naturalness mean opinion score (nMOS) and voice similarity MOS (sMOS), reported with a 95% confidence interval; three objective metrics for naturalness, UTMOS [69], character error rate (CER), and word error rate (WER); and two objective metrics for similarity, automatic speaker verification equal error rate (EER) and speaker encoder cosine similarity (SECS). We utilized the open-source UTMOS[7], an MOS prediction model, as the naturalness metric. Although it cannot be considered an absolute evaluation metric, we believe it is a simple way to estimate the audio quality of synthetic speech, and it does not require ground-truth audio or labels to compute the score. We therefore highly recommend adding this metric to validation, which only takes a single line of code. For CER and WER, we used Whisper's official implementation with the large model (1,550M parameters) and calculated both metrics after text normalization, as in the official implementation. For EER, we utilized a pre-trained automatic speaker verification model [42][8] trained on the large-scale VoxCeleb2 dataset [14]. In [13], the effectiveness of metric learning for automatic speaker verification was demonstrated, and [42] introduced online data augmentation, which decreased the EER from 2.17% to 1.17%. In addition, we used the pre-trained speaker encoder Resemblyzer[9] to extract speaker representations and calculated the cosine similarity between the speaker representations of the target and synthetic speech.
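For reproducibility, the following hedged sketch shows how the SECS and UTMOS scores can be obtained with the public Resemblyzer and SpeechMOS packages linked in the footnotes. The torch.hub tag and checkpoint name follow the SpeechMOS README and are assumptions rather than the paper's exact configuration.

```python
# Hedged sketch of the SECS and UTMOS measurements described above, using the
# public Resemblyzer and SpeechMOS packages (see footnotes [7] and [9]).
# The torch.hub tag and entry point follow the SpeechMOS README and may differ
# from the exact setup used in the paper.
import numpy as np
import torch
from resemblyzer import VoiceEncoder, preprocess_wav

def secs(target_path: str, synth_path: str) -> float:
    """Speaker-encoder cosine similarity between target and synthetic speech."""
    encoder = VoiceEncoder()
    e_tgt = encoder.embed_utterance(preprocess_wav(target_path))
    e_syn = encoder.embed_utterance(preprocess_wav(synth_path))
    return float(np.dot(e_tgt, e_syn) / (np.linalg.norm(e_tgt) * np.linalg.norm(e_syn)))

def utmos(wave: torch.Tensor, sample_rate: int) -> float:
    """Predicted naturalness MOS; this is the single-line UTMOS call mentioned above."""
    predictor = torch.hub.load("tarepan/SpeechMOS:v1.2.0", "utmos22_strong", trust_repo=True)
    return predictor(wave.unsqueeze(0), sample_rate).item()
```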
For TTS, we additionally used a prosody MOS (pMOS). Sixty samples were randomly selected for each model. The nMOS was rated by 10 listeners on a scale of 1-5, and the sMOS and pMOS were rated by 10 listeners on a scale of 1-4. All MOS values are reported with a 95% confidence interval.
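As an illustration, MOS values with a 95% confidence interval can be aggregated as below; the normal-approximation interval is an assumption about how the CI was obtained.

```python
# Illustrative MOS aggregation with a 95% confidence interval, as reported for
# nMOS, sMOS, and pMOS. The normal-approximation interval is an assumption.
import numpy as np

def mos_with_ci(ratings):
    """Return (mean, half-width of the 95% confidence interval) over listener ratings."""
    r = np.asarray(ratings, dtype=float)
    half_width = 1.96 * r.std(ddof=1) / np.sqrt(len(r))
    return r.mean(), half_width

# e.g. a mean of 4.02 with half-width 0.07 is reported as "4.02 (+-0.07)".
```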
This paper is available on arxiv under CC BY-NC-SA 4.0 DEED license.
[5] https://github.com/ludlows/PESQ
[6] https://github.com/descriptinc/cargan
[7] https://github.com/tarepan/SpeechMOS
[8] https://github.com/clovaai/voxceleb_trainer
[9] https://github.com/resemble-ai/Resemblyzer
Authors:
(1) Sang-Hoon Lee, Fellow, IEEE with the Department of Artificial Intelligence, Korea University, Seoul 02841, South Korea;
(2) Ha-Yeong Choi, Fellow, IEEE with the Department of Artificial Intelligence, Korea University, Seoul 02841, South Korea;
(3) Seung-Bin Kim, Fellow, IEEE with the Department of Artificial Intelligence, Korea University, Seoul 02841, South Korea;
(4) Seong-Whan Lee, Fellow, IEEE, with the Department of Artificial Intelligence, Korea University, Seoul 02841, South Korea, and the corresponding author.