Zero-shot Text-to-Speech: How Does the Performance of HierSpeech++ Fare With Other Baselines?


Table of Links

Abstract and 1 Introduction

2 Related Work

2.1 Neural Codec Language Models and 2.2 Non-autoregressive Models

2.3 Diffusion Models and 2.4 Zero-shot Voice Cloning

3 Hierspeech++ and 3.1 Speech Representations

3.2 Hierarchical Speech Synthesizer

3.3 Text-to-Vec

3.4 Speech Super-resolution

3.5 Model Architecture

4 Speech Synthesis Tasks

4.1 Voice Conversion and 4.2 Text-to-Speech

4.3 Style Prompt Replication

5 Experiment and Result, and 5.1 Dataset

5.2 Preprocessing and 5.3 Training

5.4 Evaluation Metrics

5.5 Ablation Study

5.6 Zero-shot Voice Conversion

5.7 High-diversity but High-fidelity Speech Synthesis

5.8 Zero-shot Text-to-Speech

5.9 Zero-shot Text-to-Speech with 1s Prompt

5.10 Speech Super-resolution

5.11 Additional Experiments with Other Baselines

6 Limitation and Quick Fix

7 Conclusion, Acknowledgement and References

5.8 Zero-shot Text-to-Speech

We compared the zero-shot TTS performance of HierSpeech++ with the following baselines: 1) YourTTS, a VITS-based end-to-end TTS model; 2) HierSpeech, an end-to-end TTS model using a hierarchical VAE; 3) VALL-E-X, a multi-lingual zero-shot TTS model based on a neural codec language model, for which we used an unofficial implementation with improved audio quality from a Vocos decoder; and 4) XTTS [12], the XTTS v1 product from Coqui, which builds on the open-source TTS model TorToise [5], one of the first TTS models trained on a large-scale speech dataset. For zero-shot TTS, we utilized noisy speech prompts from the test-clean and test-other subsets of LibriTTS. In TABLEs 7 and 8, HierSpeech++ synthesizes speech with temperatures T_ttv of 0.333 and T_h of 0.333.
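
For context on what the two temperatures mean, here is a minimal sketch of temperature-scaled sampling from a Gaussian prior, in the style of VITS-like models. The `sample_prior` helper is hypothetical and not part of the HierSpeech++ codebase; it simply illustrates that T_ttv and T_h scale the prior noise of the text-to-vec and hierarchical synthesizer stages, so lower values trade diversity for stability.

```python
import torch

def sample_prior(mu: torch.Tensor, log_sigma: torch.Tensor,
                 temperature: float) -> torch.Tensor:
    """Temperature-scaled sampling from a Gaussian prior:
    z = mu + exp(log_sigma) * temperature * noise."""
    return mu + torch.exp(log_sigma) * temperature * torch.randn_like(mu)

# Toy usage: the same prior statistics sampled at the paper's setting (0.333)
# and at the default temperature of 1.0.
mu = torch.zeros(1, 192, 50)
log_sigma = torch.zeros(1, 192, 50)
z_paper = sample_prior(mu, log_sigma, temperature=0.333)   # T_ttv = T_h = 0.333
z_default = sample_prior(mu, log_sigma, temperature=1.0)
print(z_paper.std().item(), z_default.std().item())        # ~0.333 vs. ~1.0
```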


The results demonstrate that our model is a strong zero-shot TTS model in terms of all subjective and objective metrics. We conducted three MOS experiments covering naturalness, prosody, and similarity. Our model significantly outperforms all baselines and even surpasses the ground truth in terms of naturalness. However, XTTS performs better in pMOS, which suggests that learning prosody requires more data to improve expressiveness. Although the other models struggle to synthesize speech from noisy prompts, our model does so robustly. Furthermore, our model achieves a lower CER and WER than the ground truth, which also demonstrates its robustness. In summary, all results demonstrate the superiority of our model in naturalness, expressiveness, and robustness for zero-shot TTS.
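
As a rough illustration of the objective robustness metrics, the snippet below computes CER and WER between reference transcripts and ASR transcriptions of the synthesized speech using the `jiwer` package. The transcripts are placeholders, and the paper's exact ASR model and text normalization are not reproduced here.

```python
import jiwer

# Placeholder transcripts: `references` are the input texts, `hypotheses` are
# ASR transcriptions of the synthesized speech (ASR model not specified here).
references = ["the quick brown fox jumps over the lazy dog"]
hypotheses = ["the quick brown fox jumps over a lazy dog"]

# Lower is better; CER/WER measure how intelligibly the text was synthesized.
print("CER:", jiwer.cer(references, hypotheses))
print("WER:", jiwer.wer(references, hypotheses))
```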


In addition, we can further improve zero-shot TTS performance by introducing style prompt replication (SPR), presented in the following subsection. Note that we do not apply SPR in TABLEs 2-8. The audio can also be upsampled to 48 kHz. Lastly, we can synthesize noise-free speech even from noisy prompts; the details are described in Section 6.


TABLE 9: Results for different speech prompt lengths. We utilize all sentences longer than 10 s from the test-clean subset of LibriTTS (1,002 samples). SPR denotes style prompt replication, in which we replicate a short prompt five times for robust style transfer. Because we randomly slice the speech prompt without considering voiced/unvoiced parts, the results for 1 s prompts are lower than the others. UTMOS is reported with its standard deviation.
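
A minimal sketch of what style prompt replication could look like at the waveform level, assuming a mono 16 kHz prompt. The `replicate_prompt` helper is hypothetical; it randomly slices a 1 s segment and tiles it five times before the prompt is passed to the style encoder, following the description in the caption above.

```python
import numpy as np

def replicate_prompt(wav: np.ndarray, sr: int = 16000,
                     slice_sec: float = 1.0, repeats: int = 5) -> np.ndarray:
    """Randomly slice a short prompt and tile it `repeats` times (SPR).

    Note: the random slice ignores voiced/unvoiced boundaries, which is why
    raw 1 s prompts score lower than longer ones in TABLE 9.
    """
    slice_len = int(sr * slice_sec)
    if len(wav) <= slice_len:
        short = wav
    else:
        start = np.random.randint(0, len(wav) - slice_len)
        short = wav[start:start + slice_len]
    return np.tile(short, repeats)

# Toy usage: a 1 s slice from a 10 s prompt becomes a 5 s replicated prompt.
prompt = np.random.randn(16000 * 10).astype(np.float32)
replicated = replicate_prompt(prompt)
print(replicated.shape)  # (80000,)
```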


This paper is available on arxiv under CC BY-NC-SA 4.0 DEED license.


[12]. https://github.com/coqui-ai/TTS

Authors:

(1) Sang-Hoon Lee, Department of Artificial Intelligence, Korea University, Seoul 02841, South Korea;

(2) Ha-Yeong Choi, Department of Artificial Intelligence, Korea University, Seoul 02841, South Korea;

(3) Seung-Bin Kim, Department of Artificial Intelligence, Korea University, Seoul 02841, South Korea;

(4) Seong-Whan Lee, Fellow, IEEE, Department of Artificial Intelligence, Korea University, Seoul 02841, South Korea, and the corresponding author.

