How We Used the LibriTTS Dataset to Train the Hierarchical Speech Synthesizer

Abstract and 1 Introduction

2 Related Work

2.1 Neural Codec Language Models and 2.2 Non-autoregressive Models

2.3 Diffusion Models and 2.4 Zero-shot Voice Cloning

3 Hierspeech++ and 3.1 Speech Representations

3.2 Hierarchical Speech Synthesizer

3.3 Text-to-Vec

3.4 Speech Super-resolution

3.5 Model Architecture

4 Speech Synthesis Tasks

4.1 Voice Conversion and 4.2 Text-to-Speech

4.3 Style Prompt Replication

5 Experiment and Result, and Dataset

5.2 Preprocessing and 5.3 Training

5.4 Evaluation Metrics

5.5 Ablation Study

5.6 Zero-shot Voice Conversion

5.7 High-diversity but High-fidelity Speech Synthesis

5.8 Zero-shot Text-to-Speech

5.9 Zero-shot Text-to-Speech with 1s Prompt

5.10 Speech Super-resolution

5.11 Additional Experiments with Other Baselines

6 Limitation and Quick Fix

7 Conclusion, Acknowledgement and References

5 EXPERIMENT AND RESULT

TABLE 1: Training Dataset. We use publicly available speech datasets to train the model. For TTV, we use only the LibriTTS dataset.

5.1 Dataset

We used the LibriTTS dataset [90] to train the hierarchical speech synthesizer. First, we trained the model on the train-clean subsets of LibriTTS (train-clean-100 and train-clean-360) for a fair comparison. We then added the train-other-500 subset of LibriTTS for better voice style transfer. Furthermore, we scaled up the dataset to 1k hours to improve robustness and diversity, as indicated in TABLE 1 [2]. For the Libri-light [27] and Multi-Speaker Speech Synthesis (MSSS) datasets, the latter from AIHub [3], we sampled a small portion of speech from each speaker. We also used the EXPRESSO [61] and NIKL [4] datasets. We downsampled the audio to 16 kHz and normalized it to the range [-0.95, 0.95]. For text-to-vec, we used all the train subsets of LibriTTS. For SpeechSR, we used the VCTK dataset [76], which has a sampling rate of 48 kHz, to compare the models. However, we also trained the model on a large-scale dataset that includes MSSS, VCTK, and EXPRESSO for better speech super-resolution performance.
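The amplitude normalization step mentioned above can be illustrated with a minimal sketch. The text does not say how the scaling into [-0.95, 0.95] is computed, so this example assumes simple peak normalization. The function name `peak_normalize` and the constant `TARGET_PEAK` are hypothetical and not taken from the authors' code. Resampling to 16 kHz is left out here and would normally be handled by an audio library before this step.

```python
from typing import List, Sequence

# From the text: audio is scaled into the range [-0.95, 0.95].
TARGET_PEAK = 0.95


def peak_normalize(samples: Sequence[float], peak: float = TARGET_PEAK) -> List[float]:
    """Scale a waveform so its largest absolute sample equals `peak`.

    Assumption: the paper does not describe the exact scaling method;
    peak normalization is one plausible reading of the stated range.
    """
    max_abs = max((abs(s) for s in samples), default=0.0)
    if max_abs == 0.0:
        # Silent or empty clip: there is nothing to scale.
        return list(samples)
    gain = peak / max_abs
    return [s * gain for s in samples]


if __name__ == "__main__":
    # Toy waveform standing in for a clip already resampled to 16 kHz.
    clip = [0.0, 0.1, -0.5, 0.25]
    normalized = peak_normalize(clip)
    print(max(abs(s) for s in normalized))
```

Scaling to a peak slightly below 1.0 leaves a small margin against clipping when the audio is later written to a fixed-point format. Whether the authors chose 0.95 for that reason is not stated in the text.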


This paper is available on arXiv under the CC BY-NC-SA 4.0 DEED license.


[2] Although we hope to increase the data scale to over 10k hours, this is the maximum that our academic resources allow.


[3] https://aihub.or.kr


[4] https://www.nia.or.kr/

Authors:

(1) Sang-Hoon Lee, Fellow, IEEE with the Department of Artificial Intelligence, Korea University, Seoul 02841, South Korea;

(2) Ha-Yeong Choi, Fellow, IEEE with the Department of Artificial Intelligence, Korea University, Seoul 02841, South Korea;

(3) Seung-Bin Kim, Fellow, IEEE with the Department of Artificial Intelligence, Korea University, Seoul 02841, South Korea;

(4) Seong-Whan Lee, Fellow, IEEE with the Department of Artificial Intelligence, Korea University, Seoul 02841, South Korea (corresponding author).