Speech Synthesis Tasks We Had to Complete: Voice Conversion and Text-to-Speech

by @fewshot


Too Long; Didn't Read

For voice conversion, we first extract the semantic representation from the 16 kHz audio with MMS, and F0 with the YAAPT algorithm.

Abstract and 1 Introduction

2 Related Work

2.1 Neural Codec Language Models and 2.2 Non-autoregressive Models

2.3 Diffusion Models and 2.4 Zero-shot Voice Cloning

3 Hierspeech++ and 3.1 Speech Representations

3.2 Hierarchical Speech Synthesizer

3.3 Text-to-Vec

3.4 Speech Super-resolution

3.5 Model Architecture

4 Speech Synthesis Tasks

4.1 Voice Conversion and 4.2 Text-to-Speech

4.3 Style Prompt Replication

5 Experiment and Result, and 5.1 Dataset

5.2 Preprocessing and 5.3 Training

5.4 Evaluation Metrics

5.5 Ablation Study

5.6 Zero-shot Voice Conversion

5.7 High-diversity but High-fidelity Speech Synthesis

5.8 Zero-shot Text-to-Speech

5.9 Zero-shot Text-to-Speech with 1s Prompt

5.10 Speech Super-resolution

5.11 Additional Experiments with Other Baselines

6 Limitation and Quick Fix

7 Conclusion, Acknowledgement and References

4 SPEECH SYNTHESIS TASKS

4.1 Voice Conversion

Fig. 6 illustrates the entire inference pipeline. For voice conversion, we first extract the semantic representation from the 16 kHz audio with MMS, and F0 with the YAAPT algorithm. Before feeding F0 to the hierarchical speech synthesizer, we normalize it using the mean and standard deviation of the source speech, then denormalize it using the mean and standard deviation of the target speech. The speech synthesizer generates 16 kHz speech in the target voice style extracted from the target voice prompt, and SpeechSR can upsample the synthesized speech to high-resolution 48 kHz speech. For a fair comparison, we do not utilize SpeechSR when evaluating VC performance.
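The F0 normalize/denormalize step above amounts to a per-utterance mean/standard-deviation transfer. A minimal sketch follows; the function name and the voiced-frame handling (treating F0 = 0 as unvoiced and excluding it from the statistics) are our assumptions for illustration, not details taken from the paper.

```python
import numpy as np

def convert_f0(source_f0: np.ndarray, target_f0: np.ndarray) -> np.ndarray:
    """Transfer mean/std F0 statistics from the source to the target speaker.

    Assumes F0 contours where unvoiced frames are encoded as 0.
    """
    src_voiced = source_f0[source_f0 > 0]
    tgt_voiced = target_f0[target_f0 > 0]
    # Normalize by source statistics, then denormalize by target statistics.
    normalized = (source_f0 - src_voiced.mean()) / (src_voiced.std() + 1e-8)
    converted = np.where(
        source_f0 > 0,
        normalized * tgt_voiced.std() + tgt_voiced.mean(),
        0.0,  # keep unvoiced frames at zero
    )
    return converted
```

After conversion, the voiced frames of the output contour match the target speaker's F0 mean and variance while preserving the source's relative pitch movements.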

4.2 Text-to-Speech

For text-to-speech, we extract semantic representations from text instead of speech. The TTV generates a semantic representation carrying the target prosody from the prosody prompt. The hierarchical speech synthesizer then generates speech from this semantic representation, and SpeechSR can upsample the output from 16 kHz to 48 kHz. For a fair comparison, SpeechSR was not used during the TTS evaluation. Note that the prosody and voice styles can be transferred from different target prompts, respectively.
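The TTS inference flow above (text → TTV → hierarchical speech synthesizer → optional SpeechSR) can be sketched as a small orchestration function. Here `ttv`, `synthesizer`, and `speech_sr` are hypothetical callables standing in for the actual modules; their signatures are assumptions for illustration.

```python
from typing import Callable, Optional

def tts_inference(
    text: str,
    prosody_prompt,
    voice_prompt,
    ttv: Callable,
    synthesizer: Callable,
    speech_sr: Optional[Callable] = None,
):
    """Hypothetical sketch of the HierSpeech++ TTS inference pipeline."""
    # 1. TTV: text -> semantic representation, prosody from the prosody prompt.
    semantic = ttv(text, prosody_prompt)
    # 2. Hierarchical speech synthesizer: semantic -> 16 kHz waveform
    #    in the voice style of the (possibly different) voice prompt.
    wav_16k = synthesizer(semantic, voice_prompt)
    # 3. Optional super-resolution: 16 kHz -> 48 kHz.
    return speech_sr(wav_16k) if speech_sr is not None else wav_16k
```

Passing separate prosody and voice prompts reflects the paper's point that the two styles can be transferred independently.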


This paper is available on arxiv under CC BY-NC-SA 4.0 DEED license.

Authors:

(1) Sang-Hoon Lee, Fellow, IEEE with the Department of Artificial Intelligence, Korea University, Seoul 02841, South Korea;

(2) Ha-Yeong Choi, Fellow, IEEE with the Department of Artificial Intelligence, Korea University, Seoul 02841, South Korea;

(3) Seung-Bin Kim, Fellow, IEEE with the Department of Artificial Intelligence, Korea University, Seoul 02841, South Korea;

(4) Seong-Whan Lee, Fellow, IEEE with the Department of Artificial Intelligence, Korea University, Seoul 02841, South Korea, and the corresponding author.