
A Text-To-Vec Model That Can Generate A Semantic Representation and F0 From A Text Sequence


Too Long; Didn't Read

Following VITS [35], we utilize a variational autoencoder and monotonic alignment search (MAS) to align the text and speech internally.

Abstract and 1 Introduction

2 Related Work

2.1 Neural Codec Language Models and 2.2 Non-autoregressive Models

2.3 Diffusion Models and 2.4 Zero-shot Voice Cloning

3 HierSpeech++ and 3.1 Speech Representations

3.2 Hierarchical Speech Synthesizer

3.3 Text-to-Vec

3.4 Speech Super-resolution

3.5 Model Architecture

4 Speech Synthesis Tasks

4.1 Voice Conversion and 4.2 Text-to-Speech

4.3 Style Prompt Replication

5 Experiment and Result, and 5.1 Dataset

5.2 Preprocessing and 5.3 Training

5.4 Evaluation Metrics

5.5 Ablation Study

5.6 Zero-shot Voice Conversion

5.7 High-diversity but High-fidelity Speech Synthesis

5.8 Zero-shot Text-to-Speech

5.9 Zero-shot Text-to-Speech with 1s Prompt

5.10 Speech Super-resolution

5.11 Additional Experiments with Other Baselines

6 Limitation and Quick Fix

7 Conclusion, Acknowledgement and References

3.3 Text-to-Vec

For TTS, we introduce a text-to-vec (TTV) model that generates a semantic representation and F0 from a text sequence. Following VITS [35], we utilize a variational autoencoder and monotonic alignment search (MAS) to align the text and speech internally, as shown in Fig. 4. We replace the linear spectrogram with a self-supervised speech representation as the input to the posterior encoder, and the TTV reconstructs the same self-supervised speech representation as its output. Furthermore, we predict F0 at 4× the resolution of the self-supervised speech representation. We use a text sequence and a prosody prompt as conditional information to generate the self-supervised speech representation of the data, and we utilize a prosody-conditional text representation as the prior. A prosody style representation is extracted from the full-length input speech as a global style embedding. Owing to the semantic nature of the self-supervised speech representation, the TTV framework can transfer a prosody style that is largely independent of the voice style. To increase the linguistic capacity of the semantic representation, the latent representation is fed to a phoneme encoder and the connectionist temporal classification (CTC) loss is minimized. We found that this improves text-speech alignment, significantly decreasing the CER and WER of the synthesized speech. Furthermore, we use a Transformer-based normalizing flow with AdaLN-Zero for better prosody adaptation.
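
To make these components concrete, here is a minimal PyTorch sketch of the TTV building blocks: a posterior encoder that takes self-supervised features in place of a linear spectrogram, a CTC head over the latent representation, an F0 predictor operating at 4× the feature resolution, and an AdaLN-Zero block of the kind used in the Transformer-based flow. All module names, dimensions, and the simplified structure are illustrative assumptions, not the authors' implementation.

```python
# Minimal sketch of the TTV building blocks described above.
# Sizes, names, and structure are assumptions for illustration only.
import torch
import torch.nn as nn
import torch.nn.functional as F


class AdaLNZero(nn.Module):
    """Adaptive LayerNorm with zero-initialized modulation (AdaLN-Zero).

    A global style embedding predicts per-channel scale, shift, and gate;
    the projection is zero-initialized so each block starts as identity.
    """

    def __init__(self, dim: int, style_dim: int):
        super().__init__()
        self.norm = nn.LayerNorm(dim, elementwise_affine=False)
        self.to_mod = nn.Linear(style_dim, 3 * dim)
        nn.init.zeros_(self.to_mod.weight)  # the "zero" in AdaLN-Zero
        nn.init.zeros_(self.to_mod.bias)

    def forward(self, x, style, sublayer):
        # x: (B, T, dim), style: (B, style_dim)
        scale, shift, gate = self.to_mod(style).chunk(3, dim=-1)
        h = self.norm(x) * (1 + scale.unsqueeze(1)) + shift.unsqueeze(1)
        return x + gate.unsqueeze(1) * sublayer(h)


class TTVSketch(nn.Module):
    """Toy text-to-vec pieces: a posterior encoder over self-supervised
    (SSL) speech features, a CTC head on the latent, and an F0 predictor
    at 4x the SSL frame rate."""

    def __init__(self, ssl_dim=1024, latent_dim=192, n_phones=100):
        super().__init__()
        # Posterior encoder: SSL features replace the linear spectrogram.
        self.posterior = nn.Conv1d(ssl_dim, 2 * latent_dim, 5, padding=2)
        # Phoneme classifier on the latent, trained with CTC to sharpen
        # the linguistic content of the semantic representation.
        self.ctc_head = nn.Linear(latent_dim, n_phones + 1)  # +1 = blank
        # F0 predictor: 4x temporal upsampling over the SSL frame rate.
        self.f0_pred = nn.Sequential(
            nn.ConvTranspose1d(latent_dim, latent_dim, 4, stride=4),
            nn.GELU(),
            nn.Conv1d(latent_dim, 1, 3, padding=1),
        )

    def forward(self, ssl_feats, phones, phone_lens, feat_lens):
        # ssl_feats: (B, ssl_dim, T) self-supervised speech features
        mean, logvar = self.posterior(ssl_feats).chunk(2, dim=1)
        z = mean + torch.randn_like(mean) * torch.exp(0.5 * logvar)

        # CTC loss over the latent sequence (time-major log-probs).
        log_probs = F.log_softmax(self.ctc_head(z.transpose(1, 2)), dim=-1)
        ctc = F.ctc_loss(log_probs.transpose(0, 1), phones,
                         feat_lens, phone_lens, blank=0)

        f0 = self.f0_pred(z)  # (B, 1, 4*T): 4x-resolution F0 track
        return z, f0, ctc


if __name__ == "__main__":
    ttv = TTVSketch()
    feats = torch.randn(2, 1024, 50)
    phones = torch.randint(1, 101, (2, 12))
    z, f0, ctc = ttv(feats, phones,
                     torch.full((2,), 12), torch.full((2,), 50))
    print(z.shape, f0.shape, ctc.item())  # (2,192,50) (2,1,200) scalar

    # AdaLN-Zero block conditioned on a global style embedding.
    block = AdaLNZero(dim=192, style_dim=256)
    out = block(torch.randn(2, 50, 192), torch.randn(2, 256),
                nn.Linear(192, 192))
```

The zero-initialized modulation makes each conditioned block start as an identity function, which tends to stabilize training when a strong global style embedding is injected into the flow.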


Fig. 4: Text-to-Vec


This paper is available on arXiv under the CC BY-NC-SA 4.0 DEED license.

Authors:

(1) Sang-Hoon Lee, Department of Artificial Intelligence, Korea University, Seoul 02841, South Korea;

(2) Ha-Yeong Choi, Department of Artificial Intelligence, Korea University, Seoul 02841, South Korea;

(3) Seung-Bin Kim, Department of Artificial Intelligence, Korea University, Seoul 02841, South Korea;

(4) Seong-Whan Lee, Fellow, IEEE, Department of Artificial Intelligence, Korea University, Seoul 02841, South Korea (corresponding author).