
The Model Architecture for Text-to-Vec


Too Long; Didn't Read

The content encoder of the TTV consists of 16 layers of non-causal WaveNet with a hidden size of 256 and a kernel size of five.

Abstract and 1 Introduction

2 Related Work

2.1 Neural Codec Language Models and 2.2 Non-autoregressive Models

2.3 Diffusion Models and 2.4 Zero-shot Voice Cloning

3 Hierspeech++ and 3.1 Speech Representations

3.2 Hierarchical Speech Synthesizer

3.3 Text-to-Vec

3.4 Speech Super-resolution

3.5 Model Architecture

4 Speech Synthesis Tasks

4.1 Voice Conversion and 4.2 Text-to-Speech

4.3 Style Prompt Replication

5 Experiment and Result, and 5.1 Dataset

5.2 Preprocessing and 5.3 Training

5.4 Evaluation Metrics

5.5 Ablation Study

5.6 Zero-shot Voice Conversion

5.7 High-diversity but High-fidelity Speech Synthesis

5.8 Zero-shot Text-to-Speech

5.9 Zero-shot Text-to-Speech with 1s Prompt

5.10 Speech Super-resolution

5.11 Additional Experiments with Other Baselines

6 Limitation and Quick Fix

7 Conclusion, Acknowledgement and References

3.5 Model Architecture

3.5.1 Text-to-Vec


The content encoder of the TTV consists of 16 layers of non-causal WaveNet with a hidden size of 256 and a kernel size of five. The content decoder consists of eight layers of non-causal WaveNet with a hidden size of 512 and a kernel size of five. The text encoder is composed of three unconditional Transformer networks and three prosody-conditional Transformer networks with a kernel size of nine, a hidden size of 256, and a filter size of 1024. We utilize a dropout rate of 0.2 for the text encoder. T-Flow consists of four residual coupling layers, each composed of a preConv, three Transformer blocks, and a postConv. We adopt convolutional neural networks with a kernel size of five in the Transformer blocks to encode adjacent information, and AdaLN-Zero for better prosody style adaptation. We utilize a hidden size of 256, a filter size of 1024, and four attention heads for T-Flow, with a dropout rate of 0.1. For the pitch predictor, we utilize a source generator with the same structure as that of the HAG.
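To make these hyperparameters easier to scan, the sketch below collects them into plain Python dataclasses. The class and field names (WaveNetConfig, TTVConfig, and so on) are illustrative assumptions for this article, not identifiers from the authors' code; the values are taken directly from the paragraph above.

```python
# Illustrative configuration sketch of the TTV architecture hyperparameters.
# Names are hypothetical; values come from the paper's description.
from dataclasses import dataclass, field


@dataclass
class WaveNetConfig:
    layers: int
    hidden_size: int
    kernel_size: int
    causal: bool = False  # the paper uses non-causal WaveNet throughout


@dataclass
class TextEncoderConfig:
    unconditional_blocks: int = 3        # unconditional Transformer networks
    prosody_conditional_blocks: int = 3  # prosody-conditional Transformer networks
    kernel_size: int = 9
    hidden_size: int = 256
    filter_size: int = 1024
    dropout: float = 0.2


@dataclass
class TFlowConfig:
    # Each residual coupling layer: preConv -> 3 Transformer blocks -> postConv.
    coupling_layers: int = 4
    transformer_blocks_per_layer: int = 3
    conv_kernel_size: int = 5    # CNNs in Transformer blocks encode adjacent information
    hidden_size: int = 256
    filter_size: int = 1024
    attention_heads: int = 4
    dropout: float = 0.1
    norm: str = "AdaLN-Zero"     # for prosody style adaptation


@dataclass
class TTVConfig:
    content_encoder: WaveNetConfig = field(
        default_factory=lambda: WaveNetConfig(layers=16, hidden_size=256, kernel_size=5))
    content_decoder: WaveNetConfig = field(
        default_factory=lambda: WaveNetConfig(layers=8, hidden_size=512, kernel_size=5))
    text_encoder: TextEncoderConfig = field(default_factory=TextEncoderConfig)
    t_flow: TFlowConfig = field(default_factory=TFlowConfig)


if __name__ == "__main__":
    print(TTVConfig())
```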


Fig. 6: Inference scenarios for voice conversion and text-to-speech.


3.5.2 SpeechSR


SpeechSR consists of a single AMP block with an initial channel of 32 and no upsampling layer; instead, we utilize a nearest-neighbor (NN) upsampler to upsample the hidden representations. For the discriminator, we utilize the MPD with periods of [2, 3, 5, 7, 11] and the MS-STFTD with six different window sizes ([4096, 2048, 1024, 512, 256, 128]). Additionally, we utilize the DWTD, which has four sub-band discriminators.
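As with TTV, the SpeechSR hyperparameters can be summarized in a small config sketch. Again, the class and field names here are illustrative assumptions rather than the authors' identifiers; the values mirror the paragraph above.

```python
# Illustrative configuration sketch of the SpeechSR hyperparameters.
# Names are hypothetical; values come from the paper's description.
from dataclasses import dataclass, field
from typing import List


@dataclass
class GeneratorConfig:
    amp_blocks: int = 1                   # a single AMP block
    initial_channels: int = 32
    upsampler: str = "nearest-neighbor"   # NN upsampling instead of a learned upsampling layer


@dataclass
class DiscriminatorConfig:
    mpd_periods: List[int] = field(default_factory=lambda: [2, 3, 5, 7, 11])
    ms_stftd_windows: List[int] = field(
        default_factory=lambda: [4096, 2048, 1024, 512, 256, 128])
    dwtd_sub_bands: int = 4               # DWTD with four sub-band discriminators


@dataclass
class SpeechSRConfig:
    generator: GeneratorConfig = field(default_factory=GeneratorConfig)
    discriminator: DiscriminatorConfig = field(default_factory=DiscriminatorConfig)


if __name__ == "__main__":
    print(SpeechSRConfig())
```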


This paper is available on arxiv under CC BY-NC-SA 4.0 DEED license.

Authors:

(1) Sang-Hoon Lee, Fellow, IEEE, with the Department of Artificial Intelligence, Korea University, Seoul 02841, South Korea;

(2) Ha-Yeong Choi, Fellow, IEEE, with the Department of Artificial Intelligence, Korea University, Seoul 02841, South Korea;

(3) Seung-Bin Kim, Fellow, IEEE, with the Department of Artificial Intelligence, Korea University, Seoul 02841, South Korea;

(4) Seong-Whan Lee, Fellow, IEEE, with the Department of Artificial Intelligence, Korea University, Seoul 02841, South Korea; corresponding author.