2 Related Work
2.1 Neural Codec Language Models and 2.2 Non-autoregressive Models
2.3 Diffusion Models and 2.4 Zero-shot Voice Cloning
3 HierSpeech++ and 3.1 Speech Representations
3.2 Hierarchical Speech Synthesizer
4 Speech Synthesis Tasks
4.1 Voice Conversion and 4.2 Text-to-Speech
5 Experiment and Result, and 5.1 Dataset
5.2 Preprocessing and 5.3 Training
5.6 Zero-shot Voice Conversion
5.7 High-diversity but High-fidelity Speech Synthesis
5.9 Zero-shot Text-to-Speech with 1s Prompt
5.11 Additional Experiments with Other Baselines
7 Conclusion, Acknowledgement and References
For data availability and training efficiency, we train the model on a low-resolution speech dataset. In this stage, we simply upsample the speech waveform from 16 kHz to 48 kHz, as illustrated in Fig. 5. We use only one anti-aliased multi-periodicity composition (AMP) block of BigVGAN, which consists of a low-pass filter and the periodic Snake activation function for inductive bias and anti-aliasing. We further replace the transposed convolution with a nearest-neighbor (NN) upsampler. An NN upsampler was previously shown to alleviate the tonal artifacts caused by transposed convolutions, and we found that it also reduces errors in the high-frequency spectrum compared to the transposed convolution. As in our hierarchical speech synthesizer, we use the MPD and MS-STFTD for high-quality audio synthesis. Additionally, we propose a DWT-based sub-band discriminator (DWTD) that decomposes the audio and reflects the features of each sub-band, as shown in Fig. 5. Fre-GAN [36] and Fre-GAN 2 [47] already utilize a DWT-based discriminator, replacing average pooling with a DWT for lossless downsampling. In this work, however, we additionally decompose the discriminator into sub-band discriminators, one per sub-band ([0 kHz, 12 kHz], [12 kHz, 24 kHz], [24 kHz, 36 kHz], [36 kHz, 48 kHz]), which improves reconstruction quality in the high-frequency bands.
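As a rough illustration, the following is a minimal PyTorch sketch (not the authors' implementation; the module name, kernel size, and channel counts are our assumptions) of the two ideas above: an NN upsampler used as a drop-in replacement for a transposed convolution, and a one-level Haar DWT that losslessly splits the waveform so it can be decomposed into the four sub-bands fed to per-band discriminators.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class NNUpsampler(nn.Module):
    """Nearest-neighbor upsampling followed by a convolution, used in
    place of ConvTranspose1d to avoid tonal (checkerboard) artifacts."""

    def __init__(self, channels: int, rate: int, kernel_size: int = 7):
        super().__init__()
        self.rate = rate
        self.conv = nn.Conv1d(channels, channels, kernel_size,
                              padding=kernel_size // 2)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, channels, time) -> (batch, channels, time * rate)
        x = F.interpolate(x, scale_factor=self.rate, mode="nearest")
        return self.conv(x)


def haar_dwt(x: torch.Tensor):
    """One-level Haar DWT along time: returns (low, high) sub-bands at
    half the sample rate. Lossless, unlike average pooling. Assumes an
    even number of samples."""
    lo = (x[..., 0::2] + x[..., 1::2]) / 2 ** 0.5
    hi = (x[..., 0::2] - x[..., 1::2]) / 2 ** 0.5
    return lo, hi


feats = torch.randn(1, 128, 100)                    # hidden features
up = NNUpsampler(channels=128, rate=2)(feats)       # (1, 128, 200)

wav = torch.randn(1, 1, 48000)                      # 1 s of 48 kHz audio
# Applying the DWT to both halves again yields four sub-bands,
# one per sub-band discriminator.
bands = [b for half in haar_dwt(wav) for b in haar_dwt(half)]
assert len(bands) == 4 and bands[0].shape[-1] == 12000
```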
3.4.1 Hierarchical Speech Synthesizer
The dual-audio acoustic encoder consists of a waveform audio encoder (wav encoder) and a linear-spectrogram encoder (spec encoder). The wav encoder consists of AMP blocks from BigVGAN and downsampling blocks that match the temporal resolution of the spectrogram and the wav2vec representation. We use downsampling rates of [8, 5, 4, 2] with kernel sizes of [17, 10, 8, 4] and hidden sizes of [16, 32, 64, 128, 192]. For the spec encoder, we utilize 16 layers of non-causal WaveNet with a hidden size of 192. The HAG consists of a source generator and a waveform generator. We replace multi-receptive field fusion (MRF) blocks with AMP blocks, which have a low-pass filter and the periodic Snake activation function for inductive bias and anti-aliasing. For the source generator, we utilize upsampling rates of [2, 2] with an initial channel size of 256. For the waveform generator, we utilize upsampling rates of [4, 5, 4, 2, 2] with an initial channel size of 512. For the discriminator, we utilize a multi-period discriminator (MPD) with periods of [2, 3, 5, 7, 11] and a multi-scale STFT-based discriminator (MS-STFTD) with five window sizes ([2048, 1024, 512, 256, 128]). In the source-filter encoder, the source, filter, and adaptive encoders each consist of eight layers of non-causal WaveNet with a hidden size of 192. BiT-Flow consists of four residual coupling layers, each comprising a pre-convolutional network (preConv), three Transformer blocks, and a post-convolutional network (postConv). In the Transformer blocks, we adopt convolutional networks with a kernel size of five to encode adjacent information, and AdaLN-Zero for better voice-style adaptation. We utilize a hidden size of 192, a filter size of 768, and two attention heads for the Transformer blocks, and a dropout rate of 0.1 for BiT-Flow. The style encoder consists of two spectral encoders with linear projections, two temporal encoders with 1D convolutional networks, and multi-head self-attention.
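To keep these hyperparameters in one place, here is a configuration sketch in Python. The field names and grouping are our own invention; only the values are transcribed from the paragraph above.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class HierSynthConfig:
    """Hyperparameters transcribed from the text; names are ours."""
    # Wav encoder (dual-audio acoustic encoder)
    wav_enc_down_rates: tuple = (8, 5, 4, 2)
    wav_enc_kernels: tuple = (17, 10, 8, 4)
    wav_enc_hidden: tuple = (16, 32, 64, 128, 192)
    # Spec encoder: non-causal WaveNet
    spec_enc_layers: int = 16
    spec_enc_hidden: int = 192
    # HAG: source and waveform generators (AMP blocks)
    src_gen_up_rates: tuple = (2, 2)
    src_gen_init_ch: int = 256
    wave_gen_up_rates: tuple = (4, 5, 4, 2, 2)
    wave_gen_init_ch: int = 512
    # Discriminators
    mpd_periods: tuple = (2, 3, 5, 7, 11)
    msstftd_windows: tuple = (2048, 1024, 512, 256, 128)
    # Source-filter encoder (source, filter, adaptive encoders)
    sf_enc_layers: int = 8
    sf_enc_hidden: int = 192
    # BiT-Flow (residual coupling layers with Transformer blocks)
    flow_coupling_layers: int = 4
    flow_transformer_blocks: int = 3
    flow_conv_kernel: int = 5
    flow_hidden: int = 192
    flow_filter: int = 768
    flow_heads: int = 2
    flow_dropout: float = 0.1


cfg = HierSynthConfig()
print(cfg.wave_gen_up_rates)  # (4, 5, 4, 2, 2)
```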
This paper is available on arXiv under a CC BY-NC-SA 4.0 DEED license.
Authors:
(1) Sang-Hoon Lee, Department of Artificial Intelligence, Korea University, Seoul 02841, South Korea;
(2) Ha-Yeong Choi, Department of Artificial Intelligence, Korea University, Seoul 02841, South Korea;
(3) Seung-Bin Kim, Department of Artificial Intelligence, Korea University, Seoul 02841, South Korea;
(4) Seong-Whan Lee, Fellow, IEEE, Department of Artificial Intelligence, Korea University, Seoul 02841, South Korea (corresponding author).