How to Voice Clone With SoftVC VITS and Bert-VITS2: A Deep Dive

Table of Contents

A Deep Dive into Voice Cloning with SoftVC VITS and Bert-VITS2
- Prepare Dataset
  - Extract From a Song
    - UVR Workflows
  - Preparation for Vocal Recording
    - Cheapskate’s Audio Equipment
    - Audacity Workflows
  - audio-slicer
    - Cleaning Dataset
    - Match Loudness
- so-vits-svc
  - Set Up the Environment
  - Initialization
    - Download pre-trained models
    - Dataset Preparation
    - Edit Configs
  - Training
  - Inference
  - Audio Editing
- so-vits-svc-fork
  - Installation
  - Preparation
  - Training
  - Inference
- DDSP-SVC
  - Preparation
  - Training
  - Inference
- Bert-vits2-V2.3
  - Initialization
  - Preparation
    - Transcription
  - Training and Inference
- vits-simple-api
  - Tweaks
  - Share Models

In the previous post, I tried a little bit of TTS Generation WebUI and found it interesting. So, I decided to train a usable model with my own voice.

This voice cloning project explores both SVC for voice changing and VITS for text-to-speech. No single tool does every job.

I have tested several tools for this project. Many of the good guides, like this, this, and this, are in Chinese. So, I thought it would be useful to post my notes in English.

Although so-vits-svc has been archived for a few months, probably due to oppression, it is still the tool that gives the best results.

Other related tools such as so-vits-svc-fork, so-vits-svc-5.0, DDSP-SVC, and RVC provide faster or lighter optimization, more features, or better interfaces.

But with enough time and resources, none of these alternatives can compete with the superior result generated by the original so-vits-svc.

For TTS, a new tool called Bert-VITS2 works fantastically and has already matured with its final release last month. It has some very different use cases, for example, audio content creation.

Prepare Dataset

The audio files of the dataset should be in WAV format, 44100 Hz, 16-bit, mono, and ideally add up to 1-2 hours.
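A quick way to confirm that every clip matches this format is to read the WAV headers. The sketch below uses only Python's standard wave module; the function names are my own, and it assumes plain PCM files (the wave module raises an error on other encodings such as 32-bit float).

```python
import wave
from pathlib import Path

# Target format from the text: 44100 Hz, 16-bit, mono.
TARGET_RATE = 44100
TARGET_SAMPLE_WIDTH = 2  # bytes per sample, i.e. 16-bit
TARGET_CHANNELS = 1


def check_wav(path):
    """Return a list of format problems for one WAV file (empty if it matches)."""
    problems = []
    with wave.open(str(path), "rb") as wav:
        if wav.getframerate() != TARGET_RATE:
            problems.append(f"sample rate {wav.getframerate()} Hz")
        if wav.getsampwidth() != TARGET_SAMPLE_WIDTH:
            problems.append(f"{wav.getsampwidth() * 8}-bit samples")
        if wav.getnchannels() != TARGET_CHANNELS:
            problems.append(f"{wav.getnchannels()} channels")
    return problems


def check_dataset(folder):
    """Map each non-conforming .wav file name in `folder` to its problems."""
    report = {}
    for path in sorted(Path(folder).glob("*.wav")):
        problems = check_wav(path)
        if problems:
            report[path.name] = problems
    return report
```

Calling check_dataset on the dataset folder returns an empty dict when every file already matches the target format.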

Extract From a Song

Ultimate Vocal Remover is the easiest tool for this job. There is a thread that explains everything in detail.

UVR Workflows

- Remove and extract the instrumental.
  - Model: VR - UVR(4_HP-Vocal-UVR)
  - Settings: 512 - 10 - GPU
  - Output: instrumental and unclean vocal
- Remove and extract background vocals.
  - Model: VR - UVR(5_HP-Karaoke-UVR)
  - Settings: 512 - 10 - GPU
  - Output: background vocal and unclean main vocal
- Remove reverb and noise.
  - Model: VR - UVR-DeEcho-DeReverb & UVR-DeNoise
  - Settings: 512 - 10 - GPU - No Other Only
  - Output: clean main vocal
- (Optional) Use RipX (non-free) to perform a manual fine-cleaning.

Preparation for Vocal Recording

It’s better to record in a treated room with a condenser microphone. Otherwise, use a directional or dynamic microphone to reduce noise.

Cheapskate’s Audio Equipment

The very first time I got into music was during high school, with the blue Sennheiser MX500 and Koss Porta Pro. I still remember recording my first song on a Sony VAIO with Cool Edit Pro.

Nowadays, I still resist spending a lot of money on audio hardware as an amateur because it is literally a money-sucking black hole.

Nonetheless, I really appreciate the reliability of this cheap production equipment.

The core part of my setup is a Behringer UCA202, and it’s perfect for my use cases. I bought it for $10 during a price drop.

It is a so-called “Audio Interface” but basically just a sound card with multiple ports. I used RCA to 3.5mm TRS cables for my headphones, a semi-open K240s for regular output, and a closed-back HD669/MDR7506 for monitor output.

All three mentioned headphones are under $100 for the normal price. And there are clones from Samson, Tascam, Knox Gear, and more out there for less than $50.

For the input device, I’m using a dynamic microphone because of the noise in my environment. It is an SM58 copy (Pyle) plus a Tascam DR-05 recorder (as an amplifier). Other clones such as the SL84c or wm58 would work too.

I use an XLR to 3.5mm TRS cable to connect the microphone to the MIC/External-input of the recorder, and then use an AUX cable to connect between the line-out of the recorder and the input of the UCA202.

It’s not recommended to buy an “audio interface” and a dedicated amplifier to replicate my setup. A $10 c-media USB sound card should be good enough. The Syba model that I owned is capable of “pre-amping” dynamic microphones directly and even some lower-end phantom-powered microphones.

The setup can go extremely cheap ($40~60), but with the UCA202 and DR-05, the sound is much cleaner. And I really like the physical controls, versatility, and portability of my good old digital recorder.

Audacity Workflows

When I was getting paid as a designer, I was pretty happy with Audition. But for personal use on a fun project, Audacity is the way to avoid the chaotic evil of Adobe.

audio-slicer

Use audio-slicer or audio-slicer (gui) to slice the audio file into small pieces for later use.

The default setting works great.

Cleaning Dataset

Remove the very short clips and re-slice any that are still over 10 seconds.

For a large dataset, remove every clip shorter than 4 seconds. For a small dataset, remove only those under 2 seconds.

If necessary, perform a manual inspection for every single file.
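The length rules above can be checked automatically before any manual inspection. This is a minimal sketch using the standard wave module; the function names and the default thresholds (2 and 10 seconds, taken from the text) are illustrative, and it only lists files rather than deleting anything.

```python
import wave
from pathlib import Path


def clip_seconds(path):
    """Duration of a PCM WAV file in seconds."""
    with wave.open(str(path), "rb") as wav:
        return wav.getnframes() / wav.getframerate()


def sort_clips(folder, min_sec=2.0, max_sec=10.0):
    """Split clips into too-short (remove) and too-long (re-slice) lists."""
    too_short, too_long = [], []
    for path in sorted(Path(folder).glob("*.wav")):
        seconds = clip_seconds(path)
        if seconds < min_sec:
            too_short.append(path.name)
        elif seconds > max_sec:
            too_long.append(path.name)
    return too_short, too_long
```

For a large dataset, pass min_sec=4.0 to apply the stricter rule.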

Match Loudness

Use Audacity again with Loudness Normalization; 0 dB should do it.

so-vits-svc

Set Up the Environment

A virtual environment is essential for running multiple Python tools on one system. I used to use VMs and Docker, but I have found that Anaconda is much quicker and handier than either.

Create a new environment for so-vits-svc, and activate it.

conda create -n so-vits-svc python=3.8
conda activate so-vits-svc

Then, install requirements.

git clone https://github.com/svc-develop-team/so-vits-svc
cd so-vits-svc

pip3 install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cu118

#for linux
pip install -r requirements.txt

#for windows
pip install -r requirements_win.txt
pip install --upgrade fastapi==0.84.0
pip install --upgrade gradio==3.41.2
pip install --upgrade pydantic==1.10.12
pip install fastapi uvicorn

Initialization

Download pre-trained models

pretrain
- wget https://huggingface.co/WitchHuntTV/checkpoint_best_legacy_500.pt/resolve/main/checkpoint_best_legacy_500.pt
- wget https://huggingface.co/lj1995/VoiceConversionWebUI/resolve/main/rmvpe.pt
logs/44k
- wget https://huggingface.co/datasets/ms903/sovits4.0-768vec-layer12/resolve/main/sovits_768l12_pre_large_320k/clean_D_320000.pth
- wget https://huggingface.co/datasets/ms903/sovits4.0-768vec-layer12/resolve/main/sovits_768l12_pre_large_320k/clean_G_320000.pth
logs/44k/diffusion
- wget https://huggingface.co/datasets/ms903/Diff-SVC-refactor-pre-trained-model/resolve/main/fix_pitch_add_vctk_600k/model_0.pt
- (Alternative) wget https://huggingface.co/datasets/ms903/DDSP-SVC-4.0/resolve/main/pre-trained-model/model_0.pt
- (Alternative) wget https://huggingface.co/datasets/ms903/Diff-SVC-refactor-pre-trained-model/resolve/main/hubertsoft_fix_pitch_add_vctk_500k/model_0.pt
pretrain/nsf_hifigan
- wget -P pretrain/ https://github.com/openvpi/vocoders/releases/download/nsf-hifigan-v1/nsf_hifigan_20221211.zip
- unzip -od pretrain/nsf_hifigan pretrain/nsf_hifigan_20221211.zip
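Since a missing checkpoint tends to surface only as a confusing error later, it may help to verify the layout first. The sketch below checks for the files from the download list; the helper name is hypothetical and the file names are kept exactly as downloaded, which is an assumption (adjust the list if you rename any checkpoint).

```python
from pathlib import Path

# Relative paths follow the download list above. File names are kept as
# downloaded (an assumption); edit this list if you rename anything.
EXPECTED_FILES = [
    "pretrain/checkpoint_best_legacy_500.pt",
    "pretrain/rmvpe.pt",
    "logs/44k/clean_D_320000.pth",
    "logs/44k/clean_G_320000.pth",
    "logs/44k/diffusion/model_0.pt",
]


def missing_files(repo_root, expected=EXPECTED_FILES):
    """Return the expected relative paths that do not exist under repo_root."""
    root = Path(repo_root)
    return [rel for rel in expected if not (root / rel).is_file()]
```

Running missing_files("so-vits-svc") from the parent directory returns an empty list once everything is in place.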

Dataset Preparation

Put all prepared .wav audio files into dataset_raw/character

cd so-vits-svc
python resample.py --skip_loudnorm
python preprocess_flist_config.py --speech_encoder vec768l12 --vol_aug
python preprocess_hubert_f0.py --use_diff

Edit Configs

The file is located at configs/config.json

- log_interval: how often the log is printed
- eval_interval: how often checkpoints are saved
- epochs: total number of training epochs
- keep_ckpts: number of checkpoints to keep, 0 for unlimited
- half_type: fp32 in my case
- batch_size: the smaller the faster (rougher), the larger the slower (better)

Recommended batch_size per VRAM: 4 for 6 GB; 6 for 8 GB; 10 for 12 GB; 14 for 16 GB; 20 for 24 GB.
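The VRAM tiers above can be turned into a small helper that picks a value and writes it into the config. This is only a sketch: the function names are my own, and it assumes batch_size sits under a "train" key in configs/config.json, so check your own file before relying on it.

```python
import json

# (minimum VRAM in GB, batch_size) pairs taken from the recommendation above.
VRAM_TO_BATCH_SIZE = [(6, 4), (8, 6), (12, 10), (16, 14), (24, 20)]


def recommend_batch_size(vram_gb):
    """Largest listed batch_size whose VRAM tier fits within vram_gb."""
    best = None
    for min_vram, batch_size in VRAM_TO_BATCH_SIZE:
        if vram_gb >= min_vram:
            best = batch_size
    if best is None:
        raise ValueError(f"{vram_gb} GB is below the smallest listed tier (6 GB)")
    return best


def set_batch_size(config_path, vram_gb):
    """Write the recommended batch_size into the config file and return it.

    Assumption: batch_size is stored under the "train" key.
    """
    with open(config_path, encoding="utf-8") as f:
        config = json.load(f)
    config["train"]["batch_size"] = recommend_batch_size(vram_gb)
    with open(config_path, "w", encoding="utf-8") as f:
        json.dump(config, f, indent=2)
    return config["train"]["batch_size"]
```

A card that falls between two tiers gets the lower tier's value, which errs on the side of not running out of memory.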

Keep default for configs/diffusion.yaml

Training

python cluster/train_cluster.py --gpu
python train_index.py -c configs/config.json
python train.py -c configs/config.json -m 44k
python train_diff.py -c configs/diffusion.yaml

On training steps:

Use train.py to train the main model. Usually 20k-30k steps gives a usable model, and 50k steps or more is good enough. This can take a few days depending on GPU speed.

Feel free to stop it with Ctrl+C; training resumes whenever you re-run python train.py -c configs/config.json -m 44k.

Use train_diff.py to train the diffusion model; the recommended number of training steps is about 1/3 of the main model’s.

Be aware of over-training. Use tensorboard --logdir=./logs/44k to monitor the plots to see if it goes flat.

Change the learning rate from 0.0001 to 0.00005 if necessary.

When done, share/transport these files for inference.

configs/
- config.json
- diffusion.yaml
logs/44k
- feature_and_index.pkl
- kmeans_10000.pt
- model_0.pt
- G_xxxxx.pth

Inference

It’s time to try out the trained model. I prefer the WebUI because it makes tweaking the parameters convenient.

But before firing it up, edit the following lines in webUI.py for LAN access:

os.system("start http://localhost:7860")
app.launch(server_name="0.0.0.0", server_port=7860)

Run python webUI.py, then open the machine’s IP address on port 7860 (ipaddress:7860) in a web browser.

The webui has no English localization, but Immersive Translate would be helpful.

Most parameters would work well with the default value. Refer to this and this to make changes.

Upload these 5 files:

- The main model (G_xxxxx.pth) and its config.json
- The diffusion model (model_0.pt) and its diffusion.yaml
- Either the cluster model kmeans_10000.pt for speaking or the feature retrieval index feature_and_index.pkl for singing

The F0 predictor is for speaking only, not for singing. RMVPE is recommended when using it.

Pitch change is useful when singing a feminine song using a model with a masculine voice, or vice versa.
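To get a feel for the numbers, it helps to see how a shift maps to frequency. The sketch below assumes the pitch change value is counted in semitones (so 12 is one octave up) and uses the standard equal-temperament relation; the function names are my own.

```python
def semitone_ratio(semitones):
    """Frequency multiplier for a shift of `semitones` (equal temperament)."""
    return 2.0 ** (semitones / 12.0)


def shift_frequency(hz, semitones):
    """Frequency in Hz after transposing `hz` by `semitones`."""
    return hz * semitone_ratio(semitones)
```

For example, shifting a 220 Hz note up by 12 semitones doubles it to 440 Hz, and a shift of -12 halves it.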

Clustering model/feature retrieval mixing ratio is the way of controlling the tone. Use 0.1 to get the clearest speech, and use 0.9 to get the closest tone to the model.
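A rough mental model for this ratio, assuming it acts as a plain linear blend between the content features of the input and the nearest features learned from the training voice, is sketched below. This is an illustration of the idea with a hypothetical function on plain lists, not the project's actual inference code.

```python
def mix_features(content, cluster_center, ratio):
    """Blend one content feature vector with its matching cluster center.

    ratio=0.0 keeps the original content features (clearest speech);
    ratio=1.0 uses only the cluster center (closest to the model's tone).
    """
    if not 0.0 <= ratio <= 1.0:
        raise ValueError("ratio must be between 0 and 1")
    if len(content) != len(cluster_center):
        raise ValueError("vectors must have the same length")
    return [ratio * k + (1.0 - ratio) * c for c, k in zip(content, cluster_center)]
```

Under this assumption, a low ratio preserves more of what was actually said, while a high ratio pulls the output toward the trained voice at the cost of articulation.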

Shallow diffusion steps should be set to around 50; it enhances the result at 30-100 steps.

Audio Editing

This step is optional and only needed to produce a better-sounding song.

I won’t go into details about this since the audio editing software, or so-called DAW (digital audio workstation), that I’m using is non-free. I have no intention to advocate proprietary software even though the entire industry is paywalled and closed-source.

Audacity supports multitrack, effects, and a lot more. It does load some advanced VST plugins as well.

It’s not hard to find tutorials on mastering songs with Audacity.

Typically, the mastering process consists of mixing/balancing, EQ/compressing, reverb, and imaging. The more advanced the tool is, the easier the process will be.

I’ll definitely spend more time on adopting Audacity for my mastering process in the future, and I recommend everyone do so.

so-vits-svc-fork

This is a so-vits-svc fork with real-time support, and its models are compatible with the original. It is easier to use but does not support the diffusion model. For dedicated real-time voice changing, a voice-changer is recommended.

Installation

conda create -n so-vits-svc-fork python=3.10 pip
conda activate so-vits-svc-fork

git clone https://github.com/voicepaw/so-vits-svc-fork
cd so-vits-svc-fork

python -m pip install -U pip setuptools wheel
pip install -U torch torchaudio --index-url https://download.pytorch.org/whl/cu118
pip install -U so-vits-svc-fork
pip install click
sudo apt-get install libportaudio2

Preparation

Put dataset .wav files into so-vits-svc-fork/dataset_raw

svc pre-resample
svc pre-config

Edit batch_size in configs/44k/config.json. The default in this fork is larger than in the original.

Training

svc pre-hubert
svc train -t
svc train-cluster

Inference

Use GUI with svcg. This requires a local desktop environment.

Or use the CLI: svc vc for real-time conversion, and svc infer -m "logs/44k/xxxxx.pth" -c "configs/config.json" raw/xxx.wav for generating a file.
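When there are many input files, the svc infer command can be repeated from a small script. The sketch below only uses the flags shown above (-m and -c plus the input path); the function names are hypothetical, and it assumes svc is on the PATH of the active environment.

```python
import subprocess
from pathlib import Path


def build_infer_commands(input_dir, model_path, config_path):
    """One `svc infer` argument list per .wav file found in input_dir."""
    return [
        ["svc", "infer", "-m", str(model_path), "-c", str(config_path), str(wav)]
        for wav in sorted(Path(input_dir).glob("*.wav"))
    ]


def run_all(commands):
    """Run each command in turn, stopping at the first failure."""
    for command in commands:
        subprocess.run(command, check=True)
```

Building the commands separately from running them makes it easy to print and review the list before starting a long batch.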

DDSP-SVC

DDSP-SVC requires fewer hardware resources and runs faster than so-vits-svc. It supports both real-time and diffusion models (Diff-SVC).

conda create -n DDSP-SVC python=3.8
conda activate DDSP-SVC

git clone https://github.com/yxlllc/DDSP-SVC
cd DDSP-SVC

pip3 install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cu118
pip install -r requirements.txt

Refer to the Initialization section for the two files:

pretrain/rmvpe/model.pt
pretrain/contentvec/checkpoint_best_legacy_500.pt

Preparation

python draw.py
python preprocess.py -c configs/combsub.yaml
python preprocess.py -c configs/diffusion-new.yaml

Edit configs/

batch_size: 32  (16 for diffusion)
cache_all_data: false
cache_device: 'cuda'
cache_fp16: false

Training

conda activate DDSP-SVC
python train.py -c configs/combsub.yaml
python train_diff.py -c configs/diffusion-new.yaml

tensorboard --logdir=exp

Inference

It’s recommended to use main_diff.py since it includes both the DDSP and diffusion models.

python main_diff.py -i "input.wav" -diff "model_xxxxxx.pt" -o "output.wav"

Real-time GUI for voice cloning:

python gui_diff.py

Bert-vits2-V2.3

This is a TTS tool that is completely different from everything above. By using it, I have already created several audiobooks with my voice for my parents, and they really enjoy it.

Instead of using the original, I used the fork by v3u for an easier setup.

Initialization

conda create -n bert-vits2 python=3.9
conda activate bert-vits2

git clone https://github.com/v3ucn/Bert-vits2-V2.3.git
cd Bert-vits2-V2.3

pip install torch==2.0.1+cu118 torchvision==0.15.2+cu118 torchaudio==2.0.2+cu118 --index-url https://download.pytorch.org/whl/cu118
pip install -r requirements.txt

Download pre-trained models (includes Chinese, Japanese, and English):

wget -P slm/wavlm-base-plus/ https://huggingface.co/microsoft/wavlm-base-plus/resolve/main/pytorch_model.bin
wget -P emotional/clap-htsat-fused/ https://huggingface.co/laion/clap-htsat-fused/resolve/main/pytorch_model.bin
wget -P emotional/wav2vec2-large-robust-12-ft-emotion-msp-dim/ https://huggingface.co/audeering/wav2vec2-large-robust-12-ft-emotion-msp-dim/resolve/main/pytorch_model.bin
wget -P bert/chinese-roberta-wwm-ext-large/ https://huggingface.co/hfl/chinese-roberta-wwm-ext-large/resolve/main/pytorch_model.bin
wget -P bert/bert-base-japanese-v3/ https://huggingface.co/cl-tohoku/bert-base-japanese-v3/resolve/main/pytorch_model.bin
wget -P bert/deberta-v3-large/ https://huggingface.co/microsoft/deberta-v3-large/resolve/main/pytorch_model.bin
wget -P bert/deberta-v3-large/ https://huggingface.co/microsoft/deberta-v3-large/resolve/main/pytorch_model.generator.bin
wget -P bert/deberta-v2-large-japanese/ https://huggingface.co/ku-nlp/deberta-v2-large-japanese/resolve/main/pytorch_model.bin

Create a character model folder: mkdir -p Data/xxx/models/

Download base models:

wget -P Data/xxx/models/ https://huggingface.co/OedoSoldier/Bert-VITS2-2.3/resolve/main/DUR_0.pth
wget -P Data/xxx/models/ https://huggingface.co/OedoSoldier/Bert-VITS2-2.3/resolve/main/D_0.pth
wget -P Data/xxx/models/ https://huggingface.co/OedoSoldier/Bert-VITS2-2.3/resolve/main/G_0.pth
wget -P Data/xxx/models/ https://huggingface.co/OedoSoldier/Bert-VITS2-2.3/resolve/main/WD_0.pth

#More options
https://openi.pcl.ac.cn/Stardust_minus/Bert-VITS2/modelmanage/model_filelist_tmpl?name=Bert-VITS2_2.3%E5%BA%95%E6%A8%A1
https://huggingface.co/Erythrocyte/bert-vits2_base_model/tree/main
https://huggingface.co/OedoSoldier/Bert-VITS2-2.3/tree/main

Edit train_ms.py by replacing every bfloat16 with float16.
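The replacement can be done by hand in an editor, or scripted as in the sketch below. The function name is my own, and the .bak backup is an extra precaution rather than something the project requires.

```python
import re
from pathlib import Path


def replace_bfloat16(path):
    """Replace every whole-word `bfloat16` with `float16` in the given file.

    Keeps the original as <name>.bak and returns the number of replacements.
    """
    source = Path(path)
    text = source.read_text(encoding="utf-8")
    new_text, count = re.subn(r"\bbfloat16\b", "float16", text)
    if count:
        # Only touch the file (and write a backup) when something changed.
        source.with_name(source.name + ".bak").write_text(text, encoding="utf-8")
        source.write_text(new_text, encoding="utf-8")
    return count
```

Running replace_bfloat16("train_ms.py") a second time returns 0, so the script is safe to repeat.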

Edit webui.py for LAN access:

webbrowser.open("http://localhost:7860")
app.launch(server_name="0.0.0.0", server_port=7860)

Edit Data/xxx/config.json for batch_size and spk2id

Preparation

Similar workflow as in the previous section.

Remove noise and silence, normalize the loudness, then put the un-sliced WAV file into Data/xxx/raw.

Edit config.yml for dataset_path, num_workers, and keep_ckpts.

Run python3 audio_slicer.py to slice the WAV file.

Clean the dataset (Data/xxx/raw) by removing small files that are under 2 sec.

Transcription

Install Whisper: pip install git+https://github.com/openai/whisper.git

To turn off language auto-detection, force English, and use the large model, edit short_audio_transcribe.py as below:

    # set the spoken language to english
    print('language: en')
    lang = 'en'
    options = whisper.DecodingOptions(language='en')
    result = whisper.decode(model, mel, options)
	
    # set to use large model
    parser.add_argument("--whisper_size", default="large")

    #Solve error "Given groups=1, weight of size [1280, 128, 3], expected input[1, 80, 3000] to have 128 channels, but got 80 channels instead" while using large model
    mel = whisper.log_mel_spectrogram(audio,n_mels = 128).to(model.device)

Run python3 short_audio_transcribe.py to start the transcription.

Re-sample the sliced dataset (zizek is my character folder, the xxx used above): python3 resample.py --sr 44100 --in_dir ./Data/zizek/raw/ --out_dir ./Data/zizek/wavs/

Preprocess transcription: python3 preprocess_text.py --transcription-path ./Data/zizek/esd.list

Generate BERT features: python3 bert_gen.py --config-path ./Data/zizek/configs/config.json
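Malformed lines in the transcription list are a common reason for the preprocessing steps to fail, so a quick sanity check of esd.list can save time. The sketch below assumes each line has four fields separated by "|" in the order wav path, speaker, language, text; that layout is my understanding of the format rather than something stated in this post, and the function names are hypothetical.

```python
def check_transcription_line(line):
    """Return None if the line looks valid, else a short problem description.

    Assumed layout: wav_path|speaker|language|text
    """
    parts = line.rstrip("\n").split("|")
    if len(parts) != 4:
        return f"expected 4 fields, found {len(parts)}"
    names = ("wav_path", "speaker", "language", "text")
    for name, value in zip(names, parts):
        if not value.strip():
            return f"empty {name}"
    return None


def check_transcription_file(path):
    """Map 1-based line numbers to problems for every bad line in the file."""
    problems = {}
    with open(path, encoding="utf-8") as f:
        for number, line in enumerate(f, start=1):
            if not line.strip():
                continue  # ignore blank lines
            problem = check_transcription_line(line)
            if problem:
                problems[number] = problem
    return problems
```

Note that a transcript containing a literal "|" would be reported as having too many fields, which is usually worth fixing anyway.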

Training and Inference

Run python3 train_ms.py to start training

Edit config.yml for model path:

model: "models/G_20900.pth"

Run python3 webui.py to start webui for inference

vits-simple-api

vits-simple-api is a web frontend for using trained models. I use this mainly for its long text support which the original project doesn’t have.

git clone https://github.com/Artrajz/vits-simple-api
cd vits-simple-api
git pull  # only needed to update an existing clone

conda create -n vits-simple-api python=3.10 pip
conda activate vits-simple-api

pip install -r requirements.txt

(Optional) Copy pre-trained model files from Bert-vits2-V2.3/ to vits-simple-api/bert_vits2/

Copy Bert-vits2-V2.3/Data/xxx/models/G_xxxxx.pth and Bert-vits2-V2.3/Data/xxx/config.json to vits-simple-api/Model/xxx/

Edit config.py for MODEL_LIST and Default parameter as preferred

Edit Model/xxx/config.json as below:

  "data": {
    "training_files": "Data/train.list",
    "validation_files": "Data/val.list",
	
  "version": "2.3"

Check/Edit model_list in config.yml as [xxx/G_xxxxx.pth, xxx/config.json]

Run python app.py

Tweaks

SDP Ratio for tone, Noise for randomness, Noise_W for pronunciation, and Length for speed. Emotion and Style are self-explanatory.
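The same tweaks can also be passed when calling the server from a script instead of the web page. The sketch below only builds a request URL; the endpoint path /voice/bert-vits2, the default port 23456, and the parameter names (text, id, sdp_ratio, noise, noisew, length) are assumptions based on my recollection of the project's README, so check them against your installed version before use.

```python
from urllib.parse import urlencode


def build_tts_url(text, base_url="http://127.0.0.1:23456", speaker_id=0,
                  sdp_ratio=None, noise=None, noisew=None, length=None):
    """Build a GET URL for a TTS request; unset tweaks are left to the server.

    Endpoint path, port, and parameter names are assumptions (see lead-in).
    """
    params = {"text": text, "id": speaker_id}
    optional = {"sdp_ratio": sdp_ratio, "noise": noise,
                "noisew": noisew, "length": length}
    params.update({key: value for key, value in optional.items()
                   if value is not None})
    return f"{base_url}/voice/bert-vits2?{urlencode(params)}"
```

Leaving a tweak as None omits it from the URL, so the server falls back to the defaults set in its own configuration.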

Share Models

In its Hugging Face repo, there are a lot of VITS models shared by others. You can try them out first, and then download the ones you want from Files.

The Genshin model is widely used in some content creation communities because of its high quality. It contains hundreds of characters, although only Chinese and Japanese are supported.

In another repo, there are a lot of Bert-vits2 models that are made from popular Chinese streamers and VTubers.

There are already projects making AI Vtubers like this and this. I’m looking forward to how this technology can change the industry in the near future.

https://techshinobi.org/posts/voice-vits/