In the previous post, I tried a little bit of TTS Generation WebUI and found it interesting. So, I decided to train a usable model with my own voice.
This voice cloning project explores both SVC for voice changing and VITS for text-to-speech. No single tool does every job.
I have tested several tools for this project. Many of the good guides, like this, this, and this, are in Chinese. So, I thought it would be useful to post my notes in English.
Although so-vits-svc has been archived for a few months, probably due to oppression, it is still the tool that gives the best results.
Other related tools such as so-vits-svc-fork, so-vits-svc-5.0, DDSP-SVC, and RVC provide either faster/lighter optimization, more features, or better interfaces.
But with enough time and resources, none of these alternatives can compete with the superior result generated by the original so-vits-svc.
For TTS, a new tool called Bert-VITS2 works fantastically and has already matured with its final release last month. It has some very different use cases, for example, audio content creation.
The audio files of the dataset should be in WAV format, 44100 Hz, 16-bit, mono, ideally 1-2 hours in total.
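To double-check a file against that format before training, a small script along these lines might help. It is only a sketch using Python's standard wave module; the function name check_wav and the wording of the messages are hypothetical and not part of any tool mentioned here.

```python
import wave

# Target format from the text: 44100 Hz, 16-bit, mono.
TARGET_RATE = 44100
TARGET_SAMPLE_WIDTH = 2  # bytes per sample, i.e. 16-bit
TARGET_CHANNELS = 1


def check_wav(path):
    """Return a list of problems found in the WAV file (empty if it matches)."""
    problems = []
    with wave.open(path, "rb") as wav:
        if wav.getframerate() != TARGET_RATE:
            problems.append(f"sample rate is {wav.getframerate()} Hz")
        if wav.getsampwidth() != TARGET_SAMPLE_WIDTH:
            problems.append(f"sample width is {wav.getsampwidth() * 8}-bit")
        if wav.getnchannels() != TARGET_CHANNELS:
            problems.append(f"{wav.getnchannels()} channels")
    return problems
```

An empty list means the file already matches; anything else is a hint to re-export it from the audio editor.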
Ultimate Vocal Remover is the easiest tool for this job. There is a thread that explains everything in detail.
Model: VR - UVR(4_HP-Vocal-UVR)
Settings: 512 - 10 - GPU
Output Instrumental and unclean vocal
Model: VR - UVR(5_HP-Karaoke-UVR)
Settings: 512 - 10 - GPU
Output background vocal and unclean main vocal
Model: VR - UVR-DeEcho-DeReverb & UVR-DeNoise
Settings: 512 - 10 - GPU - No Other Only
Output clean main vocal
It’s better to record in a treated room with a condenser microphone; otherwise, use a directional or dynamic microphone to reduce noise.
The very first time I got into music was during high school, with the blue Sennheiser MX500 and Koss Porta Pro. I still remember recording my first song on a Sony VAIO with Cool Edit Pro.
Nowadays, I still resist spending a lot of money on audio hardware as an amateur because it is literally a money-sucking black hole.
Nonetheless, I really appreciate the reliability of cheap production equipment.
The core part of my setup is a Behringer UCA202, and it’s perfect for my use cases. I bought it for $10 during a price drop.
It is a so-called “Audio Interface” but basically just a sound card with multiple ports. I used RCA to 3.5mm TRS cables for my headphones, a semi-open K240s for regular output, and a closed-back HD669/MDR7506 for monitor output.
All three mentioned headphones are under $100 for the normal price. And there are clones from Samson, Tascam, Knox Gear, and more out there for less than $50.
For the input device, I’m using a dynamic microphone for the sake of my environmental noises. It is an SM58 copy (Pyle) + a Tascam DR-05 recorder (as an amplifier). Other clones such as SL84c or wm58 would do it too.
I use an XLR to 3.5mm TRS cable to connect the microphone to the MIC/External-input of the recorder, and then use an AUX cable to connect between the line-out of the recorder and the input of the UCA202.
It’s not recommended to buy an “audio interface” and a dedicated amplifier to replicate my setup. A $10 c-media USB sound card should be good enough. The Syba model that I owned is capable of “pre-amping” dynamic microphones directly and even some lower-end phantom-powered microphones.
The setup can go extremely cheap ($40~60), but with the UCA202 and DR-05, the sound is much cleaner. And I really like the physical controls, versatility, and portability of my good old digital recorder.
When I was getting paid as a designer, I was pretty happy with Audition. But for personal use on a fun project, Audacity is the way to avoid the chaotic evil of Adobe.
Use audio-slicer or audio-slicer (gui) to slice the audio file into small pieces for later use.
The default setting works great.
Remove those very short ones and re-slice those that are still over 10 seconds.
For a large dataset, remove everything shorter than 4 seconds. For a small dataset, remove only those under 2 seconds.
If necessary, perform a manual inspection for every single file.
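The cleanup rules above can be sketched with Python's standard wave module. This is only an illustration: the function names and the default thresholds (2 and 10 seconds, taken from the text) are assumptions, and nothing is deleted; the script only sorts the paths into three lists.

```python
import wave
from pathlib import Path


def wav_seconds(path):
    """Duration of a WAV file in seconds."""
    with wave.open(str(path), "rb") as wav:
        return wav.getnframes() / wav.getframerate()


def sort_slices(folder, min_seconds=2.0, max_seconds=10.0):
    """Split slices into (too_short, too_long, keep) lists of paths."""
    too_short, too_long, keep = [], [], []
    for path in sorted(Path(folder).glob("*.wav")):
        seconds = wav_seconds(path)
        if seconds < min_seconds:
            too_short.append(path)   # candidates for removal
        elif seconds > max_seconds:
            too_long.append(path)    # candidates for re-slicing
        else:
            keep.append(path)
    return too_short, too_long, keep
```

For a large dataset, calling sort_slices(folder, min_seconds=4.0) would apply the stricter 4-second rule.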
Use Audacity again with Loudness Normalization; 0 dB should do it.
A virtual environment is essential for running multiple Python tools on one system. I used to use VMs and Docker, but I have found that Anaconda is way quicker and handier than the others.
Create a new environment for so-vits-svc, and activate it.
conda create -n so-vits-svc python=3.8
conda activate so-vits-svc
Then, install requirements.
git clone https://github.com/svc-develop-team/so-vits-svc
cd so-vits-svc
pip3 install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cu118
#for linux
pip install -r requirements.txt
#for windows
pip install -r requirements_win.txt
pip install --upgrade fastapi==0.84.0
pip install --upgrade gradio==3.41.2
pip install --upgrade pydantic==1.10.12
pip install fastapi uvicorn
wget https://huggingface.co/WitchHuntTV/checkpoint_best_legacy_500.pt/resolve/main/checkpoint_best_legacy_500.pt
wget https://huggingface.co/lj1995/VoiceConversionWebUI/resolve/main/rmvpe.pt
wget https://huggingface.co/datasets/ms903/sovits4.0-768vec-layer12/resolve/main/sovits_768l12_pre_large_320k/clean_D_320000.pth
wget https://huggingface.co/datasets/ms903/sovits4.0-768vec-layer12/resolve/main/sovits_768l12_pre_large_320k/clean_G_320000.pth
wget https://huggingface.co/datasets/ms903/Diff-SVC-refactor-pre-trained-model/resolve/main/fix_pitch_add_vctk_600k/model_0.pt
(Alternative) wget https://huggingface.co/datasets/ms903/DDSP-SVC-4.0/resolve/main/pre-trained-model/model_0.pt
(Alternative) wget https://huggingface.co/datasets/ms903/Diff-SVC-refactor-pre-trained-model/resolve/main/hubertsoft_fix_pitch_add_vctk_500k/model_0.pt
wget -P pretrain/ https://github.com/openvpi/vocoders/releases/download/nsf-hifigan-v1/nsf_hifigan_20221211.zip
unzip -od pretrain/nsf_hifigan pretrain/nsf_hifigan_20221211.zip
Put all Prepared audio.wav files into dataset_raw/character
cd so-vits-svc
python resample.py --skip_loudnorm
python preprocess_flist_config.py --speech_encoder vec768l12 --vol_aug
python preprocess_hubert_f0.py --use_diff
The file is located at configs/config.json
log_interval: the frequency of printing logs
eval_interval: the frequency of saving checkpoints
epochs: total steps
keep_ckpts: number of saved checkpoints, 0 for unlimited
half_type: fp32 in my case
batch_size: the smaller the faster (rougher), the larger the slower (better)
Recommended batch_size per VRAM: 4=6G;6=8G;10=12G;14=16G;20=24G
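The VRAM recommendation above can be turned into a tiny lookup. This is just a sketch: the pairs come straight from the line above, while the function name and the choice to return None below 6 GB are assumptions made for illustration.

```python
# Pairs taken from the recommendation above: (VRAM in GB, batch_size).
RECOMMENDED = [(6, 4), (8, 6), (12, 10), (16, 14), (24, 20)]


def recommended_batch_size(vram_gb):
    """Return the largest recommended batch_size that fits in vram_gb, or None."""
    best = None
    for min_vram, batch_size in RECOMMENDED:
        if vram_gb >= min_vram:
            best = batch_size
    return best


for vram in (6, 10, 12, 24):
    print(f"{vram} GB -> batch_size {recommended_batch_size(vram)}")
```

A card that falls between two tiers, such as 10 GB, gets the value of the lower tier, which errs on the side of not running out of memory.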
Keep default for configs/diffusion.yaml
python cluster/train_cluster.py --gpu
python train_index.py -c configs/config.json
python train.py -c configs/config.json -m 44k
python train_diff.py -c configs/diffusion.yaml
On training steps:
Use train.py to train the main model; usually, 20k-30k steps would be usable, and 50k and up would be good enough. This can take a few days depending on the GPU speed.
Feel free to stop it with Ctrl+C; it will continue training when you re-run python train.py -c configs/config.json -m 44k at any time.
Use train_diff.py to train the diffusion model; the recommended number of training steps is 1/3 of the main model's.
Be aware of over-training. Use tensorboard --logdir=./logs/44k to monitor the plots and see if they go flat.
Change the learning rate from 0.0001 to 0.00005 if necessary.
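The two rules of thumb above (diffusion steps at about 1/3 of the main model, and watching for a flat curve) can be written down as a rough sketch. The function names, the window of 5 points, and the 1% tolerance are hypothetical choices for illustration; the loss values would have to be copied by hand from the TensorBoard plots.

```python
def diffusion_steps(main_steps):
    """Rule of thumb from the text: about one third of the main model's steps."""
    return main_steps // 3


def looks_flat(losses, window=5, tolerance=0.01):
    """Return True if the mean of the last `window` losses improved by less
    than `tolerance` (relative) over the mean of the `window` before it."""
    if len(losses) < 2 * window:
        return False  # not enough history to judge
    previous = sum(losses[-2 * window:-window]) / window
    recent = sum(losses[-window:]) / window
    if previous == 0:
        return True
    return (previous - recent) / abs(previous) < tolerance


print(diffusion_steps(30000))
```

If looks_flat returns True for a while, that may be the moment to stop training or to try the lower learning rate.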
When done, share/transport these files for inference.
config.json
diffusion.yaml
It’s time to try out the trained model. I’d prefer Webui for the convenience of tweaking the parameters.
But before firing it up, edit the following lines in webUI.py for LAN access:
os.system("start http://localhost:7860")
app.launch(server_name="0.0.0.0", server_port=7860)
Run python webUI.py; then access its ipaddress:7860 from a web browser.
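To find the address to type into the browser on another machine, a best-effort guess of the LAN IP can be made with the standard socket module. This is a sketch only: the function names are hypothetical, the target 192.0.2.1 is a documentation-only address used just to select a route, and the result may be wrong on machines with several network interfaces.

```python
import socket


def guess_lan_ip():
    """Best-effort guess of this machine's LAN address; falls back to 127.0.0.1."""
    sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    try:
        # Connecting a UDP socket sends no packets; it only selects a route.
        sock.connect(("192.0.2.1", 80))
        return sock.getsockname()[0]
    except OSError:
        return "127.0.0.1"
    finally:
        sock.close()


def webui_url(ip, port=7860):
    """Build the URL for the port used in the launch line above."""
    return f"http://{ip}:{port}"


print(webui_url(guess_lan_ip()))
```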
The webui has no English localization, but Immersive Translate would be helpful.
Most parameters would work well with the default value. Refer to this and this to make changes.
Upload these 5 files:
main model.pt and its config.json
diffusion model.pt and its diffusion.yaml
Either cluster model kmeans_10000.pt for speaking or feature retrieval feature_and_index.pkl for singing.
F0 predictor is for speaking only, not for singing. RMVPE is recommended when using it.
Pitch change is useful when singing a feminine song using a model with a masculine voice, or vice versa.
Clustering model/feature retrieval mixing ratio controls the tone. Use 0.1 to get the clearest speech, and use 0.9 to get the closest tone to the model.
Shallow diffusion steps should be set to around 50; it enhances the result at 30-100 steps.
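For a feel of what a given Pitch change value does, here is a small worked example. It assumes the field counts semitones in equal temperament (so 12 is one octave up); that unit is an assumption here and worth checking against the webui's own label.

```python
def pitch_ratio(semitones):
    """Frequency ratio for a shift of `semitones` in equal temperament."""
    return 2 ** (semitones / 12)


for shift in (-12, -5, 0, 5, 12):
    print(f"{shift:+d} semitones -> x{pitch_ratio(shift):.3f}")
```

Under that assumption, +12 doubles every frequency and -12 halves it, so moving a masculine voice toward a feminine song usually means a positive value.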
This procedure is optional; it is only for producing a better song.
I won’t go into details about this since the audio editing software, or so-called DAW (digital audio workstation), that I’m using is non-free. I have no intention to advocate proprietary software even though the entire industry is paywalled and closed-source.
Audacity supports multitrack, effects, and a lot more. It does load some advanced VST plugins as well.
It’s not hard to find tutorials on mastering songs with Audacity.
Typically, the mastering process consists of mixing/balancing, EQ/compressing, reverb, and imaging. The more advanced the tool is, the easier the process will be.
I’ll definitely spend more time on adopting Audacity for my mastering process in the future, and I recommend everyone do so.
This is a so-vits-svc fork with real-time support, and the models are compatible. It is easier to use but does not support the diffusion model. For dedicated real-time voice changing, a voice-changer is recommended.
conda create -n so-vits-svc-fork python=3.10 pip
conda activate so-vits-svc-fork
git clone https://github.com/voicepaw/so-vits-svc-fork
cd so-vits-svc-fork
python -m pip install -U pip setuptools wheel
pip install -U torch torchaudio --index-url https://download.pytorch.org/whl/cu118
pip install -U so-vits-svc-fork
pip install click
sudo apt-get install libportaudio2
Put dataset .wav files into so-vits-svc-fork/dataset_raw
svc pre-resample
svc pre-config
Edit batch_size in configs/44k/config.json. The default in this fork is larger than in the original.
svc pre-hubert
svc train -t
svc train-cluster
Use the GUI with svcg. This requires a local desktop environment.
Or use the CLI with svc vc for real time and svc infer -m "logs/44k/xxxxx.pth" -c "configs/config.json" raw/xxx.wav for generating.
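When there are many files to convert, the svc infer call above can be repeated per file. The sketch below only builds the argument lists, using the same -m and -c flags as in the text; the function name and the folder layout are hypothetical.

```python
from pathlib import Path


def infer_commands(raw_dir, model_path, config_path):
    """Build one `svc infer` argument list per .wav file in raw_dir."""
    commands = []
    for wav in sorted(Path(raw_dir).glob("*.wav")):
        commands.append(
            ["svc", "infer", "-m", str(model_path), "-c", str(config_path), str(wav)]
        )
    return commands
```

Each list could then be handed to subprocess.run(cmd, check=True) inside the activated so-vits-svc-fork environment.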
DDSP-SVC requires fewer hardware resources and runs faster than so-vits-svc. It supports both real-time and diffusion models (Diff-SVC).
conda create -n DDSP-SVC python=3.8
conda activate DDSP-SVC
git clone https://github.com/yxlllc/DDSP-SVC
cd DDSP-SVC
pip3 install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cu118
pip install -r requirements.txt
Refer to the Initialization section for the two files:
pretrain/rmvpe/model.pt
pretrain/contentvec/checkpoint_best_legacy_500.pt
python draw.py
python preprocess.py -c configs/combsub.yaml
python preprocess.py -c configs/diffusion-new.yaml
Edit configs/
batch_size: 32 (16 for diffusion)
cache_all_data: false
cache_device: 'cuda'
cache_fp16: false
conda activate DDSP-SVC
python train.py -c configs/combsub.yaml
python train_diff.py -c configs/diffusion-new.yaml
tensorboard --logdir=exp
It’s recommended to use main_diff.py since it includes both the DDSP and diffusion models.
python main_diff.py -i "input.wav" -diff "model_xxxxxx.pt" -o "output.wav"
Real-time GUI for voice cloning:
python gui_diff.py
This is a TTS tool that is completely different from everything above. By using it, I have already created several audiobooks with my voice for my parents, and they really enjoy it.
Instead of using the original, I used the fork by v3u for an easier setup.
conda create -n bert-vits2 python=3.9
conda activate bert-vits2
git clone https://github.com/v3ucn/Bert-vits2-V2.3.git
cd Bert-vits2-V2.3
pip install torch==2.0.1+cu118 torchvision==0.15.2+cu118 torchaudio==2.0.2+cu118 --index-url https://download.pytorch.org/whl/cu118
pip install -r requirements.txt
Download pre-trained models (includes Chinese, Japanese, and English):
wget -P slm/wavlm-base-plus/ https://huggingface.co/microsoft/wavlm-base-plus/resolve/main/pytorch_model.bin
wget -P emotional/clap-htsat-fused/ https://huggingface.co/laion/clap-htsat-fused/resolve/main/pytorch_model.bin
wget -P emotional/wav2vec2-large-robust-12-ft-emotion-msp-dim/ https://huggingface.co/audeering/wav2vec2-large-robust-12-ft-emotion-msp-dim/resolve/main/pytorch_model.bin
wget -P bert/chinese-roberta-wwm-ext-large/ https://huggingface.co/hfl/chinese-roberta-wwm-ext-large/resolve/main/pytorch_model.bin
wget -P bert/bert-base-japanese-v3/ https://huggingface.co/cl-tohoku/bert-base-japanese-v3/resolve/main/pytorch_model.bin
wget -P bert/deberta-v3-large/ https://huggingface.co/microsoft/deberta-v3-large/resolve/main/pytorch_model.bin
wget -P bert/deberta-v3-large/ https://huggingface.co/microsoft/deberta-v3-large/resolve/main/pytorch_model.generator.bin
wget -P bert/deberta-v2-large-japanese/ https://huggingface.co/ku-nlp/deberta-v2-large-japanese/resolve/main/pytorch_model.bin
Create a character model folder: mkdir -p Data/xxx/models/
Download base models:
wget -P Data/xxx/models/ https://huggingface.co/OedoSoldier/Bert-VITS2-2.3/resolve/main/DUR_0.pth
wget -P Data/xxx/models/ https://huggingface.co/OedoSoldier/Bert-VITS2-2.3/resolve/main/D_0.pth
wget -P Data/xxx/models/ https://huggingface.co/OedoSoldier/Bert-VITS2-2.3/resolve/main/G_0.pth
wget -P Data/xxx/models/ https://huggingface.co/OedoSoldier/Bert-VITS2-2.3/resolve/main/WD_0.pth
#More options
https://openi.pcl.ac.cn/Stardust_minus/Bert-VITS2/modelmanage/model_filelist_tmpl?name=Bert-VITS2_2.3%E5%BA%95%E6%A8%A1
https://huggingface.co/Erythrocyte/bert-vits2_base_model/tree/main
https://huggingface.co/OedoSoldier/Bert-VITS2-2.3/tree/main
Edit train_ms.py by replacing all bfloat16 with float16
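The replacement can also be done with a few lines of Python instead of a text editor. This is only a sketch: the function name is hypothetical, and it writes a .bak copy next to the file before overwriting, which is a precaution added here rather than something the project requires.

```python
from pathlib import Path


def swap_bfloat16(path):
    """Replace every `bfloat16` with `float16` in the file; return the count."""
    source = Path(path)
    text = source.read_text(encoding="utf-8")
    count = text.count("bfloat16")
    if count:
        # Keep a backup next to the original before overwriting it.
        Path(str(source) + ".bak").write_text(text, encoding="utf-8")
        source.write_text(text.replace("bfloat16", "float16"), encoding="utf-8")
    return count
```

Calling swap_bfloat16("train_ms.py") from the repository folder returns how many occurrences were changed; a second run returns 0.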
Edit webui.py for LAN access:
webbrowser.open("http://localhost:7860")
app.launch(server_name="0.0.0.0", server_port=7860)
Edit Data/xxx/config.json for batch_size and spk2id
Similar workflow as in the previous section.
Remove noise and silence, normalize, then put the un-sliced WAV file into Data/xxx/raw.
Edit config.yml for dataset_path, num_workers, and keep_ckpts.
Run python3 audio_slicer.py to slice the WAV file.
Clean the dataset (Data/xxx/raw) by removing small files that are under 2 sec.
Install whisper: pip install git+https://github.com/openai/whisper.git
To turn off language auto-detection, set it to English only, and use the large model, edit short_audio_transcribe.py as below:
# set the spoken language to english
print('language: en')
lang = 'en'
options = whisper.DecodingOptions(language='en')
result = whisper.decode(model, mel, options)
# set to use large model
parser.add_argument("--whisper_size", default="large")
# Solve the error "Given groups=1, weight of size [1280, 128, 3], expected input[1, 80, 3000] to have 128 channels, but got 80 channels instead" when using the large model
mel = whisper.log_mel_spectrogram(audio, n_mels=128).to(model.device)
Run python3 short_audio_transcribe.py to start the transcription.
Re-sample the sliced dataset: python3 resample.py --sr 44100 --in_dir ./Data/zizek/raw/ --out_dir ./Data/zizek/wavs/
Preprocess transcription: python3 preprocess_text.py --transcription-path ./Data/zizek/esd.list
Generate BERT feature config: python3 bert_gen.py --config-path ./Data/zizek/configs/config.json
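Before preprocessing, it can be worth scanning the transcription list for broken lines. The sketch below assumes each line of esd.list has four pipe-separated fields, path|speaker|language|text; that layout is an assumption about the file and should be checked against the actual output of the transcription step. The function name is hypothetical.

```python
def parse_transcript_line(line):
    """Split one `path|speaker|language|text` line into a dict, or raise ValueError."""
    # Split at most three times so a "|" inside the text stays in the text.
    parts = line.rstrip("\n").split("|", 3)
    if len(parts) != 4 or not all(part.strip() for part in parts):
        raise ValueError(f"malformed line: {line!r}")
    path, speaker, language, text = parts
    return {"path": path, "speaker": speaker, "language": language, "text": text}


print(parse_transcript_line("./Data/xxx/wavs/a.wav|xxx|EN|Hello there.\n"))
```

Running it over every line of the file and collecting the ValueError cases gives a quick list of entries to fix by hand.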
Run python3 train_ms.py to start training
Edit config.yml for model path:
model: "models/G_20900.pth"
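To find which checkpoint to put into that model line, the newest generator file can be picked by its step number. This is a sketch that assumes the checkpoints are named G_<step>.pth, as in the example path above; the function name is hypothetical.

```python
import re
from pathlib import Path


def latest_generator(models_dir):
    """Return the G_<step>.pth path with the highest step, or None if absent."""
    best_step, best_path = -1, None
    for path in Path(models_dir).glob("G_*.pth"):
        match = re.fullmatch(r"G_(\d+)\.pth", path.name)
        # Compare steps as integers so G_20900 ranks above G_900.
        if match and int(match.group(1)) > best_step:
            best_step, best_path = int(match.group(1)), path
    return best_path
```

The highest step is not always the best-sounding one, so it may still be worth comparing a few checkpoints by ear.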
Run python3 webui.py to start webui for inference
vits-simple-api is a web frontend for using trained models. I use this mainly for its long text support which the original project doesn’t have.
git clone https://github.com/Artrajz/vits-simple-api
cd vits-simple-api
git pull https://github.com/Artrajz/vits-simple-api
conda create -n vits-simple-api python=3.10 pip
conda activate vits-simple-api &&
pip install -r requirements.txt
(Optional) Copy pre-trained model files from Bert-vits2-V2.3/ to vits-simple-api/bert_vits2/
Copy Bert-vits2-V2.3/Data/xxx/models/G_xxxxx.pth and Bert-vits2-V2.3/Data/xxx/config.json to vits-simple-api/Model/xxx/
Edit config.py for MODEL_LIST and Default parameter as preferred
Edit Model/xxx/config.json as below:
"data": {
"training_files": "Data/train.list",
"validation_files": "Data/val.list",
"version": "2.3"
Check/Edit model_list in config.yml as [xxx/G_xxxxx.pth, xxx/config.json]
Run python app.py
SDP Ratio for tone, Noise for randomness, Noise_W for pronunciation, and Length for speed. emotion and style are self-explanatory.
In its Hugging Face repo, there are a lot of VITS models shared by others. You can try it out first, and then download the desired models from Files.
The Genshin model is widely used in some content creation communities because of its high quality. It contains hundreds of characters, although only Chinese and Japanese are supported.
In another repo, there are a lot of Bert-vits2 models that are made from popular Chinese streamers and VTubers.
There are already projects making AI Vtubers like this and this. I’m looking forward to seeing how this technology can change the industry in the near future.
https://techshinobi.org/posts/voice-vits/