Let's take a look at how one may set up an ad-hoc, local (offline) text-to-speech synthesizer using piper, piper-whistle and named pipes.
To set up piper on a GNU/Linux-based system, I'll describe a general architecture using named pipes, which is straightforward enough to allow for system-wide text-to-speech, with a little bit of manual setup, the help of piper-whistle and some minor trade-offs (it's simple, yet it won't support parallel speech processing).
To start, let's fetch the latest stand-alone piper build from its repository hosted on GitHub (2023.11.14-2 at the time of writing). After downloading the compressed archive, we'll create a directory structure for our piper setup. The root directory shall be at /opt/wind, with the following sub-directories:

/opt/wind/piper (will house the piper build)
/opt/wind/channels (will contain the named pipes)

After decompressing, the piper executable should be available at /opt/wind/piper/piper, along with the accompanying libraries and espeak-ng-data.
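On an x86_64 machine, fetching and unpacking the build might look like the following sketch (the exact asset name is taken from the release page and may differ for your architecture):

# Create the channels directory; /opt/wind/piper comes out of the archive.
sudo mkdir -p /opt/wind/channels
# Fetch the 2023.11.14-2 stand-alone build (x86_64 assumed).
wget https://github.com/rhasspy/piper/releases/download/2023.11.14-2/piper_linux_x86_64.tar.gz
# Unpack into /opt/wind, which yields /opt/wind/piper.
sudo tar -xzf piper_linux_x86_64.tar.gz -C /opt/wind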
For managing the voice models used by piper, I'd recommend piper-whistle, a command-line utility written in Python that makes downloading and managing voices more convenient. You can get the latest wheel from its GitLab or GitHub release pages, or install the most recent release from PyPI via pip install -U piper-whistle.
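If you'd rather keep it isolated from your system's Python packages, a virtual environment works just as well (a sketch; the path ~/.venvs/whistle is an arbitrary choice):

python3 -m venv ~/.venvs/whistle
. ~/.venvs/whistle/bin/activate
pip install -U piper-whistle
piper_whistle -h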
After installing piper-whistle, let's fetch a voice to generate some speech. First, we'll update the voice database by calling piper_whistle -vR. For English speech, I quite like the female voice called alba. Using whistle, we can list all available English (GB) voices with piper_whistle list -l en_GB. The alba voice sits at index 2, so to install it, simply call piper_whistle install en_GB 2.
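Put together, the voice setup boils down to three calls:

# Refresh the locally cached voice database.
piper_whistle -vR
# List the available en_GB voices (alba appears at index 2).
piper_whistle list -l en_GB
# Install the voice at index 2 (alba).
piper_whistle install en_GB 2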
Next, let's create the necessary named pipes. The resulting structure will look like this:
/opt/wind/channels/speak (accepts JSON payload)
/opt/wind/channels/input (read by piper)
/opt/wind/channels/output (written by piper)
To create a named pipe, you may use the following command: mkfifo -m 755 /opt/wind/channels/input
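The other two channels are created the same way, so a small loop covers all three:

for ch in speak input output; do
  mkfifo -m 755 "/opt/wind/channels/$ch"
done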
Finally, we start three processes, each in its own shell (call them tty0, tty1 and tty2):
tail -F /opt/wind/channels/speak | tee /opt/wind/channels/input
/opt/wind/piper/piper -m $(piper_whistle path alba@medium) --debug --json-input --output_raw < /opt/wind/channels/input > /opt/wind/channels/output
aplay --buffer-size=777 -r 22050 -f S16_LE -t raw < /opt/wind/channels/output
The process on tty0 makes sure the pipe is kept open even after piper or aplay have finished processing. This way, we can queue TTS requests and subsequently play or save them.
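If juggling three shells feels cumbersome, the same pipeline can be launched from a single helper script. Below is a minimal sketch; the file name and the backgrounding scheme are my own choices, not part of piper or piper-whistle:

#!/usr/bin/env sh
# start-tts.sh (hypothetical): launch the whole TTS pipeline.
# Keep the speak pipe open and mirror incoming requests into piper's input pipe.
tail -F /opt/wind/channels/speak | tee /opt/wind/channels/input &
# piper reads JSON requests and emits raw PCM samples.
/opt/wind/piper/piper -m "$(piper_whistle path alba@medium)" \
  --debug --json-input --output_raw \
  < /opt/wind/channels/input > /opt/wind/channels/output &
# Play the raw stream: 22.05 kHz, signed 16-bit little-endian.
aplay --buffer-size=777 -r 22050 -f S16_LE -t raw < /opt/wind/channels/output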
Since piper-whistle offers additional features built around exactly this structure, we can now generate speech via piper_whistle speak "This is quite neat". On systems running X11, you can generate a spoken version of your clipboard's contents via piper_whistle speak "$(xsel --clipboard --output)".
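Since the speak channel accepts raw JSON payloads, you can also bypass whistle and write a request by hand; assuming piper's JSON input format with a text field, a request could look like this:

echo '{ "text": "Hello from the named pipe." }' > /opt/wind/channels/speak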