Coder against injustice - Enthusiast of all things meta
Today's post explains Twilio's recent bidirectional streaming feature, which makes it possible to receive the call audio while still being able to send audio for playback asynchronously.
This was not possible before, when communication had to be unidirectional: either you could receive the voice audio but could not respond to it, or you could send audio to the call but not get anything back.
Let's get right into the tutorial. You need to understand three things if you hope to use Twilio in conjunction with your own Voice Assistant (abbreviated VA).
1. How the backend of the VA receives the user's voice and returns audio responses (HTTP or Websocket? Streaming or one-off? In which audio format?)
2. How to serve calls using a node.js server (How can a phone call to a twilio number perform a request to a Node server?)
3. How to open and use a bidirectional communication channel between node.js code and the actual phone call (Through which protocol?)
We will tackle these questions one at a time. But first, let's see the prerequisites:
* Some experience with node.js HTTP servers and websockets
* A public server with an open port
* SSL keys for the domain of the server
* A Twilio account with a Twilio number, either toll-free or house number
Now, with that said, let's move on to the tutorial!
Most Voice Assistants work in turns. That is, they receive a user utterance (a user phrase), they generate the answer as voice audio, they send it back to the user and then wait for the next utterance, which starts the next dialogue turn.
The user cannot speak again before getting the response from the VA. So, we mainly have two data flows, the voice data from the client to the VA (flow A) and the synthesized voice response from the VA (flow B). This turn-taking model is illustrated in the following graphic:
The next thing to know about our presupposed VA model is that flow A is an audio stream, whereas flow B is sent as a single message. This follows from the way audio is generated: user voice audio is produced sequentially and can thus be streamed. On the other hand, the VA's synthesized voice response is generated as a whole, not in parts: its first second only becomes available once the whole message is available.
Finally, turn-taking is managed using two signals. The first one, RESTART_COMMUNICATION, is sent by the client at the beginning of the user's turn (start of flow A). The other signal is SINGLE_UTTERANCE_END. It is sent by the VA backend as soon as it recognizes the end of a single phrase in the incoming audio (alternatively, this can be controlled by the client as well). The SINGLE_UTTERANCE_END signal is sent to the client to prevent it from sending further audio. The process can be seen in detail in the flow diagram below:
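In code, the client side of this turn-taking can be sketched roughly as follows. This is only an illustration under my own assumptions: the two signal names come from the description above, while `createTurnState`, `startUserTurn`, `onAudioChunk` and `onVaSignal` are hypothetical names made up for the example.

```javascript
// The two signals described above.
const RESTART_COMMUNICATION = "RESTART_COMMUNICATION";
const SINGLE_UTTERANCE_END = "SINGLE_UTTERANCE_END";

// Hypothetical per-call state: whose turn it is, plus a log of what was sent to the VA.
function createTurnState() {
  return { userTurn: false, sentToVa: [] };
}

// Client: open the user's turn (start of flow A).
function startUserTurn(state) {
  state.userTurn = true;
  state.sentToVa.push(RESTART_COMMUNICATION);
}

// Client: forward an audio chunk only while it is the user's turn.
// Returns true if the chunk was forwarded, false if it was dropped.
function onAudioChunk(state, chunk) {
  if (!state.userTurn) return false;
  state.sentToVa.push(chunk);
  return true;
}

// Client: the VA signalled the end of the utterance, so stop sending audio.
function onVaSignal(state, signal) {
  if (signal === SINGLE_UTTERANCE_END) state.userTurn = false;
}
```

Audio that arrives outside the user's turn is simply dropped here; a real client might buffer it instead.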
When I first tried to connect our team's VA to Twilio, one of the first things I checked was Twilio's audio format. I had hoped for an option for wav or mp3 encodings. Alas! The only supported format is that mysterious mulaw (or u-law or μ-law) encoding, which, as I learned, is designed specifically for speech audio.
I also learned that it is not widely used outside of telephony; chances are that in your system you won't have mulaw, which means that you will have to do some audio conversion. Although this might take a significant amount of work, I will not cover this part. I will only mention that in Python you can convert wav audio to mulaw by using the audioop module, as in this code snippet.
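    import audioop  # note: removed from the standard library in Python 3.13
    import wave

    # wavfile: path to the input WAV file (assumed to be mono)
    with wave.open(wavfile, "rb") as wav:
        # load frames
        raw_wav = wav.readframes(wav.getnframes())
        # downsample to 8 kHz; the arguments after the data are:
        #   sample depth in bytes (2 in my case), number of channels (1),
        #   source samplerate (24000), desired samplerate (8000), resampler state
        raw_wav_8khz, st = audioop.ratecv(
            raw_wav, wav.getsampwidth(), wav.getnchannels(),
            wav.getframerate(), 8000, None)
        # convert to mulaw
        raw_ulaw = audioop.lin2ulaw(raw_wav_8khz, wav.getsampwidth())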
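If you would rather stay in node.js, the encoding step alone can be sketched as below. This is the commonly used G.711 μ-law compression routine for 16-bit samples, not code from the project; the function names `linearToMulaw` and `pcm16ToMulaw` are my own, and resampling to 8 kHz is assumed to have happened already.

```javascript
const BIAS = 0x84;  // 132, added before locating the segment
const CLIP = 32635; // largest magnitude that still fits after adding the bias

// Encode one signed 16-bit PCM sample as one mu-law byte.
function linearToMulaw(sample) {
  const sign = (sample >> 8) & 0x80; // 0x80 for negative samples, 0 otherwise
  if (sign !== 0) sample = -sample;
  if (sample > CLIP) sample = CLIP;
  sample += BIAS;
  // Find the segment (exponent): position of the highest set bit, from bit 14 down.
  let exponent = 7;
  for (let mask = 0x4000; (sample & mask) === 0 && exponent > 0; mask >>= 1) {
    exponent--;
  }
  const mantissa = (sample >> (exponent + 3)) & 0x0f;
  // mu-law bytes are stored bit-inverted.
  return ~(sign | (exponent << 4) | mantissa) & 0xff;
}

// Encode a whole Int16Array of 8 kHz mono samples into a Buffer of mu-law bytes.
function pcm16ToMulaw(samples) {
  const out = Buffer.alloc(samples.length);
  for (let i = 0; i < samples.length; i++) {
    out[i] = linearToMulaw(samples[i]);
  }
  return out;
}
```

Calling `pcm16ToMulaw(samples).toString("base64")` then gives a string in the base64 form that Twilio's media messages carry.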
From this chapter onwards, we will be discussing the Twilio server itself.
To insert programming logic into your voice call, you have to use a node server. Below you can see the project structure of the tutorial.
This is a node project and the server is its entry point. There are some xml files with the server's TwiML responses (don't worry if you don't know what that is) in the templates directory. Also, there is a keys folder, with the server HTTPS keys. Last, but not least, the npm dependencies are:
If you are experienced with node.js, you can skip to the last part of the article ("putting it all together").
Before understanding what the node server does, let's create a simple Twilio server and connect it to a Twilio number in a baby example. This baby example will later grow into our fully functional Twilio front-end for our VA!
Every node.js server for Twilio has an endpoint returning some TwiML. In our case, the server will have an endpoint on path /twiml accepting HTTP POST and responding with the contents of the following TwiML file:
    <Response>
      <Say>This is your Voice Assistant speaking!</Say>
    </Response>
What does the TwiML do? Simply put, it tells Twilio how to respond to the phone call. In this case, it will merely say "This is your Voice Assistant speaking!" with a synthesized voice, before hanging up. This is the essential file for Twilio.
So, before you continue, make sure you have this file. Also, you need to make sure the dependencies mentioned above are installed. Finally, it is necessary to spin up a server to serve this TwiML. The code responsible for serving the TwiML is as follows:
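As a minimal sketch of such a server, here is a request handler written against Node's built-in http/https interface. It is an illustration under my own assumptions, not the project's actual code: the name `createTwimlHandler` is made up, and the TwiML is inlined instead of being read from the templates directory so that the snippet runs on its own.

```javascript
// Assumption: in the real server this string would be read from an .xml file
// in the templates directory; it is inlined here to keep the sketch runnable.
const TWIML = [
  "<Response>",
  "  <Say>This is your Voice Assistant speaking!</Say>",
  "</Response>",
].join("\n");

// Returns a request handler compatible with Node's http and https servers.
function createTwimlHandler(twiml) {
  return function handleRequest(req, res) {
    if (req.method === "POST" && req.url === "/twiml") {
      // Twilio expects the TwiML back as XML in the response body.
      res.writeHead(200, { "Content-Type": "text/xml" });
      res.end(twiml);
    } else {
      res.writeHead(404, { "Content-Type": "text/plain" });
      res.end("Not found");
    }
  };
}

const handler = createTwimlHandler(TWIML);

// Wiring (not executed here, because it needs real key and certificate files):
// require("https").createServer({ key, cert }, handler).listen(1314);
```

Keeping the handler separate from the server object makes it easy to try out without certificates, and the same HTTPS server can later be handed to the websocket server.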
Store the code in the server's entry-point file. The server's work is done in the request handler: the TwiML file is read and then sent back to the client as the HTTP POST response body. Now you have to install the dependencies (npm install) and run the server:
Now you should be able to access the TwiML with:
curl -X POST https://your.host:1314/twiml
This should return:
    <Response>
      <Say>This is your Voice Assistant speaking!</Say>
    </Response>
To make this work, one final step is needed. First, install the twilio-cli and set up your Twilio credentials. Then, connect the endpoint of the server you just made with your Twilio number:
twilio phone-numbers:update +111234567890 --voice-url https://your.host:1314/twiml
Now, if you have configured your Twilio number correctly, you should be able to call it and listen to the message you have specified with the <Say> tag.
Now it's time to change the TwiML to allow bidirectional communication with the call:
    <?xml version="1.0" encoding="UTF-8" ?>
    <Response>
      <Say>This is your Voice Assistant speaking!</Say>
      <Connect>
        <Stream url="wss://your.host:1314/socket"></Stream>
      </Connect>
    </Response>
Save the new TwiML in a new file inside the templates directory.
What is this new sorcery? In simple terms, <Stream> is used to set the websocket to which the call will connect and <Connect> is used to make this websocket bidirectional. The latter tag is one of the additions made in July 2020, which made it easy to use Twilio with VAs.
As you can see, the websocket address has the same host and port as before. The reason for this is that we will handle this connection through the same server that we use for serving the TwiML. We will call this websocket the Twilio websocket. You need one more, for communicating with the VA. We will call this websocket the VA websocket. In the schema below, note the role of the node server in the middle:
In essence, the Twilio server is just middleware between the Twilio cloud service (which sends voice audio and receives what should be played back to the human on the phone) and our VA. It is, at the same time, a server for Twilio to connect to, and a VA client.
Both websockets are mostly described by their message handler; they do not perform any special actions when closed, nor do they do anything special when opened, apart from binding the message handlers.
Let's jump right into the code that handles incoming Twilio messages:
Twilio's messages are named events, and there are 5 types of events: connected, start, media, stop and mark. They are described in detail in Twilio's docs.
The connected and stop events are just one-off events sent at the beginning and the end of the call.
The start event only occurs once too. When using the <Stream> tag, Twilio sends voice audio in base64-encoded chunks, preceded by a single start message. The start message is handy for getting the streamSid of the call (a unique ID used when sending stuff to Twilio) and, of course, it signifies the start of the voice stream.
This is a good time to start the communication with the VA using the RESTART_COMMUNICATION signal. It is also necessary to switch the handler's mode attribute at this point. This attribute has two values: one for the time during which flow A is being sent, and "waiting" for the rest of the time. "Waiting" is not a really good name though, as there is stuff going on in the waiting phase too, as we will see in the next section.
The media event is sent many times: it contains chunks of streaming voice audio, recorded from the user's phone. This kind of message should be sent to the VA. Also, any necessary preprocessing should happen here, e.g. changing the encoding.
The mark event carries a message. In Twilio it is often used for notifications. In this case, it is received from the Twilio call when playback of a VA's response has ended (if you're feeling confused, things will become clearer when you read the next section). Here, a new conversation turn starts: the user can speak again. The mode attribute is set back to its flow A value and the RESTART_COMMUNICATION signal is sent to the VA.
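The event handling described above can be sketched as a single function. The event names and the streamSid and payload fields are Twilio's; everything else is my own assumption for the sake of the example: the `session` object, the `"streaming"` value of its `mode` attribute, and the choice to forward the base64 audio to the VA as a plain string.

```javascript
// Handles one raw (JSON string) message coming from the Twilio websocket.
// `session` is a hypothetical object: { mode, streamSid, vaSocket }.
function handleTwilioMessage(session, rawMessage) {
  const msg = JSON.parse(rawMessage);
  switch (msg.event) {
    case "connected":
    case "stop":
      // One-off events at the beginning and the end of the call.
      break;
    case "start":
      // Remember the stream ID; it is needed when sending audio back to Twilio.
      session.streamSid = msg.start.streamSid;
      session.mode = "streaming";
      session.vaSocket.send("RESTART_COMMUNICATION");
      break;
    case "media":
      // Forward user audio only while flow A is active.
      if (session.mode === "streaming") {
        // The payload is base64 mulaw audio; any re-encoding would happen here.
        session.vaSocket.send(msg.media.payload);
      }
      break;
    case "mark":
      // Playback of the VA's response has finished: a new turn starts.
      session.mode = "streaming";
      session.vaSocket.send("RESTART_COMMUNICATION");
      break;
    default:
      break; // ignore anything unexpected in this sketch
  }
}
```

Passing the session in as an argument keeps the function independent of any particular websocket library, so it can be exercised with a stub object whose `send` just records messages.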
Next is the handler of the VA websocket:
Similarly to the handler of the Twilio socket, this one accepts only string messages: it handles the SINGLE_UTTERANCE_END messages we talked about before and the audio responses from the VA. The former are used to stop the recorded audio from the call from being sent to the VA.
The latter carry an identifying prefix and are sent to Twilio, to be played back to the user. But this is not enough: a mark message is also sent right after the audio. This step is important, because it tells Twilio to notify the server as soon as the sound clip that was just sent has been played to the user. Notice the name field: its value is the same as the name in the mark message Twilio will send afterwards.
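A rough sketch of this handler follows. The shape of the media and mark messages is the one Twilio expects on a bidirectional stream; the rest is assumed for illustration only: the `"AUDIO:"` prefix, the mark name and the `session` object are placeholders, so substitute whatever your VA actually sends.

```javascript
const AUDIO_PREFIX = "AUDIO:";       // hypothetical prefix marking an audio response
const MARK_NAME = "response-played"; // hypothetical label echoed back by Twilio

// Handles one string message coming from the VA websocket.
// `session` is a hypothetical object: { mode, streamSid, twilioSocket }.
function handleVaMessage(session, message) {
  if (message === "SINGLE_UTTERANCE_END") {
    // The VA heard a complete phrase: stop forwarding call audio.
    session.mode = "waiting";
    return;
  }
  if (message.startsWith(AUDIO_PREFIX)) {
    // Assumed to be base64-encoded mulaw audio at 8 kHz.
    const payload = message.slice(AUDIO_PREFIX.length);
    // 1. The audio itself, to be played back to the caller.
    session.twilioSocket.send(JSON.stringify({
      event: "media",
      streamSid: session.streamSid,
      media: { payload },
    }));
    // 2. A mark, so that Twilio notifies us once the audio has been played.
    session.twilioSocket.send(JSON.stringify({
      event: "mark",
      streamSid: session.streamSid,
      mark: { name: MARK_NAME },
    }));
  }
}
```

When playback finishes, Twilio sends back a mark event carrying the same name, which is what triggers the next turn.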
Putting it all together
Now it's time to unify things into one big working server. The suggested solution is to create a class named MediaStream, and create an instance of this class for every incoming Twilio connection. The two websockets we already described are attributes of MediaStream. The setting up of the sockets, i.e. the connection of the client to the VA and the binding of onMessage(), onOpen(), onError() etc., is done in the class constructor.
1. The <Stream> tag only supports wss (the secure websocket protocol). That means that you must have an HTTPS server with the keys ready in a separate keys folder (lines 1-15).
By the way, the options object used to create the server also contains the path for the Twilio websocket endpoint provided in the TwiML. This HTTPS server will be fed into the constructor of the WebsocketServer object. The latter will listen to the endpoint specified in the TwiML (lines 16-20).
2. The code in lines 26-47 is the same as in the first part: it serves the TwiML stored in the .xml file.
3. The next part (lines 49-81) might look a bit perplexing. Essentially, it is the code that establishes the connections with Twilio and the VA. A bit more explanation is due:
Everything starts when Twilio is connected to the websocket endpoint. This causes the creation of a MediaStream object, with the connection object as an argument. The code in the constructor ensures that the processing of messages from Twilio will start after the websocket with the VA has been opened.
This is achieved by binding a function to the onConnected event, in line 61: when the connection with the VA is complete, it is safe to bind the message handlers to the websocket connections.
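The ordering trick can be sketched like this. It is a simplified stand-in for the real class, under assumptions of mine: both sockets are taken to expose an EventEmitter-style `on(event, listener)` method, the VA socket is taken to emit `"open"` once connected, and the two handlers are passed in as a hypothetical `handlers` object.

```javascript
class MediaStream {
  // twilioConnection: the websocket connection opened by Twilio.
  // vaSocket: the client websocket towards the VA.
  // handlers: { onTwilioMessage(stream, msg), onVaMessage(stream, msg) } (hypothetical).
  constructor(twilioConnection, vaSocket, handlers) {
    this.twilioSocket = twilioConnection;
    this.vaSocket = vaSocket;
    this.mode = "waiting";
    this.streamSid = null;

    // Bind the message handlers only once the VA websocket is open, so that
    // no Twilio message is processed before the VA is reachable.
    vaSocket.on("open", () => {
      twilioConnection.on("message", (msg) => handlers.onTwilioMessage(this, msg));
      vaSocket.on("message", (msg) => handlers.onVaMessage(this, msg));
    });
  }
}
```

One instance would be created per incoming Twilio connection, for example inside the websocket server's connection callback. Messages that Twilio sends before the VA socket opens are not handled in this sketch.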
That's all for today. Let me know how it went, and feel free to ask any questions in the comments section. Happy coding!