While OpenAI is delaying the release of the advanced Voice Mode for ChatGPT, I want to share how we built our LLM voice application and integrated it into an interactive booth.
At the end of February, Bali hosted the Lampu festival, organized on the principles of the famous Burning Man. Following its tradition, participants create their own installations and art objects.
My friends from Camp 19:19 and I, inspired by the idea of Catholic confessionals and the capabilities of the current LLMs, came up with the idea of building our own AI confessional, where anyone could talk to an artificial intelligence.
Here's how we envisioned it at the very beginning:
To test the concept and start experimenting with a prompt for the LLM, I created a naive implementation in one evening:
To implement this demo, I relied entirely on cloud models from OpenAI: Whisper, GPT-4, and TTS. Thanks to the excellent speech_recognition library, I built the demo in just a few dozen lines of code.
import os
import asyncio
from dotenv import load_dotenv
from io import BytesIO
from openai import AsyncOpenAI
from soundfile import SoundFile
import sounddevice as sd
import speech_recognition as sr

load_dotenv()

aiclient = AsyncOpenAI(
    api_key=os.environ.get("OPENAI_API_KEY")
)

SYSTEM_PROMPT = """
You are a helpful assistant.
"""


async def listen_mic(recognizer: sr.Recognizer, microphone: sr.Microphone):
    # Block until a phrase is captured, then wrap it as an in-memory WAV file
    audio_data = recognizer.listen(microphone)
    wav_data = BytesIO(audio_data.get_wav_data())
    wav_data.name = "SpeechRecognition_audio.wav"
    return wav_data


async def say(text: str):
    res = await aiclient.audio.speech.create(
        model="tts-1",
        voice="alloy",
        response_format="opus",
        input=text
    )
    buffer = BytesIO()
    for chunk in res.iter_bytes(chunk_size=4096):
        buffer.write(chunk)
    buffer.seek(0)
    with SoundFile(buffer, 'r') as sound_file:
        data = sound_file.read(dtype='int16')
        sd.play(data, sound_file.samplerate)
        sd.wait()


async def respond(text: str, history):
    history.append({"role": "user", "content": text})
    completion = await aiclient.chat.completions.create(
        model="gpt-4",
        temperature=0.5,
        messages=history,
    )
    response = completion.choices[0].message.content
    await say(response)
    history.append({"role": "assistant", "content": response})


async def main() -> None:
    m = sr.Microphone()
    r = sr.Recognizer()
    messages = [{"role": "system", "content": SYSTEM_PROMPT}]
    with m as source:
        r.adjust_for_ambient_noise(source)
        while True:
            wav_data = await listen_mic(r, source)
            transcript = await aiclient.audio.transcriptions.create(
                model="whisper-1",
                temperature=0.5,
                file=wav_data,
                response_format="verbose_json",
            )
            # Skip empty transcriptions (silence or noise)
            if not transcript.text:
                continue
            await respond(transcript.text, messages)


if __name__ == '__main__':
    asyncio.run(main())
The problems we had to solve immediately became apparent after the first tests of this demo:
For each of these problems, we could look for either an engineering or a product solution.
Before we even got to code, we had to decide how the user would interact with the booth:
To detect a new user in the booth, we considered several options: door opening sensors, floor weight sensors, distance sensors, and a camera + YOLO model. A distance sensor behind the seat back seemed the most reliable to us: it excluded accidental triggers, such as when the door is not closed tightly, and, unlike the weight sensor, did not require complicated installation.
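To make this concrete, presence detection from a distance sensor needs hysteresis so that a user shifting in the seat does not rapidly toggle the state. Here is a minimal sketch of that idea; the thresholds and class name are illustrative, not what we actually used:

```python
# Hypothetical sketch: turning raw distance readings (in cm) into
# enter/leave events with hysteresis. A reading well below the "enter"
# threshold means someone sat down; presence is only cleared once the
# reading rises above a higher "leave" threshold.

class PresenceDetector:
    def __init__(self, enter_below_cm=60, leave_above_cm=90):
        self.enter_below = enter_below_cm
        self.leave_above = leave_above_cm
        self.present = False

    def update(self, distance_cm):
        """Feed one sensor reading; return 'enter', 'leave', or None."""
        if not self.present and distance_cm < self.enter_below:
            self.present = True
            return "enter"
        if self.present and distance_cm > self.leave_above:
            self.present = False
            return "leave"
        return None

detector = PresenceDetector()
# Readings between the two thresholds (60-90 cm) change nothing,
# which is exactly what filters out fidgeting.
events = [detector.update(d) for d in [120, 50, 70, 85, 100, 110]]
```

The gap between the two thresholds is what makes the detector robust: a single noisy reading cannot flip presence back and forth.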
To avoid the challenge of recognizing the beginning and end of a dialog, we decided to add a big red button to control the microphone. This solution also allowed the user to interrupt the AI at any moment.
We had many different ideas about implementing feedback on processing a request. We decided on an option with a screen that shows what the system is doing: listening to the microphone, processing a question, or answering.
We also considered a rather clever option with an old landline phone. The session would start when the user picked up the phone, and the system would listen to them until they hung up. However, we decided it was more authentic when the user is "answered" by the booth rather than by a voice from the phone.
In the end, the final user flow came out like this:
Arduino monitors the state of the distance sensor and the red button. It sends all changes to our backend via HTTP API, which allows the system to determine whether the user has entered or left the booth and whether it is necessary to activate listening to the microphone or start generating a response.
The web UI is just a web page opened in a browser that continuously receives the system's current state from the backend and displays it to the user.
The backend controls the microphone, interacts with all necessary AI models, and voices the LLM responses. It contains the app's core logic.
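To make the flow above concrete, the backend's core logic can be thought of as a small state machine driven by events from the Arduino (sensor, button) and from the pipeline itself. The state and event names below are our illustration, not the actual implementation:

```python
# Hypothetical sketch of the backend's core state logic. Events arrive
# from the Arduino via the HTTP API ("user_entered", "button_pressed", ...)
# and from the AI pipeline itself ("response_ready", "playback_done").

from enum import Enum, auto

class State(Enum):
    IDLE = auto()        # booth is empty
    READY = auto()       # user inside, waiting for the red button
    LISTENING = auto()   # button held, microphone is recording
    PROCESSING = auto()  # STT + LLM are working on the request
    ANSWERING = auto()   # TTS playback in progress

class Booth:
    TRANSITIONS = {
        ("user_entered", State.IDLE): State.READY,
        ("button_pressed", State.READY): State.LISTENING,
        ("button_pressed", State.ANSWERING): State.LISTENING,  # interrupt the AI
        ("button_released", State.LISTENING): State.PROCESSING,
        ("response_ready", State.PROCESSING): State.ANSWERING,
        ("playback_done", State.ANSWERING): State.READY,
    }

    def __init__(self):
        self.state = State.IDLE

    def on_event(self, event: str) -> State:
        if event == "user_left":
            # Leaving the booth resets everything, whatever was going on
            self.state = State.IDLE
        else:
            # Unknown (event, state) pairs are ignored, keeping the state
            self.state = self.TRANSITIONS.get((event, self.state), self.state)
        return self.state
```

The web UI only has to poll the current `state` value to know which animation to show, and the "button pressed while answering" transition is what lets the user interrupt the AI at any moment.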
How to code a sketch for Arduino, properly connect the distance sensor and the button, and assemble it all in the booth is a topic for a separate article. Let's briefly review what we got without going into technical details.
We used an Arduino-compatible microcontroller, more precisely an ESP32 with a built-in Wi-Fi module. The microcontroller was connected to the same Wi-Fi network as the laptop, which was running the backend.
Complete list of hardware we used:
The main components of the pipeline are Speech-To-Text (STT), LLM, and Text-To-Speech (TTS). For each task, many different models are available both locally and via the cloud.
Since we didn't have a powerful GPU on hand, we decided to opt for cloud-based versions of the models. The weakness of this approach is the need for a good internet connection. Nevertheless, the interaction speed after all optimizations was acceptable, even with the mobile Internet we had at the festival.
Now, let's take a closer look at each component of the pipeline.
Many modern devices have long supported speech recognition. For example, the Apple Speech API is available on iOS and macOS, and the Web Speech API in browsers.
Unfortunately, they are far inferior in quality to Whisper or Deepgram and cannot detect the language automatically.
To reduce processing time, the best option is to recognize speech in real-time as the user speaks. Here are some projects with examples of how to implement them:
With our laptop, the speed of speech recognition using this approach turned out to be far from real-time. After several experiments, we decided on the cloud-based Whisper model from OpenAI.
The result of the Speech To Text model from the previous step is the text we send to the LLM with the dialog history.
When choosing an LLM, we compared GPT-3.5, GPT-4, and Claude. It turned out that the key factor was not so much the specific model as its configuration. Ultimately, we settled on GPT-4, whose answers we liked more than the others.
Customization of the prompt for LLM models has become a separate art form. There are many guides on the Internet on how to tune your model as you need:
We had to experiment extensively with the prompt and temperature settings to make the model respond engagingly, concisely, and humorously.
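To illustrate the kind of configuration we mean (the prompt text and values below are invented for this example, not our actual settings), a persona-style system prompt with a length cap and a moderate temperature goes a long way toward short, playful answers instead of generic assistant boilerplate:

```python
# Illustrative example of a persona prompt plus request parameters.
# Every value here is a hypothetical starting point for experimentation.

SYSTEM_PROMPT = (
    "You are the voice of an AI confessional booth at an art festival. "
    "Answer warmly and with gentle humor, in at most three sentences. "
    "Never lecture; ask one curious follow-up question when it fits."
)

REQUEST_PARAMS = {
    "model": "gpt-4",
    "temperature": 0.7,   # higher -> more playful, lower -> more focused
    "max_tokens": 150,    # a hard cap keeps spoken answers short
}
```

A hard `max_tokens` cap matters more in a voice interface than in chat: every extra token is extra seconds of TTS playback the user has to sit through.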
We voice the response received from the LLM using the Text-To-Speech model and play it back to the user. This step was the primary source of delays in our demo.
LLMs take quite a long time to respond. However, they support generating the response in streaming mode, token by token. We can use this feature to optimize the waiting time by voicing individual phrases as they are received, without waiting for a complete response from the LLM.
We use the time while the user listens to the initial fragment to hide the delay in processing the remaining parts of the response from the LLM. Thanks to this approach, the response delay occurs only at the beginning and is ~3 seconds.
async generateResponse(history) {
    const completion = await this.ai.completion(history);
    const chunks = new DialogChunks();
    for await (const chunk of completion) {
        const delta = chunk.choices[0]?.delta?.content;
        if (delta) {
            chunks.push(delta);
            if (chunks.hasCompleteSentence()) {
                // Voice each sentence as soon as it is complete
                const sentence = chunks.popSentence();
                this.voice.ttsAndPlay(sentence);
            }
        }
    }
    // Voice whatever is left after the stream ends
    const sentence = chunks.popSentence();
    if (sentence) {
        this.voice.ttsAndPlay(sentence);
    }
    return chunks.text;
}
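The DialogChunks helper is not shown above; here is a minimal Python sketch of the same sentence-buffering idea. The punctuation-based sentence detection is deliberately naive, but it is good enough for chunking text for TTS:

```python
# Sketch of the sentence-buffering idea behind a DialogChunks-style
# helper: accumulate streamed tokens and pop a complete sentence as
# soon as terminating punctuation followed by whitespace arrives.

import re

class DialogChunks:
    SENTENCE_END = re.compile(r"[.!?]\s")

    def __init__(self):
        self.text = ""       # everything received so far
        self._buffer = ""    # the not-yet-voiced tail

    def push(self, delta: str):
        self.text += delta
        self._buffer += delta

    def has_complete_sentence(self) -> bool:
        return bool(self.SENTENCE_END.search(self._buffer))

    def pop_sentence(self) -> str:
        # Cut at the first sentence boundary, or take the whole tail
        match = self.SENTENCE_END.search(self._buffer)
        end = match.end() if match else len(self._buffer)
        sentence, self._buffer = self._buffer[:end], self._buffer[end:]
        return sentence.strip()

chunks = DialogChunks()
for token in ["Hel", "lo. How", " are you?"]:
    chunks.push(token)
    # in the real pipeline, a TTS call would fire here
    # whenever has_complete_sentence() returns True
```

A real implementation would also have to handle abbreviations and decimal points, but in practice a slightly wrong split point only costs a brief, barely noticeable pause in the speech.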
Even with all our optimizations, a 3-4 second delay is still noticeable. To spare the user the feeling that the response has hung, we took care of UI feedback. We looked at several approaches:
We settled on the last option with a simple web page that polls the backend and shows animations according to the current state.
Our AI confession room ran for four days and attracted hundreds of attendees. We spent only about $50 on the OpenAI APIs. In return, we received substantial positive feedback and valuable impressions.
This small experiment showed that it is possible to add an intuitive and efficient voice interface to an LLM even with limited resources and challenging external conditions.
By the way, the backend sources are available on GitHub.