So long, point-and-click. Generative AI is upending traditional UX and all kinds of new and exciting design patterns are emerging. The proliferation of LLMs has transformed the once-annoying voice-based interface into something that’s truly engaging and a bar-raiser for user connection to your app. You don’t need an LLM for everything, but I do highly recommend them for voice assistants.
But even with great AI, audio is still hard. When building audio recording functionality on the frontend, you’ll quickly run into a few common challenges: capturing audio from the user, processing it for server-side use, and integrating that with a backend like GraphQL. In this post I’ll walk you through a simple approach to recording and sending audio for transcription using a combination of JavaScript's MediaRecorder API and OpenAI's Whisper model.
The method I’m showing here is not ideal for production-scale applications, but it is the quickest and most effective way to prototype audio features for your next MVP.
There are two key objects to understand when recording audio from a browser:
A MediaStream represents a stream of media content coming from a device like a microphone or webcam, or from a remote source over the network. It supports all kinds of media content in addition to plain audio, and is used to get real-time access to the underlying media.
A MediaRecorder is responsible for actually capturing data from a MediaStream. A MediaStream on its own is good enough if all you need to do is transmit or display the media as it arrives, but in order to record media for later use (like sending it off to Whisper) you need to buffer the stream and save it to a file, which is what MediaRecorder does.
Understanding these two objects and how they relate to each other is half the battle.
With our knowledge of the browser media APIs, we’re now ready to start writing some code. The first step is to get hold of a MediaStream that’s connected to a device that can record audio:
// Obtain a reference to a `MediaStream` which supports
// audio capture. Calling this method will cause the
// browser to ask for the user's consent for your app
// to record audio.
let mediaStream: MediaStream;
try {
  mediaStream = await navigator.mediaDevices.getUserMedia({
    audio: true,
  });
} catch (error) {
  // If the user declines to give consent, an exception is thrown.
  alert('Cannot record without permission!');
  return;
}
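The catch block above treats every failure as a refused permission. As a rough sketch, the `name` of the thrown error can be used to tell a few common cases apart. `describeMicrophoneError` is a hypothetical helper I'm introducing for illustration, and the messages are placeholders; the error names are the ones browsers commonly report for `getUserMedia` failures.

```typescript
// Hypothetical helper (not part of any browser API): turn a getUserMedia
// failure into a user-facing message based on the error's `name`.
function describeMicrophoneError(error: unknown): string {
  const name =
    typeof error === 'object' && error !== null && 'name' in error
      ? String((error as { name: unknown }).name)
      : '';

  switch (name) {
    case 'NotAllowedError':
      // The user (or a browser policy) refused access to the microphone.
      return 'Cannot record without permission!';
    case 'NotFoundError':
      // No audio input device is available.
      return 'No microphone was found on this device.';
    case 'NotReadableError':
      // The device exists but could not be opened.
      return 'The microphone is in use or could not be read.';
    default:
      return 'Could not start recording.';
  }
}
```

In the catch block, `alert(describeMicrophoneError(error))` would replace the hardcoded message.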
With a MediaStream in hand, it’s now possible to construct a MediaRecorder that will buffer the incoming audio content. Audio codecs are a bit tricky here as OpenAI are picky about the formats their API will accept. The MediaRecorder class has a static method called isTypeSupported which tells me whether a given MIME type is supported by my user’s browser. Modern browsers should all support WebM which is a pretty good format, but it’s possible to check a few different MIME types and settle on the first one that’s supported by the user’s browser:
// Find a MIME type supported by the user's browser
const mimeType = ['audio/webm', 'audio/mpeg', 'audio/wav'].find((it) =>
  MediaRecorder.isTypeSupported(it),
);
if (!mimeType) {
  alert('Your browser does not support a suitable recording format.');
  return;
}

// Create a `MediaRecorder` that's attached to the `MediaStream` we just obtained.
// Record in the codec we just found.
const recorder = new MediaRecorder(mediaStream, {
  mimeType,
});
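If you want to unit test the format selection without a browser, one option is to pull it into a small function that takes the support check as a parameter. This is only a sketch: `pickMimeType` and `CANDIDATE_MIME_TYPES` are names I made up, and the inclusion of `audio/mp4` in the list is my own assumption about what a browser without WebM recording might offer instead.

```typescript
// Hypothetical helper: return the first candidate the predicate accepts,
// or undefined if none are supported.
function pickMimeType(
  candidates: readonly string[],
  isSupported: (mimeType: string) => boolean,
): string | undefined {
  return candidates.find((candidate) => isSupported(candidate));
}

// Assumed preference order. 'audio/mp4' is an addition of mine, not part
// of the original list.
const CANDIDATE_MIME_TYPES: readonly string[] = [
  'audio/webm',
  'audio/mp4',
  'audio/mpeg',
  'audio/wav',
];
```

In the browser this would be called as `pickMimeType(CANDIDATE_MIME_TYPES, (type) => MediaRecorder.isTypeSupported(type))`.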
The MediaRecorder won’t do anything until it’s told to start recording, and even once it is recording we won’t get anything out of it until we attach an event handler. Like many other browser APIs, MediaRecorder is event-based. Specifically, it will emit an event called dataavailable whenever it finishes recording a chunk of audio.
The audio chunk is sent to the event handler as a Blob instance (a bunch of arbitrary binary data), which cannot be natively sent to a GraphQL backend. Fortunately, it’s possible to convert this data into a more usable format like a data URL.
Data URLs are just strings, which means they can easily be sent to a GraphQL backend as an input argument. This is great for prototyping, but bear in mind they’re not the most appropriate solution for a production use case. Converting the binary blob into a data URL expands its size by about 33% which increases the amount of data transferred quite significantly.
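The 33% figure falls out of how base64 works: every 3 bytes of input become 4 output characters, with the final group padded. A quick sketch of the arithmetic (`base64Length` is just an illustrative name):

```typescript
// Number of base64 characters needed to encode `byteLength` bytes,
// including `=` padding: 4 characters for every (possibly partial) 3 bytes.
function base64Length(byteLength: number): number {
  return 4 * Math.ceil(byteLength / 3);
}

// A 300,000 byte recording becomes 400,000 characters of base64,
// before counting the short "data:...;base64," prefix.
const encodedLength = base64Length(300000); // 400000
```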
The general format of a data URL looks like this:
data:[<mediatype>][;base64],<data>
And here is an excerpt of a webm audio clip I recorded while writing this post:
data:audio/webm;codecs=opus;base64,GkXfo59Ch...
It’s important to note that data URLs cannot be directly passed to OpenAI for transcription. After the server receives the data URL, it’s necessary to parse it back to a binary format that OpenAI’s API does recognize. More on that later.
Here’s the last bit of frontend code needed:
// This function takes a blob of audio data and turns it to a data URL
function dataUrlFromBlob(
  blob: Blob,
): Promise<string> {
  return new Promise((resolve, reject) => {
    const reader = new FileReader();
    reader.onabort = reject;
    reader.onerror = reject;
    reader.onload = () =>
      resolve(reader.result as string);
    reader.readAsDataURL(blob);
  });
}

// Add an event handler to process audio chunks as they come in
recorder.ondataavailable = async (event) => {
  // We can't send raw binary to our GraphQL server,
  // so we encode the audio as a data URL.
  const dataUrl = await dataUrlFromBlob(event.data);
  await sendAudioToGraphQL({
    input: { dataUrl },
  });
};

// Start recording!
recorder.start();
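Nothing in the snippet above ever stops the recording. A minimal sketch of the other half might look like the following. `stopRecording` is a hypothetical helper, and the two interfaces only describe the parts of MediaRecorder and MediaStream that it touches, which keeps it easy to test with fakes.

```typescript
// Minimal shapes this sketch relies on. Real MediaRecorder and
// MediaStream objects satisfy them.
interface StoppableRecorder {
  readonly state: string;
  stop(): void;
}

interface TrackSource {
  getTracks(): { stop(): void }[];
}

// Hypothetical helper: finish the recording and release the microphone.
function stopRecording(recorder: StoppableRecorder, stream: TrackSource): void {
  // Only stop a recorder that is actually recording or paused.
  if (recorder.state !== 'inactive') {
    // Stopping the recorder delivers any remaining audio to the
    // dataavailable handler.
    recorder.stop();
  }
  // Stopping every track releases the microphone, so the browser's
  // recording indicator goes away.
  for (const track of stream.getTracks()) {
    track.stop();
  }
}
```

Wiring it to a button is then a matter of calling `stopRecording(recorder, mediaStream)` in the click handler.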
The MediaRecorder’s start method also takes an optional timeslice parameter. If you don’t provide it (as in the code above), the MediaRecorder will keep buffering audio until its stop method is called, and then the event handler will receive the entire batch of audio in one go. This isn’t always desirable, so timeslice can be used to tell the MediaRecorder to record in smaller chunks and send them to the dataavailable handler as they come in. A timeslice of 2000 means the event handler will receive an audio chunk for roughly every 2 seconds of recording, which is great if you want to show the user a transcript of what they’re saying in real time. Bear in mind that individual chunks are not guaranteed to be playable on their own, so later chunks may need to be combined with the earlier ones before they can be transcribed.
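One caveat with timeslice: the MediaRecorder spec only promises that the chunks are playable when combined in order, not individually. A simple (if wasteful) way to cope is to accumulate chunks and re-send everything recorded so far. Here is a sketch, with `ChunkBuffer` being a hypothetical class of my own rather than anything from the browser:

```typescript
// Hypothetical helper: collects recorded chunks and joins them back into
// a single Blob that starts from the beginning of the recording.
class ChunkBuffer {
  private chunks: Blob[] = [];

  push(chunk: Blob): void {
    // Skip empty chunks, which carry no audio.
    if (chunk.size > 0) {
      this.chunks.push(chunk);
    }
  }

  toBlob(mimeType: string): Blob {
    // Blob concatenates its parts in order.
    return new Blob(this.chunks, { type: mimeType });
  }
}

// Intended wiring in the browser (not executed here):
//
//   const buffer = new ChunkBuffer();
//   recorder.ondataavailable = async (event) => {
//     buffer.push(event.data);
//     const soFar = buffer.toBlob(recorder.mimeType);
//     const dataUrl = await dataUrlFromBlob(soFar);
//     await sendAudioToGraphQL({ input: { dataUrl } });
//   };
//   recorder.start(2000);
```

The trade-off is that every request re-uploads the whole recording so far, which is fine for short prototype clips but gets expensive quickly for long ones.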
Back on the server, the incoming data URL has to be parsed. The most robust way to do that is, of course, to pull in an off-the-shelf library such as data-urls. This isn’t as fun or efficient as rolling your own, however, so we’re going to write our own. Data URLs in general can come in a variety of different formats, but for our purposes they only come in one simple format that’s easy enough to handle.
Rolling our own keeps a lot of code out of the final application. The data-urls library is ~260.1 kB after minification, which is a decent amount of additional heft in a serverless context. Bundle size matters a lot for cold start time.
In fact, if your use case is purely to transcribe the audio and you don’t care about retaining the data for future use, all you really need is the stuff after the base64, separator. OpenAI’s audio APIs are smart enough to automatically infer the content type of any audio data you send, so it’s theoretically possible to completely ignore the media type portion of the data URL.
In practice you generally do want to store a copy of the raw data somewhere, which requires knowledge of the content type. Unlike OpenAI, object stores like S3 are generally not intelligent enough to infer the MIME type of your uploaded objects.
The code for parsing data URLs looks like this:
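import { toFile } from 'openai';
import type { FileLike } from 'openai/uploads';

async function parseAudioDataUrl(
  dataUrl: string,
): Promise<FileLike> {
  // Split the "data:<mediatype>;base64" header from the payload
  // at the first comma.
  const separatorIndex = dataUrl.indexOf(',');
  const header = dataUrl.slice(0, separatorIndex);
  const base64String = dataUrl.slice(separatorIndex + 1);
  // Strip the "data:" prefix and ";base64" suffix, leaving the
  // media type, e.g. "audio/webm;codecs=opus".
  const mimeType = header.slice('data:'.length).replace(/;base64$/, '');
  const buffer = Buffer.from(base64String, 'base64');
  const file = await toFile(buffer, 'audio.webm', {
    type: mimeType,
  });
  return file;
}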
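If you also want the media type separated from its parameters (for example, to store `audio/webm` as the content type without the `codecs=opus` part), the header can be split a little further. This is a sketch that only handles the base64 flavour of data URL described above; `splitBase64DataUrl` and `ParsedDataUrl` are hypothetical names, and it deliberately stops short of decoding so it has no dependencies.

```typescript
interface ParsedDataUrl {
  mimeType: string; // e.g. "audio/webm"
  parameters: string[]; // e.g. ["codecs=opus"]
  base64: string; // the still-encoded payload
}

// Hypothetical helper: split "data:<type>[;param...];base64,<data>" into parts.
function splitBase64DataUrl(dataUrl: string): ParsedDataUrl {
  const commaIndex = dataUrl.indexOf(',');
  if (!dataUrl.startsWith('data:') || commaIndex === -1) {
    throw new Error('Not a data URL');
  }

  // Everything between "data:" and the first comma.
  const header = dataUrl.slice('data:'.length, commaIndex);
  const parts = header
    .split(';')
    .map((part) => part.trim())
    .filter((part) => part !== '');

  // The last header segment must be the "base64" marker.
  if (parts.pop() !== 'base64') {
    throw new Error('Only base64 data URLs are supported');
  }

  const [mimeType = '', ...parameters] = parts;
  if (mimeType === '') {
    throw new Error('Data URL has no media type');
  }

  return { mimeType, parameters, base64: dataUrl.slice(commaIndex + 1) };
}
```

On the server, the `base64` field can then be handed to `Buffer.from(base64, 'base64')` exactly as before.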
Note that I’ve passed a hardcoded audio.webm value as the name of my file. Mapping the mimeType to the correct file extension is possible, but it’s unnecessary because OpenAI will infer the type of the file anyway. It’s even possible to skip passing the type option; I’m just doing it so that later on I can access file.type when uploading the audio to S3. The only requirement here is that you parse out the base64 string correctly and load it into a Buffer, and that the name you pass to toFile ends with an extension that the OpenAI client is amenable to.
Calling toFile with a file name that’s missing an extension, or has an extension the client doesn’t like, will yield the following error message:
Invalid file format. Supported formats: ['flac', 'm4a', 'mp3', 'mp4', 'mpeg', 'mpga', 'oga', 'ogg', 'wav', 'webm']
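If you would rather not hardcode the file name, a small lookup from MIME type to extension is enough. This is a sketch under my own assumptions: `fileNameForMimeType` and the table below are hypothetical, and the table only covers a handful of common audio types whose extensions appear in the supported list above.

```typescript
// Assumed mapping from common audio MIME types to extensions that appear
// in the supported formats list. Extend as needed.
const EXTENSION_BY_MIME_TYPE: Record<string, string> = {
  'audio/webm': 'webm',
  'audio/mpeg': 'mp3',
  'audio/mp4': 'm4a',
  'audio/wav': 'wav',
  'audio/ogg': 'ogg',
  'audio/flac': 'flac',
};

// Hypothetical helper: pick a file name whose extension matches the MIME
// type, falling back to the hardcoded name when the type is unknown.
function fileNameForMimeType(mimeType: string, fallback = 'audio.webm'): string {
  // Drop parameters such as ";codecs=opus" and normalise the case.
  const essence = (mimeType.split(';')[0] ?? '').trim().toLowerCase();
  const extension = EXTENSION_BY_MIME_TYPE[essence];
  return extension ? `audio.${extension}` : fallback;
}
```

The result could be passed as the second argument to toFile in place of the hardcoded name.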
With this helper function in hand, it’s now possible to parse the incoming data URL and transcribe it. Here’s an example Pothos resolver which does exactly that:
builder.mutationField('audioFragmentSubmit', (t) =>
  t.field({
    type: AudioFragment,
    nullable: false,
    args: {
      input: t.arg({
        type: AudioFragmentSubmitInput,
        required: true,
      }),
    },
    resolve: async (_, args, ctx) => {
      // Turn the data URL back into a file the OpenAI client accepts.
      const file = await parseAudioDataUrl(args.input.dataUrl);
      // Named `transcription` to avoid shadowing the builder's `t`.
      const transcription = await ctx.openai.audio.transcriptions.create({
        file,
        language: 'en',
        model: 'whisper-1',
      });
      return {
        // ...
        transcript: transcription.text,
      };
    },
  }),
);
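For completeness, here is roughly what the frontend's sendAudioToGraphQL could look like when talking to this resolver over plain HTTP. Treat it as a sketch with several assumptions of mine: the `/graphql` endpoint, the `createAudioSender` factory, the `FetchLike` type, and the GraphQL input type being named `AudioFragmentSubmitInput` in the schema. A real app would more likely use whatever GraphQL client it already has.

```typescript
// The subset of fetch this sketch needs, so a fake can be injected in tests.
type FetchLike = (
  url: string,
  init: { method: string; headers: Record<string, string>; body: string },
) => Promise<{ ok: boolean; status: number; json(): Promise<unknown> }>;

// Assumed operation; field and type names mirror the resolver above.
const AUDIO_FRAGMENT_SUBMIT = `
  mutation AudioFragmentSubmit($input: AudioFragmentSubmitInput!) {
    audioFragmentSubmit(input: $input) {
      transcript
    }
  }
`;

// Hypothetical factory returning a function with the same call shape as
// the sendAudioToGraphQL used in the frontend code.
function createAudioSender(fetchImpl: FetchLike, endpoint = '/graphql') {
  return async function sendAudioToGraphQL(variables: {
    input: { dataUrl: string };
  }): Promise<string> {
    const response = await fetchImpl(endpoint, {
      method: 'POST',
      headers: { 'Content-Type': 'application/json' },
      body: JSON.stringify({ query: AUDIO_FRAGMENT_SUBMIT, variables }),
    });
    if (!response.ok) {
      throw new Error(`GraphQL request failed with status ${response.status}`);
    }

    const payload = (await response.json()) as {
      data?: { audioFragmentSubmit?: { transcript?: string } };
      errors?: { message: string }[];
    };
    const errors = payload.errors ?? [];
    if (errors.length > 0) {
      throw new Error(errors.map((error) => error.message).join('; '));
    }

    const transcript = payload.data?.audioFragmentSubmit?.transcript;
    if (typeof transcript !== 'string') {
      throw new Error('Response did not include a transcript');
    }
    return transcript;
  };
}
```

In the browser, `const sendAudioToGraphQL = createAudioSender((url, init) => fetch(url, init));` would produce the function used in the dataavailable handler.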
This solution is not appropriate for production environments, but it’ll get you off the ground without needing to spin up any additional infrastructure or deal with any external resources. Uploading audio straight to object storage from the frontend is a much more scalable solution once you’ve shipped and won some customers.
But for simple MVPs, it doesn’t get much easier than this.