Voice recognition is a complex problem that shows up across a number of industries. Knowing the basics of handling audio data and classifying sound samples is a good thing to have in your data science toolbox. We're going to go through an example of classifying some sound clips using TensorFlow. By the time you get through this, you'll know enough to build your own voice recognition models. With additional research, you can take these concepts and apply them to larger, more complex audio files.

You can find the full code in this GitHub repo.

Getting the Data

Gathering data is one of the hard problems in data science. There's so much data available, but not all of it is easy to use in machine learning problems. You have to make sure that the data is clean, labeled, and complete. For our example, we're going to use some audio files released by Google.

The snippets in this post assume the following imports at the top of pipeline.py:

    import os
    import pathlib

    import conducto as co
    import numpy as np
    import tensorflow as tf
    from tensorflow.keras import layers, models
    # Location of the preprocessing layers in TF 2.4-era releases; newer
    # releases expose Normalization and Resizing directly in tf.keras.layers
    from tensorflow.keras.layers.experimental import preprocessing

First, we'll create a new Conducto pipeline. This is where you'll be able to build, train, and test your model and share a link with anybody else interested:

    ###
    # Main Pipeline
    ###
    def main() -> co.Serial:
        path = "/conducto/data/pipeline"
        root = co.Serial(image=get_image())

        # Get data from keras for testing and training
        root["Get Data"] = co.Exec(run_whole_thing, f"{path}/raw")

        return root

Then we'll start writing the run_whole_thing function:

    def run_whole_thing(out_dir):
        os.makedirs(out_dir, exist_ok=True)

        # Set seed for experiment reproducibility
        seed = 55
        tf.random.set_seed(seed)
        np.random.seed(seed)

        data_dir = pathlib.Path("data/mini_speech_commands")

Next, still inside run_whole_thing, we need to set up the directory to hold the audio files:

        if not data_dir.exists():
            # Get the files from external source and put them in an accessible directory
            tf.keras.utils.get_file(
                'mini_speech_commands.zip',
                origin="http://storage.googleapis.com/download.tensorflow.org/data/mini_speech_commands.zip",
                extract=True,
                # cache_dir and cache_subdir put the extracted files under ./data
                # so that they end up where data_dir expects them
                cache_dir='.',
                cache_subdir='data')

Pre-Processing the Data

Now that we have our data in the right directory, we can split it into training,
test, and validation datasets. First, we need to write a few functions to help pre-process the data so that it'll work in our model. We need the data in a format our algorithm can understand. We'll be using a convolutional neural network, so the data needs to be transformed into images.

This first function will convert the binary audio file into a tensor:

    # Convert the binary audio file to a tensor
    def decode_audio(audio_binary):
        audio, _ = tf.audio.decode_wav(audio_binary)

        return tf.squeeze(audio, axis=-1)

Now that we have a tensor holding the raw audio data, we need labels to match it. The clips are stored in folders named after the spoken command, so the following function gets the label for an audio file from the second-to-last part of its file path:

    # Get the label (yes, no, up, down, etc) for an audio file.
    def get_label(file_path):
        parts = tf.strings.split(file_path, os.path.sep)

        return parts[-2]

Next, we need to associate the audio files with the correct labels. We do this by returning a tuple that TensorFlow can work with:

    # Create a tuple that has the labeled audio files
    def get_waveform_and_label(file_path):
        label = get_label(file_path)
        audio_binary = tf.io.read_file(file_path)
        waveform = decode_audio(audio_binary)

        return waveform, label

We briefly mentioned using the convolutional neural network (CNN) algorithm earlier. This is one of the ways we can build a voice recognition model like this one. Typically CNNs work really well on image data and help decrease pre-processing time. We're going to take advantage of that by converting our audio files into spectrograms.

A spectrogram is an image that shows how the spectrum of frequencies in a signal changes over time. A raw audio file is just a sequence of amplitude samples, but those samples can be broken down into the frequencies they contain.
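To make the idea of a spectrogram concrete, here is a minimal pure-Python sketch: slice the waveform into overlapping frames and take the magnitude of each frame's discrete Fourier transform. The toy sine wave, the function names, and the frame sizes are all made up for illustration, and unlike TensorFlow's tf.signal.stft this version applies no window function.

```python
import cmath
import math


def magnitude_spectrum(frame):
    """Return |DFT| for the non-negative frequency bins of one frame."""
    n = len(frame)
    return [
        abs(sum(x * cmath.exp(-2j * math.pi * k * t / n)
                for t, x in enumerate(frame)))
        for k in range(n // 2 + 1)
    ]


def toy_spectrogram(samples, frame_length, frame_step):
    """Slide a window over the samples and take each frame's magnitude spectrum."""
    starts = range(0, len(samples) - frame_length + 1, frame_step)
    return [magnitude_spectrum(samples[s:s + frame_length]) for s in starts]


# Hypothetical toy signal: 64 samples of a sine wave with 4 cycles per 16 samples
samples = [math.sin(2 * math.pi * 4 * t / 16) for t in range(64)]
spec = toy_spectrogram(samples, frame_length=16, frame_step=8)

num_frames = len(spec)      # rows of the "image": time steps
num_bins = len(spec[0])     # columns of the "image": frequency bins
# Index of the strongest frequency bin in every frame
peak_bins = [row.index(max(row)) for row in spec]
```

Each row of spec is one time step and each column is one frequency bin, which is the grid a CNN can treat as an image. For this pure tone, the strongest bin is the same in every frame.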
So we're going to write a function that converts our audio data into images:

    # Convert audio files to images
    def get_spectrogram(waveform):
        # Padding for files with less than 16000 samples
        zero_padding = tf.zeros([16000] - tf.shape(waveform), dtype=tf.float32)

        # Concatenate audio with padding so that all audio clips will be of the same length
        waveform = tf.cast(waveform, tf.float32)
        equal_length = tf.concat([waveform, zero_padding], 0)
        spectrogram = tf.signal.stft(
            equal_length, frame_length=255, frame_step=128)
        spectrogram = tf.abs(spectrogram)

        return spectrogram

Now that we have formatted our data as images, we need to apply the correct labels to those images. This is similar to what we did for the original audio files:

    # Label the images created from the audio files and return a tuple
    def get_spectrogram_and_label_id(audio, label):
        spectrogram = get_spectrogram(audio)
        spectrogram = tf.expand_dims(spectrogram, -1)
        # `commands` is the array of label names built later in run_whole_thing;
        # it has to be visible in this function's scope
        label_id = tf.argmax(label == commands)

        return spectrogram, label_id

The last helper function we need is the one that will handle all of the above operations for any set of audio files we pass it:

    # Preprocess any audio files
    def preprocess_dataset(files, autotune, commands):
        # Creates the dataset
        files_ds = tf.data.Dataset.from_tensor_slices(files)

        # Matches audio files with correct labels
        output_ds = files_ds.map(get_waveform_and_label,
                                 num_parallel_calls=autotune)

        # Matches audio file images to the correct labels
        output_ds = output_ds.map(
            get_spectrogram_and_label_id, num_parallel_calls=autotune)

        return output_ds

Now that we have all of these helper functions, we get to split the data.

Splitting the Data into Datasets

Converting audio files to images makes the data easier to process with a CNN, and that's why we wrote all of those helper functions. We'll do a couple of things to make splitting the data simpler.
First, we'll get a list of all of the potential commands for the audio files that we'll use in a few other places in the code:

        # Get all of the commands for the audio files
        commands = np.array(tf.io.gfile.listdir(str(data_dir)))
        commands = commands[commands != 'README.md']

Then we'll get a list of all of the files in the data directory and shuffle them so that each of the datasets we need gets a random sample of files:

        # Get a list of all the files in the directory
        filenames = tf.io.gfile.glob(str(data_dir) + '/*/*')

        # Shuffle the file names so that random bunches can be used as the training, testing, and validation sets
        filenames = tf.random.shuffle(filenames)

        # Create the list of files for training data
        train_files = filenames[:6400]

        # Create the list of files for validation data
        validation_files = filenames[6400: 6400 + 800]

        # Create the list of files for test data
        test_files = filenames[-800:]

Now we have our training, validation, and test files clearly separated, so we can go ahead and pre-process these files to get them ready to build and test our model.
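As a quick sanity check on the slicing, here is a sketch in plain Python. It assumes the dataset holds exactly 8,000 clips, and the file names are hypothetical stand-ins for the real paths.

```python
import random

# Hypothetical stand-ins for the real file paths; 8,000 is an assumed dataset size
filenames = [f"clip_{i:04d}.wav" for i in range(8000)]

random.seed(55)  # same idea as seeding TensorFlow: a repeatable shuffle
random.shuffle(filenames)

train_files = filenames[:6400]
validation_files = filenames[6400:6400 + 800]
test_files = filenames[-800:]

# Collect any file that appears in more than one of the three sets
overlap = (
    set(train_files) & set(validation_files)
    | set(train_files) & set(test_files)
    | set(validation_files) & set(test_files)
)
```

With exactly 8,000 files the three slices cover everything and never overlap. If the directory held more files, the ones between index 7,200 and the last 800 would be left unused; with fewer, the validation and test sets would start to share files.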
We're using tf.data.AUTOTUNE here so that tf.data tunes values such as the number of parallel calls and the prefetch buffer size dynamically at runtime:

        autotune = tf.data.AUTOTUNE

This first example is just to show how the pre-processing works, and it gives us the spectrogram_ds value that we'll need in a bit:

        # Get the converted audio files for training the model
        files_ds = tf.data.Dataset.from_tensor_slices(train_files)
        waveform_ds = files_ds.map(
            get_waveform_and_label, num_parallel_calls=autotune)
        spectrogram_ds = waveform_ds.map(
            get_spectrogram_and_label_id, num_parallel_calls=autotune)

Since you've seen what it's like to go through the pre-processing steps, we can go ahead and use the helper function to handle this for all of the datasets:

        # Preprocess the training, test, and validation datasets
        train_ds = preprocess_dataset(train_files, autotune, commands)
        validation_ds = preprocess_dataset(
            validation_files, autotune, commands)
        test_ds = preprocess_dataset(test_files, autotune, commands)

We want to control how many training examples the model sees in each step of an epoch, so we'll set a batch size:

        # Batch datasets for training and validation
        batch_size = 64
        train_ds = train_ds.batch(batch_size)
        validation_ds = validation_ds.batch(batch_size)

Lastly, we can reduce the amount of latency in training our model by taking advantage of caching and prefetching:

        # Reduce latency while training
        train_ds = train_ds.cache().prefetch(autotune)
        validation_ds = validation_ds.cache().prefetch(autotune)

Our datasets are finally in a form that we can train the model with.

Building the Model

Since our datasets are clearly defined, we can go ahead and build the model. We'll be using a CNN to create our model, so we'll need to get the shape of the data to get the correct shape for our layers.
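The shape can also be worked out by hand. The sketch below assumes the defaults of tf.signal.stft (no padding at the end, FFT size rounded up to the next power of two) together with the 16,000-sample clips and the frame_length=255, frame_step=128 settings from get_spectrogram. stft_output_shape is a hypothetical helper name, not a TensorFlow function.

```python
def stft_output_shape(num_samples, frame_length, frame_step):
    """Shape of an STFT without end padding and with a power-of-two FFT size."""
    # Number of full frames that fit when sliding by frame_step
    num_frames = 1 + (num_samples - frame_length) // frame_step

    # Assumed default: FFT size is the smallest power of two >= frame_length
    fft_length = 1
    while fft_length < frame_length:
        fft_length *= 2

    # A real-valued FFT keeps only the non-negative frequency bins
    num_bins = fft_length // 2 + 1
    return num_frames, num_bins


frames, bins = stft_output_shape(16000, 255, 128)
# The trailing 1 is the channel axis added by tf.expand_dims
input_shape = (frames, bins, 1)
```

Under those assumptions input_shape comes out as (124, 129, 1): 124 time frames, 129 frequency bins, and one channel. Reading the shape from the dataset itself is still the safer choice, since it keeps working if the pre-processing parameters change.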
Then we go ahead and build the model sequentially:

        # Build model
        for spectrogram, _ in spectrogram_ds.take(1):
            input_shape = spectrogram.shape

        num_labels = len(commands)

        norm_layer = preprocessing.Normalization()
        norm_layer.adapt(spectrogram_ds.map(lambda x, _: x))

        model = models.Sequential([
            layers.Input(shape=input_shape),
            preprocessing.Resizing(32, 32),
            norm_layer,
            layers.Conv2D(32, 3, activation='relu'),
            layers.Conv2D(64, 3, activation='relu'),
            layers.MaxPooling2D(),
            layers.Dropout(0.25),
            layers.Flatten(),
            layers.Dense(128, activation='relu'),
            layers.Dropout(0.5),
            layers.Dense(num_labels),
        ])

        model.summary()

Next we compile the model with the Adam optimizer, a sparse categorical cross-entropy loss, and accuracy as the metric. The last Dense layer has no activation, so the model outputs raw logits, which is why the loss is created with from_logits=True:

        # Configure built model with losses and metrics
        model.compile(
            optimizer=tf.keras.optimizers.Adam(),
            loss=tf.keras.losses.SparseCategoricalCrossentropy(
                from_logits=True),
            metrics=['accuracy'],
        )

The model is built, so now all that's left is training it.

Training the Model

After all of the work we did pre-processing the data and building the model, training is relatively simple. We set the maximum number of epochs to run with our training and validation datasets, and the EarlyStopping callback ends training sooner if the validation loss stops improving for two epochs in a row:

        # Finally train the model and return info about each epoch
        EPOCHS = 10
        model.fit(
            train_ds,
            validation_data=validation_ds,
            epochs=EPOCHS,
            callbacks=tf.keras.callbacks.EarlyStopping(verbose=1, patience=2),
        )

That's it! The model has been trained and now we just need to test it.

Testing the Model

Now that we have a model with roughly 83% accuracy, it's time to test how well it performs on new data.
So we take our test dataset and split the audio files from the labels:

        # Test the model
        test_audio = []
        test_labels = []

        for audio, label in test_ds:
            test_audio.append(audio.numpy())
            test_labels.append(label.numpy())

        test_audio = np.array(test_audio)
        test_labels = np.array(test_labels)

Then we take the audio data and use it in our model to see if it predicts the correct label:

        # See how accurate the model is when making predictions on the test dataset
        y_pred = np.argmax(model.predict(test_audio), axis=1)
        y_true = test_labels

        test_acc = sum(y_pred == y_true) / len(y_true)

        print(f'Test set accuracy: {test_acc:.0%}')

Finishing the Pipeline

There's just a tiny bit of code that you'll need to finish your pipeline and make it shareable with anyone. This defines the image that will be used in this Conducto pipeline and handles the file execution:

    ###
    # Pipeline Helper functions
    ###
    def get_image():
        return co.Image(
            "python:3.8-slim",
            copy_dir=".",
            reqs_py=["conducto", "tensorflow", "keras"],
        )


    if __name__ == "__main__":
        co.main(default=main)

Now you can run python pipeline.py --local in your terminal and it should spin up a link to a new Conducto pipeline. If you don't have an account, you can make one for free here.

Conclusion

This is one of the ways you can solve an audio processing problem, but it can be much more complex depending on what data you're trying to analyze. Building it in a pipeline makes it easy to share with coworkers and get help or feedback if you run into bugs.

Previously published here.