I am an author, futurist, systems architect, public speaker and pro blogger.
(My "Learning AI If You Suck at Math" Series returns after a three year hiatus! When last I wrote the field was still developing and I felt I'd covered the state of that art completely. All that's changed. We've had a number of major breakthroughs in the last few years, like GANs, reinforcement learning and Transformers, and it's time to return to the keyboard to see what's changed. If you missed the earlier articles, be sure to check them out: 1, 2, 3, 4, 5, 6, 7.)
Can AI make beautiful music?
In part eight of the Learning AI If You Suck at Math series, my team and I dig deep to find out if neural nets can compose Ambient music with the great masters of the art.
Ambient is a soft, flowing, ethereal genre that I’ve loved for decades. There are all kinds of ambient, from white noise, to tracks that mimic the murmur of soft summer rain in a sprawling forest. But I favor ambient that weaves together environmental sounds and dreamy, wavelike melodies into a single, lush tapestry.
"Cuando el Sol Grita la Mañana," by Leandro Fresco, featured on the 'Pop Ambient 2013' album, exemplifies the genre for me. It’s almost instantaneously calming, filled with fluttering bells in the distance of a windy landscape of sound.
Can machine learning ever hope to craft something so seemingly simple yet intricate?
The answer is that it’s getting closer and closer with each passing year. It won’t be long before artists are co-composing with AI, using software that helps them weave their own masterpieces of sound.
Here are three of the absolute best sample songs that our machine learning model created:
In this article, we'll look at how we did it.
Along the way we’ll listen to some more awesome samples. Of course, some samples came out really well and some didn’t work as well as we hoped, but overall the project worked beautifully.
I’ll also give you the trained model to play around with yourself!
Finally, I’ll also show you an end-to-end machine learning pipeline, with downloadable containers that you can string together with ease to train a masterful music-making machine learning model of your very own.
So let’s dive in and make some beautiful music together. But first let’s take a whirlwind tour of the history of music and technology to see how we got here.
Tech and music have always shared a deeply intertwined personal history. In many ways, without technology, there's no such thing as music.
Of course, humans don’t need technology to create harmonious vibrations that delight our ears. We’ve always had our voices to sing sweet songs of joy and sorrow. But beyond the human voice, it’s our use of tools and machines that has really given us the gift of “mousike, the art of the Muses,” as the ancient Greeks called it.
Instruments themselves are nothing but highly tuned technology. Ancient peoples stretched dried animal skins over barrels and made drums. They hollowed out reeds to make flutes. They drew fine strings over carved wooden frames to craft harps and mandolins.
After tens of thousands of years of analog instruments, the modern world seized on the rise of electricity to create a new class of instruments. In 1919, Russian inventor Leon Theremin gave us the eerie and delightful Theremin, which people play by waving their hands in the electrostatic field around two antennas, never touching the instrument.
You can hear the strange and otherworldly instrument in movies like The Day the Earth Stood Still (40 seconds into the clip).
The 1960s gave us synthesizers that used analog circuitry to create wonderful new sounds no one had ever heard. The original ones were as big as a room, but eventually an incredible team of engineers and musicians, Bob Moog and Herbert Deutsch, managed to miniaturize the synthesizer in the 1970s. That led to the iconic keyboard synth that dominated 1980s pop music videos with bands like Duran Duran and Talking Heads.
But the 1980s version of the synthesizer was a very different beast altogether. It came from another room-sized technology that got miniaturized after the Second World War: the computer.
1980s synthesizers were purely digital machines, rather than clever combinations of electronic circuitry like their 1960s and 70s counterparts.
Today, we can synthesize the synthesizer.
While each analog synth was a masterwork of electrical engineering, we can duplicate any sound in a computer. Whether you’re an amateur or pro musician, you can load up Logic Pro or Cubase or Ableton Live and have an entire multi-million dollar recording studio right in your lap.
Artificial Intelligence is only the latest technological tool to make its mark on the music world.
But for many years the dream of intelligent machines making music that mirrors the complexity of the human heart and spirit seemed painfully far away.
(Bach courtesy of Deep Dream Generator)
Even as machine learning advanced rapidly on multiple fronts like speech and image recognition it lagged in the making of beautiful melodies.
But has that changed in the last few years? Have we managed to advance the state of the art in artificial music design?
The good news is we’ve come a very long way in a very short time.
If I’d written part eight of this series when I wrote the original Learning AI If You Suck at Math pieces in 2017, the answer would have been a clear no. But only a few years later, we’re much closer to the dream of the artist in the machine because of a series of algorithmic breakthroughs that have changed the game.
There’s still further to go but it’s not hard to imagine a future where we easily co-compose songs with our intelligent software.
Douglas Eck leads Google’s Magenta project, the project my team and I turned to for help on our quest. Eck and his team don’t want to replace artists. They want to augment them.
In the awesome book The Artist in the Machine, author Arthur Miller tells us that “fully autonomous creative machines aren’t on Eck’s radar. He doesn’t want to step back and ‘watch a machine create art.’”
Instead he wants to give artists brand new tools to create something amazing.
Think human in the loop (H-I-L). Think jamming with the AI. Imagine a constant and ever evolving creative feedback loop with the machine.
In the next decade, you’ll download a library of a dozen pre-trained music models and quickly generate variations on the guitar riff or drum solo you dreamed up last night. You’ll play a few notes you’re working on, feed them to the machine and watch it fire off 50 continuations of that riff. You’ll flip through them, listening quickly and then suddenly closely, captivated by the unexpected rhythm of one of the variations.
“That’s what I’m looking for,” you’ll think. So you’ll tell the AI to iterate on those notes. Now you have 50 more variations and you dig into them.
The seventh one makes you stop dead in your tracks, mesmerized by it.
It inspires you to finish the melody in a brand new way. You play it all in a whirlwind, the notes pouring out of you.
We’re still a little ways off from that but we can see the early seeds of it right now with the power of machine learning.
Like any good machine learning story it all starts with the data.
I didn’t choose ambient music randomly. I chose it because of a deep love for the genre.
Over the last fifteen years, I’ve crafted a highly curated playlist of ambient music that you can follow on Spotify that I used as the base dataset to train our model.
The playlist has survived over time, as the way we listened to music changed around me. It started on CDs and downloaded mp3s and it moved to iTunes, before finding its current home on Spotify. I expand the list very, very slowly. Only songs that don’t disrupt the subtle streaming tapestry of the entire theme make the cut. Jarring and discordant notes get swiftly culled.
I listen to this playlist every day.
To help me write.
I’m listening to it right now as I write this article. Ambient music is utterly fantastic at helping you move into an alpha wave state as swiftly as possible. That’s the meditative, creative state of supreme concentration that comes when you get totally lost in an activity like meditation or dancing or writing. The playlist helps calm me down and focus me for the deep work of writing. After so many years, it acts almost like a Pavlovian trigger now to shift me into that wonderful state of concentration that we call Flow.
Flow is when everything just “flows” and you get lost in the moment, totally absorbed in what you’re doing. Time seems to disappear. The past and the future are gone. Your mind drops away. You don’t have to think about the next step and the next, you just know what to do with effortless ease.
There are lots of different kinds of ambient music. Not all of them are great at inducing Flow. Some of it is more like dance music or Berlin techno. Some ambient is highly experimental, pushing the bounds of what a song is supposed to be with no melody to speak of, and some of it just mimics the natural world with looped recordings of rain or bells or blowing wind.
But to me, ambient music is ethereal and otherworldly, able to put the mind at ease quickly.
There’s also a second, more subtle reason that I chose this kind of music.
What the model generates doesn’t have to be completely perfect.
If one or two notes are out of place, they blend into the overall flow of the song. That’s a big difference from something like piano or drum music, which many machine learning music projects have focused on. If a drum beat is off or a note is out of place in a piano piece, it sticks out like a rusty nail from the whole.
In other words, the hazy softness of ambient music gave us a little more leeway to screw it up just a bit but still have something that sounds wonderful overall.
But now that I had my choice of music and my dataset I needed to figure out what algorithms would deliver on the promise of the playlist. To do that I surveyed the field, reading lots of articles and papers to find out what worked and what didn’t work.
I narrowed the field to two potential approaches, Wavenet and the Music Transformer, before eventually settling on one winner.
Both of them have their strengths and weaknesses and to understand why we just need to dig into them a little bit deeper.
Originally, my gut told me that Wavenet was the way to go so that’s where I started my research with my team.
Wavenets are convolutional neural networks that come to us from Google’s DeepMind and they’ve achieved state of the art results in text to speech synthesis. They’re behind the magic of the Google Assistant, which can call and make dinner reservations for you. These networks have the distinct advantage of looking at the raw waveform rather than at a symbolic representation of music, like sheet music notes or MIDI.
A waveform is a representation of an electronic signal depicted as a time series graph.
(Source: DeepMind blog post introducing Wavenets)
While a digital waveform is still a symbolic representation of that signal, there is a lot of data captured in it. You can think of each waveform as a unique digital signature for sounds. Since we can encode any sound electronically and represent that signal as a waveform, we can capture all the nuances of music and speech in one beautiful representation.
To be fair, there is some information that gets lost when we digitally record a signal. The wave is sampled. That means we’re recording tiny little chunks of the sound at very fast intervals, not the entire sound.
When CDs first came out there were a lot of debates about the “sample rate,” aka how fast and often we recorded those chunks to capture the “true” sound. Analog recordings, like vinyl records, capture the entire sound wave and were considered “warmer” to the trained ear. But we’ve largely solved that problem with super high sample rates, such that the signal loss is imperceptible to the human ear.
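To make sampling concrete, here is a minimal Python sketch that samples one second of a pure tone at two different rates. The 440 Hz tone and the function name sample_sine are illustrative assumptions and not part of our pipeline. The 16 kHz rate is the one quoted for Wavenet-style audio later in this article, and 44.1 kHz is the CD rate.

```python
import math

def sample_sine(freq_hz, sample_rate_hz, duration_s):
    """Return evenly spaced amplitude samples of a pure sine tone."""
    n_samples = int(sample_rate_hz * duration_s)
    return [math.sin(2 * math.pi * freq_hz * n / sample_rate_hz)
            for n in range(n_samples)]

# One second of a 440 Hz tone (an arbitrary choice) at two sample rates.
low_rate = sample_sine(440, 16_000, 1.0)
cd_rate = sample_sine(440, 44_100, 1.0)

# Each list entry is one "tiny little chunk" of the sound.
print(len(low_rate), len(cd_rate))  # 16000 44100
```

The higher the sample rate, the more numbers it takes to describe each second of sound, which is exactly what makes raw audio so expensive for a neural net to model.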
Contrast that with a different symbolic representation, such as written dialogue in a screenplay. You can read the dialogue but the words don’t capture any nuance of how an actor would say those words. None of the actual sounds live in that symbolic representation. An actor might shout the words or say them slowly and deeply or with a high pitched voice. Each actor will sound totally different and none of that is captured in the screenplay.
Wavenet is an “autoregressive model,” which means the model can work with time series data, using observations about the previous time step to predict the next step. If you look at all the little dots on the waveform in the DeepMind graphic each of those represents a time step where the next sample is taken. I can try to predict the next step by looking at the last one and the current one.
But the problem is that it’s not really enough to look back only one step to figure out what the next step is in a sequence. As they say, two points on a graph is not a trend and three points is only the beginning of the trend. The next step after the first two might simply change direction abruptly.
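Here is a toy sketch of what “autoregressive” means in code. It is not Wavenet: the predictor below is a hand-picked two-tap formula (an assumption chosen because it happens to be exact for a pure sine wave), standing in for the trained network. What matters is the loop, where every predicted sample is appended and becomes input for the next prediction.

```python
import math

def generate(seed, n_steps, predict_next):
    """Autoregressive loop: predict a sample from the history so far,
    append it, and use the longer history for the next prediction."""
    samples = list(seed)
    for _ in range(n_steps):
        samples.append(predict_next(samples))
    return samples

# Angular step per sample for a 440 Hz tone at 16 kHz (illustrative values).
omega = 2 * math.pi * 440 / 16_000

def two_tap_predictor(history):
    """Toy stand-in for a trained model: looks back only two steps."""
    return 2 * math.cos(omega) * history[-1] - history[-2]

seed = [math.sin(0 * omega), math.sin(1 * omega)]
wave = generate(seed, 98, two_tap_predictor)
print(len(wave))  # 100
```

Two steps of history are enough here only because a sine wave is about the simplest signal there is. Real music needs a predictor that looks much further back.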
Wavenet tries to model longer term structure using convolutions. Convolutions have proved incredible for image recognition problems and they’re covered in Part 5 of the Learning AI If You Suck at Math series. They capture local, close together patterns very well. When a convolutional neural net processes the pixels in an image, it includes “convolutional” layers that pass a small box over the image, looking for small patterns.
When those convolutions are stacked on top of each other, they build up into more complex representations. So one convolutional layer might capture dark next to light pixels and that’s an edge. Another layer might capture blobs of the same color. Put all those layers together and the system can start to recognize higher order objects like noses or lips and then finally it can “see” higher order objects like people and cats!
Convolutions are fantastic at capturing little clusters of data that are close together. That’s why they handle visual data so beautifully. Visual data’s essential patterns tend to show up in clumps.
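As a small illustration of that sliding box, the sketch below runs a tiny hand-written kernel over a made-up image whose left half is dark (0) and right half is light (1). The image, the kernel values and the function name convolve2d are all assumptions for the example. In a real network the kernel values are learned rather than written by hand.

```python
def convolve2d(image, kernel):
    """Slide `kernel` over `image` (no padding) and, at each position,
    sum the elementwise products of the kernel and the patch under it."""
    kh, kw = len(kernel), len(kernel[0])
    out_h = len(image) - kh + 1
    out_w = len(image[0]) - kw + 1
    return [[sum(image[r + i][c + j] * kernel[i][j]
                 for i in range(kh) for j in range(kw))
             for c in range(out_w)]
            for r in range(out_h)]

# Four identical rows: three dark pixels followed by three light pixels.
image = [[0, 0, 0, 1, 1, 1] for _ in range(4)]

# A 1x2 kernel that fires when a dark pixel sits left of a light one.
edge_kernel = [[-1, 1]]

response = convolve2d(image, edge_kernel)
for row in response:
    print(row)  # [0, 0, 1, 0, 0] for every row
```

The output is zero everywhere except the column where dark meets light, so this one little kernel has found the vertical edge.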
That’s made Wavenets great at capturing short term patterns in sound and it’s why they work so well in text to speech synthesis. They can capture the beginning and end of words and how we string them together, making the synthesized words flow together more naturally rather than sounding jarring and disconnected like earlier speech engines.
Notice how much more natural Wavenets make synthetic speech sound.
Here’s a sample of an artificial voice from a previous best in class approach called concatenative synthesis:
Now here’s a sample generated by Wavenet.
That’s a big leap forward for text to speech.
Unfortunately, music doesn’t work that way.
Music has both short and long term patterns that play out over time. A short term pattern might be a few seconds of a chorus, but a longer term pattern is how often that chorus is repeated or how the song sounds as a whole.
Teams that tried to use Wavenets to capture music realized that it doesn’t capture anything close to a coherent long term musical structure.
In fact, they sound practically insane the longer they go on.
They capture sequences of five or ten seconds just fine but when you try to generate longer term sequences they break down horribly. Longer songs are all over the place, as if someone jammed together a dozen songs every ten seconds.
The music above kind of makes sense for short bursts but mostly it’s an incoherent mess.
Theoretically we could fix the problem by capturing longer term structures in each layer of our neural net. If we could capture a pattern across a second of music in a lower convolutional layer and then feed that knowledge up to a higher order network we could capture the long term musical structure.
Unfortunately, Wavenets are already a beast to train, even with top-of-the-line GPUs or TPUs, taking many minutes to model even a single second of audio. Capturing bigger convolution patterns makes the memory and processing power grow linearly.
As Sander Dieleman notes in his incredible and comprehensive survey of the state of the art in musical generation in 2020, “In 10 years, the hardware we would need to train a WaveNet with a receptive field of 30 seconds (or almost half a million timesteps at 16 kHz) may just fit in a desktop computer, so we could just wait until then to give it a try. But if we want to train such models today, we need a different strategy."
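The numbers in that quote are easy to check. The sketch below is an illustration rather than code from Dieleman’s survey. It counts how many timesteps a stack of convolution layers can “see” at once. The kernel size of 2 and the doubling dilation schedule are assumptions borrowed from the published Wavenet design, simplified here to a single run of layers.

```python
def receptive_field(kernel_size, dilations):
    """Timesteps visible to one output of a stack of convolution layers,
    where `dilations` lists the spacing between kernel taps per layer."""
    return 1 + sum((kernel_size - 1) * d for d in dilations)

# 30 seconds of audio at 16 kHz, as in the quote above.
target = 30 * 16_000
print(target)  # 480000

# Plain stacking (dilation 1 everywhere) adds one timestep per layer.
plain_layers = target - 1
print(receptive_field(2, [1] * plain_layers))  # 480000

# Doubling the dilation at every layer widens the view much faster.
dilations = []
while receptive_field(2, dilations) < target:
    dilations.append(2 ** len(dilations))
print(len(dilations), receptive_field(2, dilations))  # 19 524288
```

Seeing 30 seconds is only part of the cost. Every one of those layers still has to be computed for every one of those timesteps during training, and that is where the pain comes from.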
Perhaps the most revolutionary architecture of the last few years is the Transformer, which surprised me with its versatility in Natural Language Processing. It’s behind mega-models like GPT-2 and GPT-3, which OpenAI just released as a commercial API product to build amazing new next-gen apps.
If you read my last article in the Learning AI If You Suck at Math series: The Magic of NLP, you’ll remember that I didn’t find NLP all that magical. That’s because the dominant method of dealing with time series and natural language data in 2017 was the LSTM, a recurrent neural network.
I wasn’t impressed because the LSTM just isn’t very good at long term memory. LSTMs are recurrent neural nets (RNNs). They push forward like snow plows, remembering a little bit about the few time steps behind them, but forgetting long term structure and knowing nothing about the steps ahead. Bidirectional LSTMs do exist and they give models a better sense of context in each direction, but all RNNs tend to have a terrible recency bias and forget most deeper connections when they try to understand patterns.
I tried to generate good titles for my latest book and found NLP text generation an interesting toy that wasn’t all that much better than random title generators so I confidently declared my job as a writer safe from the AI revolution.
Then came the Transformer architecture just a few months later.
Researchers at Google delivered the Transformer in the landmark paper called Attention is All You Need and a blog post breaking it all down for us. The novel architecture was designed to solve neural machine translation tasks for Google Translate. It dumped convolutions and recurrence altogether and delivered a network that was highly parallelizable and much easier to train.
The architecture uses “attention” to give neural networks a much deeper long term memory.
What the heck is “attention" and how does that help us generate cool music?
As Chris Nicholson writes in his excellent introduction to attention:
"Attention takes two sentences, turns them into a matrix where the words of one sentence form the columns, and the words of another sentence form the rows, and then it makes matches, identifying relevant context.”
Check out the graphic from the Attention is All You Need paper below. It’s two sentences, in different languages (French and English), translated by a professional human translator. The attention mechanism can generate a heat map, showing what French words the model focused on to generate the translated English words in the output.
(Source: Attention is All You Need)
But the amazing thing about attention mechanisms is that you don’t need to lay out two different sentences. You can lay out the same sentence, which we call “self attention." With self-attention we can turn transformers back on themselves to learn about the important words in the same sentence, rather than the difference between two translations of a sentence.
Jay Alammar explains self attention beautifully in his post on Transformers and he gives us an excellent visual to understand how relating the words in a sentence to each other makes a big difference in understanding that sentence.
In a Transformer, each word is encoded with information about the relevance and relation of every other word in the sentence in a key value pair that the model can query. Take the graphic below, which focuses on just one word, “it,” and how it relates to every other word. Does “it” refer to “the animal” or “the road”? Humans have little trouble figuring that out but natural language models struggled with it for decades. But here we can see that the darkest orange colors on the heat map surround “the animal” because, according to the model, those are the words that “it” most likely refers to.
(Source: Jay Alammar’s The Illustrated Transformer)
The math a model uses to decide relevance and importance is complex and there are a number of ways to do it, covered in multiple papers that detail attention mechanisms. But all you need to know is that each word gets encoded with how important it is to every other word. That means Transformers don’t just know about the few words that came before, like RNNs, while knowing nothing about the words to come. Transformers know about every word in the sentence, forwards and backwards. That means they excel at both small and larger clusters of information and how they relate to each other.
And that’s the real power of the Transformer. It's much better at developing a long term memory about what it’s learned.
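If you’d like to peek at the math anyway, here is a minimal sketch of the scaled dot-product self attention described in Attention is All You Need. It is heavily simplified: the two-number word vectors are made up, and the vectors are used directly as queries, keys and values, whereas a real Transformer first passes them through learned projections. The vector for “it” is deliberately placed close to “animal” to mimic what a trained model would learn.

```python
import math

def softmax(xs):
    """Turn raw scores into positive weights that sum to 1."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def self_attention(vectors):
    """Scaled dot-product self attention with queries = keys = values.
    Returns the attention weights and the blended output vectors."""
    d = len(vectors[0])
    weights = []
    for q in vectors:
        scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(d)
                  for k in vectors]
        weights.append(softmax(scores))
    outputs = [[sum(w * v[i] for w, v in zip(row, vectors)) for i in range(d)]
               for row in weights]
    return weights, outputs

# Made-up embeddings for illustration only.
words = ["animal", "road", "it"]
vectors = [[1.0, 0.0], [0.0, 1.0], [0.9, 0.1]]

weights, outputs = self_attention(vectors)
for word, w in zip(words, weights[2]):
    print(f"'it' attends to '{word}' with weight {w:.2f}")
```

Each row of weights is one row of the heat map: how strongly one word looks at every word in the sentence, including itself. With these toy vectors, “it” puts more weight on “animal” than on “road.”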
Since one of the biggest problems in music is long term structure, the good folks at the Magenta project realized attention mechanisms might work well on music if they made a few modifications. Specifically, they used “relative attention” to create the Music Transformer, which they explain beautifully on their blog:
"While the original Transformer allows us to capture self-reference through attention, it relies on absolute timing signals and thus has a hard time keeping track of regularity that is based on relative distances, event orderings, and periodicity. We found that by using relative attention, which explicitly modulates attention based on how far apart two tokens are, the model is able to focus more on relational features. Relative self-attention also allows the model to generalize beyond the length of the training examples, which is not possible with the original Transformer model.”
By modeling the relationships of notes, Music Transformer does much better at capturing the long term coherence of a single song and how the notes all relate to each other. The Magenta team visualizes how some of these notes connect to each other in their paper on their algorithm for relative self attention. Just like the word “it” you can see how one note is connected by lines to many other notes in the song. Each note is encoded with information about its relevance to other notes.
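To give a feel for what “modulating attention based on how far apart two tokens are” can look like, here is a naive sketch. It is not Magenta’s implementation, which uses a memory-efficient trick to compute the same kind of term. It simply adds a distance-dependent score to each attention logit. The note vectors and the relative embeddings are invented so that the model favors whatever happened four steps back, like a pattern repeating every four notes.

```python
import math

def softmax(xs):
    """Turn raw scores into positive weights that sum to 1."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def relative_attention_weights(queries, keys, rel_emb):
    """Causal attention weights where each logit gets an extra term that
    depends only on the distance i - j between the two positions.
    rel_emb[dist] is the (hypothetical) embedding for 'dist steps back'."""
    d = len(queries[0])
    rows = []
    for i, q in enumerate(queries):
        # Causal: position i may only look at positions j <= i.
        logits = [(dot(q, keys[j]) + dot(q, rel_emb[i - j])) / math.sqrt(d)
                  for j in range(i + 1)]
        rows.append(softmax(logits) + [0.0] * (len(queries) - i - 1))
    return rows

# Six identical toy "notes", so content alone cannot tell them apart.
notes = [[1.0, 0.0] for _ in range(6)]

# Invented relative embeddings: only "four steps back" gets a boost.
rel_emb = [[2.0, 0.0] if dist == 4 else [0.0, 0.0] for dist in range(6)]

weights = relative_attention_weights(notes, notes, rel_emb)
print(weights[5].index(max(weights[5])))  # 1
```

The last note (position 5) attends most strongly to position 1, exactly four steps earlier, even though every note looks identical. That is the kind of regularity that relative attention can pick up on.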
The more I read on Music Transformer, the more I realized I had one of the most cutting edge approaches to making beautiful music.
Pachyderm makes it supremely simple to link together a bunch of loosely coupled frameworks into a smoothly scaling AI generating machine. The platform can track data lineage, do data/code/model version control and easily stack lots of experimental and cutting edge packages together like beautiful beads on a string. If you can package up your program in a Docker container you can easily run it in Pachyderm.
I created a complete walkthrough that you can check out right here on Github. It allows you to recreate the entire pipeline and train the model yourself. If you want to train your own models you can jump right to the section called Training Your Own Music Generator in Pachyderm.
It’s time to make some music!
If you don’t want to recreate the entire pipeline, then I’ve made it super easy for you to play with our trained model and generate your own music anyway.
I’ve shared several fully trained models you can download right here and try out yourself without doing any training at all. There are two Ambient music models and two Berlin Techno models. I also included a Docker container with the fully trained Ambient model and seed files so you can start making music fast.
Make sure you have Docker Desktop installed and running, then pull down the Ambient Music Transformer container to get started.
In just a few minutes you’ll be able to generate your own songs with ease.
Once you’ve generated a few songs, there’s one last step. That’s where we bring in our human-in-the-loop creativity.
We play our MIDI through various software instruments to see how it sounds. The software instruments are what bring our song to life.
Different instruments create very different songs. If you play your MIDI through a drum machine or a piano it will sound like random garbage because ambient music is more irregular than a drum or piano concert.
But pick the right ambient instrument and you might just have musical magic at your fingertips.
Of course, you could automate this step but you’ll need to find a rich collection of open source software instruments. They’re out there, but if you have a Mac you already have a rich collection of software instruments in Apple’s Logic Pro. That felt like the best place to start so I could try lots of iterations fast. If you don’t own Logic Pro you can install a 60 day trial version from the Mac Store that is fully featured and not crippleware.
If you don’t want to use Logic Pro, there’s lots of amazing music creation software to choose from, like Ableton Live and Cubase. You can also use GarageBand, which is free. If you’re a musical magician then go wild and unleash your favorite software collection on those AI generated songs.
But if you’re using Logic Pro like me, then you just import the MIDI and change out the instrument in the software track.
After days of experimentation I found a few amazing software instruments that consistently delivered fantastic songs when the Music Transformer managed to spit out a great sample.
Here are some of my favorite samples:
Several software instruments give the very same MIDI a strong, sci-fi vibe that feels otherworldly, as if I was flying through space or dropped down into a 1980s sci-fi synthwave blockbuster:
Here are a few great sci-fi examples:
Music Transformer does a terrific job depending on what you feed it. Some seeds really make the Music Transformer sing.
But not all seeds are created equal. It really struggles sometimes.
No software instrument can save a terrible output no matter how hard you try.
The model doesn’t always perform well. Sometimes it generates weird samples that don’t sound great no matter what software instruments we run them through.
Here’s a bizarre, nine minute song that got generated from Leandro Fresco with bits of total silence and the same long, mournful notes throughout.
Here’s another sample that just sounds like 18 seconds of oscillating fans:
We strongly suspect that if we had a big corpus of MIDI files that came directly from the musical artists themselves we’d have an even stronger model. That would skip the imperfect transcription step and deliver something much closer to their original visions.
If we had also tried fine tuning the original transcription model away from just piano, we could probably have gotten better ambient transcriptions, leading to better generative models.
But overall we love the results that Music Transformer delivered.
Considering how many of the decisions of this project and approach were based on the very human intelligence traits of gut instinct and intuition, it worked out much better than expected.
It could have all gone horribly wrong.
The internet is littered with AI music creation gone wrong.
Music and AI share a brilliant future together. A lot of folks worry that savvy producers will simply pump out pop hits via a machine and cut out artists altogether. That will happen in some cases but I don’t see it as the norm. People love to connect with singers.
What’s a great pop hit without a Boy Band for a brand new generation of screaming teenage girls to fall in love with once more?
If anything, AI will create an explosion of new kinds of creative jobs. We’ll see artists co-composing with their AIs. We’ll see 3D animators creating virtual boy bands and engineers staging holographic concerts. We’ll see AI and DJs creating music on the fly, based on moods and the wild gyrations of the late night dance crowd.
In short, AI won’t destroy music. Far from it. It’ll usher in an amazing new era of music.
As for the AI models and algorithms, they’ll only continue to get better. We’ll see attention models built to work on raw waveforms. But we’ll also see an explosion of new processing power that will make the older algorithms work with incredible precision.
I sometimes wonder if we’ll solve the problem of music generation with sheer brute force. In other words, maybe we don’t need better algorithms, we just need more processing horsepower.
Take something like real time ray tracing in video games and movies. That’s where we trace the path of light as it bounces around an environment. We’d figured out the math to do it back in the days of Amiga in the 1980s. But we couldn’t do it in real time until recently because we just didn’t have the chips to pull it off. In the last few years, real time ray tracing has taken off with a vengeance and it instantly makes games and movies look incredible without a lot more work from developers.
Check out this incredible Star Wars demo with ray tracing shining bright:
AI algorithms will get a similar boost as more and more companies and governments stamp out new silicon to power the next-gen of AI apps and tools.
All of our AI models stand on the shoulders of giants. The ideas that underpin deep learning date back to the 1940s. We had back propagation in the 1960s. Convolutional networks came to us in the 1970s from Kunihiko Fukushima who created a system called Neocognitron that could recognize digits.
But deep learning never really took off until AlexNet used GPUs to chew through ImageNet. We just didn’t have the power.
Today, Wavenet is a monster to train, even on state of the art GPUs or TPUs. The results are realistic because it’s working directly with the digitally encoded audio sample, so it can pick up on voices and multiple instruments. But tomorrow training Wavenet will get easier and easier as we hurl more cores at it.
And when that happens we’ll see a brand new generation of musicians weaving soundscapes we can only just begin to imagine now.
Artists have always used the latest tools and technology to advance the arts. Art itself is infinitely malleable and adaptable to the artists of the age.
AI stands to give them the greatest tool music has ever seen.
I can’t wait to see what tomorrow's master musicians create with their intelligent machines.
Dan Jeffries is Chief Technology Evangelist at Pachyderm. He’s also an author, engineer, futurist, pro blogger and he’s given talks all over the world on AI and cryptographic platforms. He’s spent more than two decades in IT as a consultant and at open source pioneer Red Hat.
With more than 50K followers on Medium, his articles have held the number one writer's spot on Medium for Artificial Intelligence, Bitcoin, Cryptocurrency and Economics more than 25 times. His breakout AI tutorial series “Learning AI If You Suck at Math,” along with his explosive pieces on cryptocurrency, “Why Everyone Missed the Most Important Invention of the Last 500 Years” and “Why Everyone Missed the Most Mind-Blowing Feature of Cryptocurrency,” are shared hundreds of times daily all over social media and have been read by more than 5 million people worldwide.
If you love my work please visit my Patreon page because that’s where I share special insights with all my fans.
Top Patrons get EXCLUSIVE ACCESS to so many things:
Early links to every article, podcast and private talk. You read and hear it first before anyone else! A monthly virtual meet up and Q&A with me. Ask me anything and I’ll answer. I also share everything I’m working on and give you a behind the scenes look at my process.