Before diving into the details, let me quickly explain what this project is about.
I come from the generation that grew up watching Robotech. Like any kid in those days, I was inspired by animated characters becoming real in my imagination through their artistic expression. More recent inspirations include Kizuna AI, the virtual band Gorillaz, and many others.
For some time now I've been trying to find the shortest, laziest path to creating a virtual YouTuber with technology that's available to everyone, right at this moment.
It seems plausible to have an AI create video content on demand in a matter of minutes, with results good enough to demonstrate what I believe will be perfected by the generations to come.
After trying out the same idea with project AIYA and GPT-2, I couldn't get results coherent enough without constant tweaking and adjusting. Even then, it sounded bad and took a couple of hours of work for every video. That's just not a viable workflow!
I wanted something quick, intuitive, and creativity-boosting: some new, uncharted territory!
But, as always, first tries are doomed to fail. You can see how that first try went in a YouTube video here, and if you don't mind hacky code, you can check the GitHub repo.
Now that GPT-3 is finally here, I got excited after seeing all the demos and examples of what it can do. But with it being accessible only to those whom OpenAI allows (despite the name), I had to look elsewhere.
Luckily, I found GPT-Neo, an open-source replica of GPT-3's architecture, and chose it as the next best text-generator alternative. This is also possible thanks to the good people at Hugging Face, who made the updated transformer model available to the public.
Unlike my older project, which had thousands of lines of code, this one is done in no more than 400 lines of Python.
That being said, here's how a video gets made.
Prepared ahead of time: the character image and a background. The background could be another video, like some gameplay or animated GIFs, but I chose to stick with a static background for now.
I would write a quick prompt or a question on the subject of the video and let GPT finish the rest. I decided to generate up to 600 characters, which is fine for a three-minute video.
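Here's a minimal sketch of that generation step, assuming the Hugging Face Transformers pipeline; the 1.3B checkpoint, prompt, and sampling settings are illustrative guesses, not necessarily what Arty uses:

```python
# Minimal sketch of the text-generation step with GPT-Neo via the
# Transformers pipeline. Model size and sampling settings are assumptions.
from transformers import pipeline

generator = pipeline("text-generation", model="EleutherAI/gpt-neo-1.3B")

prompt = "What would happen if the moon suddenly disappeared?"
out = generator(prompt, max_length=200, do_sample=True, temperature=0.9)

# Keep up to ~600 characters of script for the video.
script = out[0]["generated_text"][:600]
```

Sampling (`do_sample=True`) is what keeps every run different, even from the same prompt.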
After generating the text, I used text-to-speech to convert it to audio and give Arty her voice. With some tweaking of pitch and sample rate, plus randomized patches of silence, she gets a more pleasant and dynamic character voice.
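The post doesn't pin down the TTS engine, so here's a sketch assuming Google Cloud Text-to-Speech (which exposes pitch and sample rate directly) and pydub for stitching in the random pauses; `script` comes from the generation step above:

```python
# Hedged sketch of the voice step: Google Cloud TTS is an assumed engine.
import random
from google.cloud import texttospeech
from pydub import AudioSegment

client = texttospeech.TextToSpeechClient()

def synth(text):
    """Synthesize one chunk of text with a pitched-up character voice."""
    response = client.synthesize_speech(
        input=texttospeech.SynthesisInput(text=text),
        voice=texttospeech.VoiceSelectionParams(language_code="en-US"),
        audio_config=texttospeech.AudioConfig(
            audio_encoding=texttospeech.AudioEncoding.LINEAR16,
            pitch=4.0,                 # semitones up: a lighter character voice
            sample_rate_hertz=24000,
        ),
    )
    with open("_chunk.wav", "wb") as f:
        f.write(response.audio_content)  # LINEAR16 comes back with a WAV header
    return AudioSegment.from_wav("_chunk.wav")

# Stitch sentences together with randomized pauses for a more dynamic read.
voice = AudioSegment.empty()
for sentence in script.split(". "):
    voice += synth(sentence) + AudioSegment.silent(duration=random.randint(150, 600))
voice.export("arty_voice.wav", format="wav")
```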
Here Google Cloud Speech kicks in. The audio file I just created is sent to Google Cloud to get a transcription with timed captions for each word. Hint: for audio longer than one minute, Google asks you to upload the file to GC Storage first and pass the bucket URI.
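A sketch of that call with the google-cloud-speech client; the bucket name is a placeholder, and I'm assuming a recent client version where word timestamps come back as timedeltas:

```python
# Transcription with per-word timestamps. Bucket name is a placeholder.
from google.cloud import speech

client = speech.SpeechClient()
config = speech.RecognitionConfig(
    encoding=speech.RecognitionConfig.AudioEncoding.LINEAR16,
    sample_rate_hertz=24000,
    language_code="en-US",
    enable_word_time_offsets=True,   # timed captions for each word
)
# Audio over one minute has to come from a GCS bucket.
audio = speech.RecognitionAudio(uri="gs://my-arty-bucket/arty_voice.wav")

operation = client.long_running_recognize(config=config, audio=audio)
response = operation.result(timeout=300)

words = []  # (word, start time in seconds) pairs
for result in response.results:
    for w in result.alternatives[0].words:
        words.append((w.word, w.start_time.total_seconds()))
```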
Some random noise is always added to the process simply because transcription is itself probabilistic: it returns different confidence levels and a slightly different transcription every time. There are a lot of bloopers and funny misheard words that make it almost charming. It's a feature, not a bug! Or so I choose it to be!
After getting the transcript, it's time to extract some nouns and verbs from the text, with exact timings of when those words appear in the audio. Why? You will see.
Here I used another amazing open-source NLP tool called spaCy to detect and extract phrases from the generated text.
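A minimal sketch of that filtering step, reusing the `(word, start)` pairs from the transcription sketch above:

```python
# Keep only nouns, proper nouns, and verbs, each with its audio timestamp.
import spacy

nlp = spacy.load("en_core_web_sm")

search_terms = []  # (phrase, start time) pairs to feed the image search
for word, start in words:            # `words` from the transcription step
    token = nlp(word)[0]
    if token.pos_ in ("NOUN", "PROPN", "VERB"):
        search_terms.append((word, start))
```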
Now, with a nice list of (sometimes misheard) phrases and their timings, let's search for and download images based on those terms. We also sample and randomize every time, so each search produces different images.
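The post doesn't name the image source, so as a stand-in here's a sketch using the bing-image-downloader package, picking one random hit per phrase:

```python
# Hypothetical image-fetching step: the original image source is not
# specified, bing-image-downloader is just one easy stand-in.
import glob
import random
from bing_image_downloader import downloader

timed_images = []  # (image path, start time) pairs for the video edit
for phrase, start in search_terms:
    downloader.download(phrase, limit=5, output_dir="images", timeout=30)
    candidates = glob.glob(f"images/{phrase}/*")
    if candidates:
        timed_images.append((random.choice(candidates), start))
```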
These images are then mixed with the audio file using a lengthy FFmpeg command and the timed captions, overlaying the correct meme at the exact moment it's uttered in the video. In the end, you get a nice, understandable video with audio of the character talking and almost human-like meme edits, randomly chosen from the internet. Voilà!
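The real filter chain is generated per video and gets long quickly; here's a heavily simplified sketch of its shape, with illustrative filenames and timings:

```python
# Simplified shape of the final FFmpeg assembly: loop a static background,
# pop each image in at its word's timestamp, and mux in the voice track.
import subprocess

cmd = [
    "ffmpeg", "-y",
    "-loop", "1", "-i", "background.png",   # static background, looped
    "-i", "meme1.jpg", "-i", "meme2.jpg",   # images picked by the search step
    "-i", "arty_voice.wav",
    "-filter_complex",
    # Each overlay is enabled only between its word's start and end times.
    "[0][1]overlay=enable='between(t,2.1,5.0)':x=60:y=60[v1];"
    "[v1][2]overlay=enable='between(t,7.4,10.2)':x=60:y=60[v]",
    "-map", "[v]", "-map", "3:a",
    "-shortest",                             # end the video with the audio
    "arty_video.mp4",
]
subprocess.run(cmd, check=True)
```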
Be careful, it may catch you by surprise!
Just like misheard lyrics are quirky, if you chuckle at some of the memes, you may be tricked into thinking Arty is the real deal, at least for a split second. These imperfections are the secret sauce that gives her videos their humour.
Can it drag you into its story and then make you laugh? You decide!
The whole procedure of creating a video from a prompt is automated and takes about 10-15 minutes on Google Colab with a good GPU or TPU (recommended). It takes an image and some starting sentences... and you get a YouTube-ready video!
As a final twist and a feature, you can provide your own voice and a short text and get a brand-new, original video in a matter of minutes, completely replacing the AI part with your own content.
The next step will be to get the subjects of Arty’s videos chosen by the people of Twitter!
Anyone can experiment with these technologies and create their own videos, add a unique editing style, tell a better story, or just perfect the currently available toolset for even better, more creative results.
Follow Arty on YouTube for the most recent videos!
Follow Arty on Twitter!