Stable Diffusion created a storm generating award-winning images, created a version of me in an alternate universe, and recently got into a whirlwind of PR/HR issues. And now the new kids on the block, language models, seem to do everything from predicting stock prices to becoming the CEO of a brand new startup. ChatGPT is all the hype now and there's no shortage of creative applications for generative models.
Despite all the advancements in AI, we still apply them with caution, especially when creating education/information related products. That fateful morning, my wife repeatedly played YouTube flashcard videos for infant/toddler right-brain development. And she asked, "ChatGPT can do everything, why don't they make flashcard videos?". I replied, "Challenge accepted". Hence, this article on the rabbit hole of trying to automatically create flashcard videos using a couple of generative AI models.
And she asked, "ChatGPT can do everything, why don't they make flashcard videos?"
Flashcard videos are a series of flashcards shown to infants/toddlers in quick succession, usually less than 1 second per card. The flashcards in each session/video are particular to a specific category or topic, e.g. showing a series of fruits ['apples', 'apricot', 'banana', 'blueberry', 'coconut'].
The general idea is to automatically generate a video of flashcards given a specified list, thus it'll require:

- generating a sound file for each word (.mp3 or .wav)
- generating an image for each word
- combining the images and sounds into the final video (.mp4), e.g. with the ffmpeg CLI command
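In code, the plan looks roughly like the sketch below (just a sketch: create_sound and create_image are helper functions we'll write later in this article, and the word list is hard-coded for now):

words = ['apples', 'apricot', 'banana', 'blueberry', 'coconut']
for i, word in enumerate(words):
    create_sound(word, f"{i}-{word}.mp3")   # text-to-speech -> sound file
    create_image(word, f"{i}-{word}.jpg")   # text-to-image -> flashcard image
    # ... then ffmpeg combines each image + sound pair into a clip,
    # and finally concatenates all the clips into the flashcard video.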
Alright, alright. Let's change the specification a little: instead of generating a flashcard video given a list of vocabulary words, let's "prompt" ChatGPT to give us a list of words given a specific topic.
from gpt4all import GPT4All
# I know it's not the actual "ChatGPT" from OpenAI...
# But it's still easier to just personify the LLM as "ChatGPT".
gptj = GPT4All("ggml-gpt4all-j-v1.3-groovy")
messages = [{"role": "user", "content": "Name 10 fruits"}]
gptj.chat_completion(messages)
[out]:
{'model': 'ggml-gpt4all-j-v1.3-groovy',
 'usage': {'prompt_tokens': 240,
           'completion_tokens': 304,
           'total_tokens': 544},
 'choices': [{'message': {'role': 'assistant',
                          'content':
                              '1. Apple, Banana, Orange (all fruits)'
                              '2. Pear, Cherry, Grape (all fruits)'
                              '3. Kiwi, Mango (fruits)'
                              '4. Pineapple, Strawberry (fruits)'
                              '5. Watermelon, Lemon (fruits)'
                              '6. Kiwi, Mango (fruits)'
                              '7. Apple, Banana, Orange (all fruits)'
                              '8. Pineapple, Strawberry (fruits)'
                              '9. Watermelon, Lemon (fruits)'
                              '10. Kiwi, Mango (fruits)'}}]}
Either I don’t know how to prompt, or ChatGPT doesn’t know how to count -_-
messages = [{"role": "user", "content": "List down 10 fruit words and only 10."}]
gptj.chat_completion(messages)
[out]:
{'model': 'ggml-gpt4all-j-v1.3-groovy',
 'usage': {'prompt_tokens': 263,
           'completion_tokens': 157,
           'total_tokens': 420},
 'choices': [{'message': {'role': 'assistant',
                          'content': ' Sure, here are 10 fruit words that I can think of:'
                                     '1. Apple '
                                     '2. Banana '
                                     '3. Orange '
                                     '4. Pear '
                                     '5. Cherry '
                                     '6. Strawberry '
                                     '7. Kiwi '
                                     '8. Grapes '
                                     '9. Pineapple '
                                     '10. Watermelon'}}]}
Now the list looks right, but I can’t put it into a Python list easily. If I spend time editing the list into runnable code, I might as well type the words out myself @_@
prompt = "List down 10 fruit words, separate the list by commas without numbers"
messages = [{"role": "user", "content": prompt}]
gptj.chat_completion(messages)
[out]:
' apple, banana, orange'
What the %^&*?! I can seriously do a better job with https://www.google.com/search?q=list+10+fruits
(╯_╰) |||
messages = [{"role": "user", "content":
"Can you count seriously? I'm telling you to list down 10 fruits. "
'List 10 fruit words, separate by commas.'}]
results = gptj.chat_completion(messages)
[out]:
{'model': 'ggml-gpt4all-j-v1.3-groovy',
 'usage': {'prompt_tokens': 331,
           'completion_tokens': 129,
           'total_tokens': 460},
 'choices': [{'message': {'role': 'assistant',
                          'content':
                              ' Sure, here are 10 fruits separated by commas:'
                              'apple, banana, cherry, grapefruit, kiwi, mango, peach, pear, pineapple, strawberry.'}}]}
And then:
fruits = result['choices'][0]['message']['content'].partition(':')[2].strip('.').split(',')
> len("['apple, banana, cherry, grapefruit, kiwi, mango, peach, "
"pear, pineapple, strawberry']")
86
> len("result['choices'][0]['message']['content'].partition(':')"
"[2].strip('.').split(',')")
82
Seriously, to save my mind from thinking up 10 fruits and to save 4 characters of typing… why use an ox cleaver to kill a chicken…
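One small gotcha worth noting: split(',') leaves a leading space on every word after the first, so a quick cleanup keeps the filenames tidy when we reuse the words later:

# The parsed list still has leading spaces from split(','),
# e.g. ['apple', ' banana', ' cherry', ...], so strip them before
# using the words in filenames later on.
fruits = [fruit.strip() for fruit in fruits]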
The machine learning term for creating a sound file from text is usually "Text-to-Speech", aka TTS. By default, a lot of operating systems already come with some pretty nifty TTS systems that you can access through the command prompt.
On macOS, you can try this in the terminal:
say -v Samantha "hello world"
And on Linux/Windows, you can first install espeak, then:
espeak -v en-us "hello world"
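If the built-in voices are good enough, a thin Python wrapper over the OS command would already do the trick (a sketch assuming macOS's say; the helper name is mine, and on Linux/Windows you'd swap in the espeak command instead):

import subprocess

def create_sound_os(text, outputfile):
    # macOS `say` can write the speech straight to an audio file (AIFF by default).
    subprocess.run(["say", "-v", "Samantha", "-o", outputfile, text], check=True)

create_sound_os("hello world", "hw.aiff")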
But if you’re living on the bleeding edge, you can try the SpeechT5 TTS model from the HuggingFace transformers library in Python:
import torch
from transformers import SpeechT5Processor, SpeechT5ForTextToSpeech, SpeechT5HifiGan
from datasets import load_dataset

# Load all the pretrained models...
processor = SpeechT5Processor.from_pretrained("microsoft/speecht5_tts")
model = SpeechT5ForTextToSpeech.from_pretrained("microsoft/speecht5_tts")
vocoder = SpeechT5HifiGan.from_pretrained("microsoft/speecht5_hifigan")

# Load xvector containing speaker's voice characteristics from a dataset.
embeddings_dataset = load_dataset("Matthijs/cmu-arctic-xvectors", split="validation")
speaker_embeddings = torch.tensor(embeddings_dataset[7306]["xvector"]).unsqueeze(0)

# Convert the text to speech, i.e. a tensor / array of floats.
text = "hello world"
inputs = processor(text=text, return_tensors="pt")
speech = model.generate_speech(inputs["input_ids"],
                               speaker_embeddings, vocoder=vocoder)
# [out]: tensor([-1.6944e-04, 1.0556e-04, ..., -3.7849e-05, -2.1459e-04])
Then, to write the torch.tensor object into an actual sound file, we’re gonna use the soundfile audio library, which is based on libsndfile, CFFI and NumPy.
import soundfile as sf

def create_sound(text, outputfile):
    inputs = processor(text=text, return_tensors="pt")
    speech = model.generate_speech(inputs["input_ids"],
                                   speaker_embeddings, vocoder=vocoder)
    sf.write(outputfile, speech.numpy(), samplerate=16000)

# Example usage of the TTS function.
create_sound("hello world", "hw.mp3")
The vanilla Stable Diffusion, where you type in a sentence and it generates the image, is just so 2022. It’s 2023, let’s do image generation with some manga flair!
Generating manga/anime has always been on the minds of enthusiasts. Manga fans had already shown impressive results with Generative Adversarial Networks (aka the “grandfather” of stable diffusion) back in 2019. And computer vision scientists didn’t give up even after Hayao Miyazaki’s “[AI generated anime is] an insult to life itself” comment.
“[AI generated anime is] an insult to life itself”
- Hayao Miyazaki (2018)
After multiple generations of different text-to-image models, we get “Kawaii Diffusion”!
from diffusers import DiffusionPipeline

diffusion_pipe = DiffusionPipeline.from_pretrained("Ojimi/anime-kawai-diffusion")
diffusion_pipe = diffusion_pipe.to("cuda")

def create_image(text, outputfile, prompt=""):
    # `prompt` is extra text appended to the flashcard word, e.g. " fruit on table".
    image = diffusion_pipe(
        text.lower().strip().replace('_', ' ') + prompt,
        negative_prompt="lowres, girl, bad anatomy").images[0]
    image.save(outputfile)

text = "hello world"
create_image(text, f"{text}.jpg")
And immediately we see something different in how we use the image generator compared to the sound generator: there is some sort of positive and negative “prompt engineering”. A few questions would come to an end-user’s mind, and I find them food for thought for machine learning model creators too:

- Why do we need to put lowres in the negative prompt to avoid low resolution images?
- Why can’t we prompt the speech model the same way, e.g. “Generate speech in a low deep voice: Hello World”?
- Could we make the voice sound like a vocaloid? Right now the speaker_embeddings do the magic, but can’t we prompt it like how we do text and image NLP?
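As far as I can tell, the closest thing to “prompting” the voice with SpeechT5 is swapping the speaker embedding rather than describing the voice in words. A minimal sketch, reusing the objects loaded earlier and picking an arbitrary x-vector row (which row sounds “low and deep” is pure trial and error):

# Pick a different speaker from the CMU ARCTIC x-vector dataset to change the voice.
other_speaker = torch.tensor(embeddings_dataset[1234]["xvector"]).unsqueeze(0)
speech = model.generate_speech(inputs["input_ids"], other_speaker, vocoder=vocoder)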
Now that we have sound and images, putting them together into a video requires some video-editing finesse. You might have come across some functions using moviepy, and I totally appreciate all the work put into the open source effort: 1-4 contributors maintaining a library that ~25K other libraries depend on. But nothing beats raw CLI when it comes to the feature-packed ffmpeg.
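For completeness, the moviepy route would look roughly like this (a sketch, untested here; the helper name is mine, and moviepy still drives ffmpeg under the hood):

from moviepy.editor import AudioFileClip, ImageClip

def create_video_moviepy(word):
    # One still image + one audio clip -> one short video clip.
    audio = AudioFileClip(f"{word}.mp3")
    clip = ImageClip(f"{word}.jpg").set_duration(audio.duration).set_audio(audio)
    clip.write_videofile(f"{word}.mp4", fps=24)

create_video_moviepy("hello-world")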
And so, I went the “cheat around it from Python” way:
word = "hello-world"
os.system(
f"ffmpeg -i {word}.jpg -i {word}.mp3 "
"-c:v libx264 -tune stillimage -c:a copy -preset:v veryslow "
f"{word}.mp4"
)
The ffmpeg command combines a single .jpg image file and an .mp3 sound file to create a .mp4 video file. There are a whole bunch of audio-video options that I don’t really know, and I’ll boldly say I took the command from https://superuser.com/questions/1041816/combine-one-image-one-audio-file-to-make-one-video-using-ffmpeg and added a few other options here and there.
And finally, to combine the video files for each word, we do:

os.system(
    """for f in `ls *.mp4 | sort -g`; do echo "file '$f'" """
    ">> mylist.txt; done"
)
os.system("ffmpeg -f concat -i mylist.txt generated-flashcards.mp4")
The shell loop goes through the .mp4 files, sorts them in alphanumerical order, and puts the list into a text file. Then ffmpeg’s concat option can read that file to create the final flashcard video.
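So mylist.txt ends up looking something like this (assuming the {index}-{word} file naming used in the loop below):

file '0-apple.mp4'
file '1-banana.mp4'
file '2-cherry.mp4'
...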
Looping through the fruit list given by ChatGPT, we do this:
import os

for i, fruit in enumerate(fruits):
    create_sound(fruit, f"{i}-{fruit}.mp3")
    create_image(fruit, f"{i}-{fruit}.jpg", prompt=" fruit on table")
    os.system(f"ffmpeg -i {i}-{fruit}.jpg -i {i}-{fruit}.mp3 "
              "-c:v libx264 -tune stillimage -c:a copy -preset:v veryslow "
              f"{i}-{fruit}.mp4")

os.system("rm mylist.txt")
os.system("""for f in *.mp4; do echo "file '$f'" >> mylist.txt; done""")
os.system("ffmpeg -f concat -i mylist.txt ai-shichida.mp4")
When will I ever get the 1,000,000-subscriber gold play button?
Notes:
- The GPT4All model isn’t the paid ChatGPT service from OpenAI, but for simplicity of understanding it’s a large language model somehow pre-trained on “human aligned instructions”.
- … create_image 10+ times…