Stable Diffusion created a storm generating award-winning images, created a version of me in an alternate universe, and recently got in a whirlwind of PR/HR issues. And now the new kids on the block, language models, seem to do everything from predicting stock prices to becoming the CEO of a brand new startup. ChatGPT is all the hype now and there’s no shortage of creative applications for generative models.

Despite all the advancement in AI, we are still applying these models with caution, especially when creating education/information related products.

That fateful morning, my wife repeatedly played YouTube flashcard videos for infant/toddler’s right brain development. And she asked, “ChatGPT can do everything, why don’t they make flash card videos?” Then I replied, “Challenge accepted”. Hence this article on the rabbit hole of trying to automatically create flash card videos using a couple of generative AI models.

What are Right Brain Development Flashcards?

They are a series of flashcards shown to infants/toddlers in quick succession, usually less than 1 second per card. The flashcards in each session / video belong to a specific category or topic, e.g. showing a series of fruits ['apples', 'apricot', 'banana', 'blueberry', 'coconut'].

https://youtu.be/-shR2MAXT0Q?t=71

The general idea is to automatically generate a video of flashcards given a specified list, thus it’ll require:

- Some way to generate a sound file of each word in the list (.mp3 or .wav)
  - Ah-ha! Text-to-speech (TTS) models!
- Some way to create an image file given a word / string
  - Definitely some Stable Diffusion thing
- A way to combine the image with the sound files into a video (.mp4)
  - I don’t know the solution yet but it’s surely some ffmpeg CLI command

Vänta ett ögonblick! (Wait a minute!) Where is ChatGPT in this? Alright, alright.
Let’s change the specifications a little. Instead of generating a flashcard video given a list of vocabulary words, let’s “prompt” ChatGPT to give us a list of words given a specific topic:

- Give ChatGPT a topic and ask it to give me a list of words for kids’ language learning
- Some way to generate a sound file of each word in the list
- Some way to create an image file given a word / string
- A way to combine the image with the sound files into a video

Hey ChatGPT, “Name 10 fruits”

    from gpt4all import GPT4All

    # I know it's not the actual "ChatGPT" from OpenAI..
    # But still it's easier to just personalize the LLM as "ChatGPT".
    # Note: chat_completion() is the API of the gpt4all Python bindings used
    # for this post; newer releases may expose a different interface.
    gptj = GPT4All("ggml-gpt4all-j-v1.3-groovy")
    messages = [{"role": "user", "content": "Name 10 fruits"}]
    gptj.chat_completion(messages)

    [out]:
    {'model': 'ggml-gpt4all-j-v1.3-groovy',
     'usage': {'prompt_tokens': 240, 'completion_tokens': 304, 'total_tokens': 544},
     'choices': [{'message': {'role': 'assistant',
       'content': '1. Apple, Banana, Orange (all fruits)'
                  '2. Pear, Cherry, Grape (all fruits)'
                  '3. Kiwi, Mango (fruits)'
                  '4. Pineapple, Strawberry (fruits)'
                  '5. Watermelon, Lemon (fruits)'
                  '6. Kiwi, Mango (fruits)'
                  '7. Apple, Banana, Orange (all fruits)'
                  '8. Pineapple, Strawberry (fruits)'
                  '9. Watermelon, Lemon (fruits)'
                  '10. Kiwi, Mango (fruits)'}}]}

My reaction: Either I don’t know how to prompt, or ChatGPT doesn’t know how to count -_-

Hey ChatGPT, “List down 10 fruit words and only 10.”

    messages = [{"role": "user", "content": "List down 10 fruit words and only 10."}]
    gptj.chat_completion(messages)

    [out]:
    {'model': 'ggml-gpt4all-j-v1.3-groovy',
     'usage': {'prompt_tokens': 263, 'completion_tokens': 157, 'total_tokens': 420},
     'choices': [{'message': {'role': 'assistant',
       'content': ' Sure, here are 10 fruit words that I can think of:'
                  '1. Apple '
                  '2. Banana '
                  '3. Orange '
                  '4. Pear '
                  '5. Cherry '
                  '6. Strawberry '
                  '7. Kiwi '
                  '8. Grapes '
                  '9. Pineapple '
                  '10. Watermelon'}}]}

My reaction: Now the list looks right but I can’t put it in a Python list easily. If I spend time editing the list into runnable code, I might as well type the words out myself @_@

Hey ChatGPT, “List down 10 fruit words, separate the list by commas without numbers”

    prompt = "List down 10 fruit words, separate the list by commas without numbers"
    messages = [{"role": "user", "content": prompt}]
    gptj.chat_completion(messages)

    [out]: ' apple, banana, orange'

My reaction: What the %^&*?! I seriously can do a better job with https://www.google.com/search?q=list+10+fruits (╯_╰) |||

I’m going to try some passive-aggression with ChatGPT:

    messages = [{"role": "user",
                 "content": "Can you count seriously? I'm telling you to list down 10 fruits. "
                            'List 10 fruit words, separate by commas.'}]
    result = gptj.chat_completion(messages)

    [out]:
    {'model': 'ggml-gpt4all-j-v1.3-groovy',
     'usage': {'prompt_tokens': 331, 'completion_tokens': 129, 'total_tokens': 460},
     'choices': [{'message': {'role': 'assistant',
       'content': ' Sure, here are 10 fruits separated by commas:'
                  'apple, banana, cherry, grapefruit, kiwi, mango, peach, pear, pineapple, strawberry.'}}]}

And then:

    # Take everything after the colon, drop the trailing period, split on commas.
    fruits = result['choices'][0]['message']['content'].partition(':')[2].strip('.').split(',')

My reaction:

    > len("['apple, banana, cherry, grapefruit, kiwi, mango, peach, "
          "pear, pineapple, strawberry']")
    86
    > len("result['choices'][0]['message']['content'].partition(':')"
          "[2].strip('.').split(',')")
    82

Seriously, all that to save my mind from thinking up 10 fruits and to save 4 characters when typing them out… 割鸡焉用牛刀 (“why use an ox cleaver to kill a chicken?”)…

Now that the ChatGPT and prompt-engineering clickbait part is over, time to get serious with the flashcard generation.

From Text to Speech

The machine learning term for creating a sound file from a given text is usually “text-to-speech”, aka TTS. By default, a lot of operating systems already come with some pretty nifty TTS systems that you can access through the command prompt.
On macOS, you can try this in the terminal:

    say -v Samantha "hello world"

And on Linux/Windows, you can first install espeak, then:

    espeak -v en-us "hello world"

But if you’re living on the bleeding edge, you can try the SpeechT5 TTS model from the HuggingFace transformers library in Python:

    import torch
    from transformers import SpeechT5Processor, SpeechT5ForTextToSpeech, SpeechT5HifiGan
    from datasets import load_dataset

    # Load all the pretrained models...
    processor = SpeechT5Processor.from_pretrained("microsoft/speecht5_tts")
    model = SpeechT5ForTextToSpeech.from_pretrained("microsoft/speecht5_tts")
    vocoder = SpeechT5HifiGan.from_pretrained("microsoft/speecht5_hifigan")

    # Load xvector containing speaker's voice characteristics from a dataset.
    embeddings_dataset = load_dataset("Matthijs/cmu-arctic-xvectors", split="validation")
    speaker_embeddings = torch.tensor(embeddings_dataset[7306]["xvector"]).unsqueeze(0)

    # Converts the text to speech, as a tensor / array of floats.
    text = "hello world"
    inputs = processor(text=text, return_tensors="pt")
    speech = model.generate_speech(inputs["input_ids"], speaker_embeddings, vocoder=vocoder)
    # [out]: tensor([-1.6944e-04, 1.0556e-04, ..., -3.7849e-05, -2.1459e-04])

Then to write the torch.tensor object into an actual sound file, we’re gonna use soundfile, an audio library based on libsndfile, CFFI and NumPy.

    import soundfile as sf

    def create_sound(text, outputfile):
        inputs = processor(text=text, return_tensors="pt")
        speech = model.generate_speech(inputs["input_ids"], speaker_embeddings, vocoder=vocoder)
        # SpeechT5 generates audio at 16 kHz.
        # Writing .mp3 needs a libsndfile build with MP3 support (1.1.0+);
        # use a .wav filename otherwise.
        sf.write(outputfile, speech.numpy(), samplerate=16000)

    # Example usage of the TTS function.
    create_sound("hello world", "hw.mp3")

Stable Diffusion (Manga Mode!)

The vanilla Stable Diffusion where you type in a sentence and it generates the image is just so 2022. It’s 2023, let’s do image generation with some manga flair!

Generating manga/anime has always been on the minds of enthusiasts.
Manga fans had already shown impressive results with Generative Adversarial Networks in 2019 (aka “grandfather of stable diffusion”). And computer vision scientists haven’t given up ever since Hayao Miyazaki’s “[AI generated anime is] an insult to life itself” comment.

    “[AI generated anime is] an insult to life itself”
    - Hayao Miyazaki (2018)

After multiple generations of different text-to-image models, we get “Kawaii Diffusion”!

Give me the code already!!

    from diffusers import DiffusionPipeline

    diffusion_pipe = DiffusionPipeline.from_pretrained("Ojimi/anime-kawai-diffusion")
    diffusion_pipe = diffusion_pipe.to("cuda")

    def create_image(text, outputfile, prompt=""):
        # Normalize the word, then append any extra "positive" prompt to it.
        image = diffusion_pipe(
            text.lower().strip().replace('_', ' ') + prompt,
            negative_prompt="lowres, girl, bad anatomy").images[0]
        image.save(outputfile)

    text = "hello world"
    create_image(text, f"{text}.jpg")

And immediately we see a difference between how we use an image generator and how we use a sound generator. There is some sort of positive and negative “prompt engineering”. A few questions would come to an end-user, and I find them food for thought for machine learning model creators too:

- Why is it that we need to prompt to get a good image generation?
  - E.g. why do we need to explicitly add the negative prompt lowres to avoid low resolution?
- Why is it that we don’t prompt text-to-speech models?
  - E.g. why can’t we do something like “Generate speech in a low deep voice: Hello World”?
  - I guess that’s where selecting a vocaloid and speaker_embeddings do the magic, but can’t we prompt it like how we do text and image NLP?
- Can abstract ideas like “Hello World” be generated into an image?
  - Can an artist draw an abstract idea without research/inspiration?
- Are there other modalities of speech (phonopragmatics) that modern TTS models can’t emulate?

Going back to Flashcard Generation

Now that we have sound and images, putting them together into a video requires some video-editing finesse.
You might have come across some functions using moviepy, and I totally appreciate all the work put into the open source, with 1-4 contributors maintaining a library that serves ~25K other libraries depending on it. But nothing beats raw CLI when it comes to the feature-packed ffmpeg.

And so, I went the “cheating” around Python way:

    import os

    word = "hello-world"
    os.system(
        f"ffmpeg -i {word}.jpg -i {word}.mp3 "
        "-c:v libx264 -tune stillimage -c:a copy -preset:v veryslow "
        f"{word}.mp4"
    )

The ffmpeg command combines a single .jpg image file and an .mp3 sound file to create a .mp4 video file. There are a whole bunch of audio-video options that I don’t really know and, I boldly say, I took the command from https://superuser.com/questions/1041816/combine-one-image-one-audio-file-to-make-one-video-using-ffmpeg and added a few other options here and there.

And finally to combine the video file for each word, we do:

    os.system(
        """for f in `ls *.mp4 | sort -g`; do echo "file '$f'" """
        ">> mylist.txt; done"
    )
    os.system("ffmpeg -f concat -i mylist.txt generated-flashcards.mp4")

The shell loop goes through the .mp4 files, sorts them in alphanumerical order, and puts the list into a text file. Then ffmpeg can use the concat file argument to create the final flashcard video file.

Looping through the fruit list given by ChatGPT, we do this:

    import os

    for i, fruit in enumerate(fruits):
        # The comma-split list keeps leading spaces (' banana'), which would
        # break the unquoted filenames in the ffmpeg command below.
        fruit = fruit.strip()
        create_sound(fruit, f"{i}-{fruit}.mp3")
        create_image(fruit, f"{i}-{fruit}.jpg", prompt=" fruit on table")
        os.system(
            f"ffmpeg -i {i}-{fruit}.jpg -i {i}-{fruit}.mp3 "
            "-c:v libx264 -tune stillimage -c:a copy -preset:v veryslow "
            f"{i}-{fruit}.mp4"
        )

    # Start from a fresh list file (-f: no error if it doesn't exist yet).
    os.system("rm -f mylist.txt")
    os.system("""for f in `ls *.mp4 | sort -g`; do echo "file '$f'" >> mylist.txt; done""")
    os.system("ffmpeg -f concat -i mylist.txt ai-shichida.mp4")

Ta-da! Now we’ve created an AI generated flashcard video…

https://youtu.be/O-abE4CqbBg

My reaction: When will I ever get the 1,000,000 subscribers Gold Play Button?
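One caveat on the concat step above: a plain `*.mp4` shell glob sorts names as text, so with more than ten words `10-kiwi.mp4` would land before `2-cherry.mp4`, and a leftover final video from an earlier run would be swept into the list too. Below is a minimal sketch of building the list file in Python instead, assuming the clips follow the `{i}-{word}.mp4` naming used in the loop. The helper names `clip_index` and `write_concat_list` are hypothetical (they are not from the original notebook), and the ffmpeg call is left commented out since it needs the clips on disk.

```python
import re
from pathlib import Path


def clip_index(path):
    """Return the leading integer of a '{i}-{word}.mp4' name, or None."""
    match = re.match(r"(\d+)-", Path(path).name)
    return int(match.group(1)) if match else None


def write_concat_list(folder, list_name="mylist.txt"):
    """Write an ffmpeg concat list of the numbered clips, in numeric order."""
    # Keep only clips with a numeric prefix, so earlier outputs such as
    # 'ai-shichida.mp4' are not concatenated into the new video.
    clips = [p for p in Path(folder).glob("*.mp4") if clip_index(p) is not None]
    clips.sort(key=clip_index)
    lines = [f"file '{p.name}'" for p in clips]
    (Path(folder) / list_name).write_text("\n".join(lines) + "\n")
    return [p.name for p in clips]


# Example usage (run from the folder that holds the per-word clips):
# import subprocess
# write_concat_list(".")
# subprocess.run(
#     ["ffmpeg", "-f", "concat", "-i", "mylist.txt", "ai-shichida.mp4"],
#     check=True,
# )
```

Passing the ffmpeg arguments as a list also sidesteps shell quoting, which matters as soon as a word contains a space.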
Epilogue / Disclaimers

- The full code to create your own flash cards is available on https://www.kaggle.com/alvations/generative-language-learning-flashcard
- I understand that the GPT4All model isn’t the paid ChatGPT service from OpenAI, but for simplicity of understanding it’s a large language model somehow pre-trained on “human aligned instructions”.
- Note that the “prompt engineering” attempted for this post was much longer than the examples posted in the “ChatGPT” section above; I tried a lot more prompts before the passive-aggressive one.
- You might have to loop through the fruits and re-create the images multiple times to get a generated image that you desire. The final video in this post was created after looping through create_image 10+ times…

Here’s a list of flashcards I’ve created using the same TTS + Diffusion code:

- Things that move (Vehicles)
- Animals (Chinese Zodiac)
- Fruits (The one shown in this post)
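On the prompt engineering note above: much of the retrying was about coaxing a reply that is easy to turn into a Python list. A small, forgiving parser may save a few of those retries. This is only a sketch, assuming the reply looks like one of the shapes seen in this post (an optional “Sure, here are …:” preamble ending in a colon, followed by numbered items or a comma-separated list). `parse_word_list` is a hypothetical helper, not part of gpt4all.

```python
import re


def parse_word_list(reply):
    """Extract a clean list of lowercase words from a chatty LLM reply."""
    # Drop an optional preamble such as "Sure, here are 10 fruits:".
    _, sep, tail = reply.partition(":")
    body = tail if sep else reply
    # Turn list numbering like "1." or "10." into a separator.
    body = re.sub(r"\d+\.\s*", ",", body)
    # Split on commas or newlines, then trim spaces and trailing periods.
    words = [w.strip(" .\n").lower() for w in re.split(r"[,\n]", body)]
    return [w for w in words if w]


# Example usage with the shortest reply from this post:
print(parse_word_list(" apple, banana, orange"))
# prints ['apple', 'banana', 'orange']
```

It will still be fooled by replies like the first one in this post, where each numbered line holds several fruits plus a parenthetical, so it complements a clear prompt rather than replacing it.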