Text-to-image generation is not a new idea. Notably, the Generative Adversarial Network (GAN) architecture, a once-popular deep-learning computer vision algorithm, had already been used to generate birds and flowers from text.
After further improvements to generative image algorithms, such as hyperrealistic generation of human faces and the GLIDE diffusion model, we now have a deluge of commercial text-to-image generators, including Google’s Imagen, TikTok’s Greenscreen and OpenAI’s DALL-E.
Stable Diffusion
Before moving ahead: for those who want to understand the nitty-gritty of diffusion models, I strongly encourage reading this awesome blog post by Lilian Weng, https://lilianweng.github.io/posts/2021-07-11-diffusion-models/, and this tweet by Tom Goldstein.
The rest of the article is an exercise in using the Stable Diffusion model to generate images given “my name”. There is no real purpose to this other than satisfying my curiosity.
Be honest, don’t we all google our own names from time to time? And thus, I went to https://huggingface.co/spaces/stabilityai/stable-diffusion and typed in “Liling Tan”.
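For readers who would rather run the same experiment locally than through the hosted demo, here is a minimal sketch. It assumes the Hugging Face diffusers library and the CompVis/stable-diffusion-v1-4 checkpoint, neither of which I used for this post (everything here went through the web demo), and the helper names slugify and generate are made up for illustration.

```python
import re

# Assumption: this checkpoint name is my choice for the sketch; the hosted
# demo may be running a different version of the model.
MODEL_ID = "CompVis/stable-diffusion-v1-4"


def slugify(prompt: str) -> str:
    """Turn a prompt into a filesystem-friendly name (hypothetical helper)."""
    return re.sub(r"[^a-z0-9]+", "_", prompt.lower()).strip("_")


def generate(prompt: str, n_images: int = 4) -> list:
    """Generate n_images for a prompt and return the paths of the saved files."""
    # Imported lazily so this file still loads without diffusers installed.
    from diffusers import StableDiffusionPipeline

    pipe = StableDiffusionPipeline.from_pretrained(MODEL_ID)
    result = pipe(prompt, num_images_per_prompt=n_images)

    # The pipeline's safety checker reports, per image, whether it flagged it.
    flags = result.nsfw_content_detected or [False] * len(result.images)

    paths = []
    for i, (image, flagged) in enumerate(zip(result.images, flags)):
        if flagged:
            continue  # skip images the safety checker flagged
        path = f"{slugify(prompt)}_{i}.png"
        image.save(path)
        paths.append(path)
    return paths
```

Calling generate("Liling Tan") would download the model weights on first use and is slow without a GPU, so treat this as a starting point rather than a recipe.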
Before I show the generated results, I would like to clarify some specifics about my first and last name. In general, Liling, an English romanization of my Chinese-character name, is more likely to refer to a female individual than a male one. Next, Tan (陳) is a common Southern Chinese name that originates from the Min language; more commonly, you would expect the English translation/romanization from Mandarin Chinese, Chen. With K-pop being a global phenomenon, the top-ranked Google result for Chen points to the Korean singer from the K-pop band EXO. Also, since Tan (my last name) has the same spelling as tan the color, the top Google results end up pointing to “a yellowish-brown color”.
For reference, here are the control experiment results from googling “liling tan”:
I kind of expected the image to show an Asian female but the first image was kinda weird.
Interestingly, the model has some internal mechanism to block “unsafe” content, and it output this error message:
This Image was not displayed because our detection model detected Unsafe content.
So I did the natural thing and ran a reverse image search with https://lens.google.com/
And of course, what was I expecting, surely Google Shopping will take the chance to advertise and sell me something -_-
But we see two separate dots, which indicate that two other results exist. Let’s see the first one, at the bottom.
Let’s try the result for the other dot. Now this is interesting: it’s trying to promote a piece inspired by Girl with a Pearl Earring (ca. 1665) by Johannes Vermeer.
Does it really find the source of the images that the model uses to slice, dice and “diffuse”?
It’s hard to say:
Now that we have seen the underbelly of the results, let’s backtrack to the OG paper listed in the original source code repository, https://github.com/CompVis/stable-diffusion
High-Resolution Image Synthesis with Latent Diffusion Models. Robin Rombach*, Andreas Blattmann*, Dominik Lorenz, Patrick Esser, Björn Ommer. In CVPR '22.
According to the paper, the model is pretrained on the LAION database https://laion.ai/blog/laion-5b/ and finetuned on the Conceptual Captions dataset https://ai.google.com/research/ConceptualCaptions/.
Our model is pretrained on the LAION [73] database and finetuned on the Conceptual Captions [74] dataset.
Perhaps if we can find the approximate nearest neighbor images to the outputs generated by the model, then it might be possible to “explain away” which images the model had deemed to be “salient” enough to diffuse and generate the outputs from my name.
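To make the nearest-neighbor idea concrete, here is a toy sketch. Note that it swaps in exact brute-force search with cosine similarity, not a true approximate index, and that the file names and three-dimensional “embeddings” are invented for illustration. A real setup would first need to embed the generated image and the training images with some image encoder, and would need an approximate index to cope with billions of images.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # treat an all-zero vector as similar to nothing
    return dot / (norm_a * norm_b)


def nearest_neighbors(query, index, k=3):
    """Return the k entries of index most similar to query, best first."""
    scored = [(cosine_similarity(query, vec), name) for name, vec in index.items()]
    scored.sort(reverse=True)
    return [(name, score) for score, name in scored[:k]]


# Hypothetical stand-ins for training-image embeddings.
TOY_INDEX = {
    "portrait_a.jpg": [0.9, 0.1, 0.0],
    "tan_colour_swatch.jpg": [0.0, 0.2, 0.9],
    "portrait_b.jpg": [0.7, 0.3, 0.1],
}

# Hypothetical embedding of one generated output.
generated = [1.0, 0.2, 0.0]

for name, score in nearest_neighbors(generated, TOY_INDEX, k=2):
    print(f"{name}\t{score:.3f}")
```

With these made-up vectors the two portraits rank above the colour swatch, which is the kind of ranking one would hope to read off when asking which training images sit closest to a generated output.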
Voila! Here’s a search engine to probe the dataset used to train Stable Diffusion, https://waxy.org/2022/08/exploring-12-million-of-the-images-used-to-train-stable-diffusions-image-generator/
Search results for “Liling” in the LAION database.
Ah ha, that is one image that the model must have “chosen to diffuse”.
It is highly possible that the model somehow “focused” on an image in the training data with “Liling” in the caption and generated an image similar to it.
To conclude the post: this exercise was purely out of curiosity. I wanted to know what the model would generate. But it also highlights a kind of bias in generative models. While “Liling Tan” isn’t stable enough to generate anything close to me or any other top-ranked “Liling Tan” search result, the model seems to be more stable when generating famous people (e.g. John Oliver).
To end this article, here’s a result from generating images with my online handle, alvations, as the prompt…