Stable Diffusion, Unstable Me: Text-to-image Generation

by Liling Tan (@alvations), September 5th, 2022

Too Long; Didn't Read

Text-to-image generation is not a new idea, and Stable Diffusion is the new kid on the block. What if you feed <your name> to a state-of-the-art image generation model?

Text-to-image generation is not a new idea. Notably, the Generative Adversarial Network (GAN) architecture, a once-popular deep-learning computer vision algorithm, had generated birds and flowers from text.

Figure 1 from Reed et al. (2016) "Generative Adversarial Text to Image Synthesis"

After more improvements to generative image algorithms, such as hyperrealistic generation of human faces and the GLIDE diffusion model, we now have a deluge of commercial text-to-image generators, ranging from Google’s Imagen and TikTok’s Greenscreen to OpenAI’s DALL-E.

Google Imagen samples from Saharia et al. (2022)

And here comes the new kid on the block, Stable Diffusion.

Before moving ahead, for those who want to understand the nitty-gritty of diffusion models, I strongly encourage readers to go through this awesome blogpost by Lilian Weng, and this tweet by Tom Goldstein.

The rest of the article is an exercise in using the Stable Diffusion model to generate images given “my name”. There is no main purpose for doing so, other than satisfying my curiosity.

Narcissistic as it sounds, I want to generate an image with my name.

Be honest, don’t we all “google our own name” from time to time? And thus, I went to the Stable Diffusion demo on Hugging Face and typed in Liling Tan.

Screenshot of the Stable Diffusion Demo on Huggingface

Before I show the generated results, I would like to clarify some specifics about my first and last name. In general, Liling, an English romanization of my Chinese-character name, is more likely to refer to a female than a male. Next, Tan (陳) is a common Southern Chinese surname whose romanization originates from the Min language; the more commonly expected romanization, from Mandarin Chinese, is Chen. With K-pop being a global phenomenon, the top-ranked Google result for Chen points to the Korean singer from the K-pop band EXO. Also, since Tan (my last name) has the same spelling as tan the color, the top Google search results point to “a yellowish-brown color”.

For reference, here are the control experiment’s results, from googling “liling tan”:

Google Search image results for "liling tan"

And now, the results…

Outputs of Diffusion Model when given "Liling Tan" as the prompt

Okay, that’s definitely nowhere close to how I look.

I kind of expected the image to show an Asian female but the first image was kinda weird.

Now, what happens if I lowercase the prompt and re-run the generation?

Outputs of Diffusion Model when given "liling tan" (lowercased) as the prompt

Alright, the model seems hell-bent on some facial features, and it sorta generates one older and one younger version of “liling tan”.

Interestingly, the model has some internal mechanism to block “unsafe” content, and it outputs this error message:

This Image was not displayed because our detection model detected Unsafe content.

Then I got really curious, do the two versions of “liling tan” exist IRL (in real life)?

So I did the natural thing and ran a reverse image search.

Search by Image - Result 1 (face only)

Search by Image - Result 2 (face only)

Hmmm, no results, let’s extend the frame to beyond the face…

And of course, what was I expecting, surely Google Shopping will take the chance to advertise and sell me something -_-

Search by Image - Result 1 (full image)

Maybe the older version generated by the model is more grounded in someone IRL?

Hmmm, no results, so the model kinda generated two unique people who don’t exist IRL?

But we see two separate dots indicating that two other results exist. Let’s see the first one at the bottom.

Of course, it will try to sell me something again… What was I expecting? @_@

Let’s try the result for the other dot. Now this is interesting, it’s trying to promote a piece inspired by Girl with a Pearl Earring (ca. 1665) by Johannes Vermeer.

But what about “Find image source”?

Does it really find the source of the images that the model uses to slice, dice and “diffuse”?

It’s hard to say:

Find the Image Source - Result 2

Đợi tí (Wait a minute), does that mean that we don’t know which images the model has been trained on or used to splice before generating the results?

Now that we have found the underbelly of the results, let’s backtrack to the OG paper listed in the original source code:

High-Resolution Image Synthesis with Latent Diffusion Models. Robin Rombach*, Andreas Blattmann*, Dominik Lorenz, Patrick Esser, Björn Ommer. In CVPR '22.

According to the paper, the model is pre-trained on the LAION database and fine-tuned on the Conceptual Captions dataset:

Our model is pretrained on the LAION [73] database and finetuned on the Conceptual Captions [74] dataset.

What if we build a reverse image search based on the datasets?

Perhaps if we can find the approximate nearest neighbor images to the outputs generated by the model, then it might be possible to “explain away” which images the model had deemed to be “salient” enough to diffuse and generate the outputs from my name.
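To make the idea concrete, here is a minimal sketch of a nearest-neighbor lookup. It assumes every image (training or generated) has already been turned into an embedding vector by some image encoder, which is not shown here. The names `toy_index` and `generated` and all the vector values are hypothetical, and the search below is an exact brute-force scan standing in for a real approximate nearest neighbor index, which would be needed at LAION scale.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0  # treat zero vectors as dissimilar to everything
    return dot / (norm_a * norm_b)


def nearest_neighbors(query, index, k=2):
    """Return the k (name, score) pairs most similar to the query.

    This is an exact brute-force scan over every entry; an approximate
    index would trade a little accuracy for speed on large datasets.
    """
    scored = [(cosine_similarity(query, vec), name) for name, vec in index.items()]
    scored.sort(reverse=True)
    return [(name, score) for score, name in scored[:k]]


# Hypothetical toy embeddings standing in for training images.
toy_index = {
    "train_img_a": [0.9, 0.1, 0.0],
    "train_img_b": [0.0, 1.0, 0.0],
    "train_img_c": [0.7, 0.6, 0.1],
}

# Hypothetical embedding of one generated output.
generated = [1.0, 0.2, 0.0]

for name, score in nearest_neighbors(generated, toy_index, k=2):
    print(f"{name}: {score:.3f}")
```

The training images that rank highest would be the candidates for what the model leaned on, though similarity in embedding space is only suggestive, not proof of where an output came from.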

Voila! Here’s a search engine to probe the dataset used to train Stable Diffusion,

And here’s the list of results of searching for Liling in the LAION database.

Ah ha, that is one image that the model must have “chosen to diffuse”

One of the image search results from searching "Liling" on

It is highly possible that the model somehow “focused” on an image in the training data with “Liling” in the caption and generated an image similar to that image.
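As a rough illustration of what “an image with Liling in the caption” means, here is a small sketch that filters caption records by a search term. It is a simplification: the record layout, the URLs and the captions in `toy_records` are all made up for the example, and a real dataset search engine would likely match on embeddings rather than plain substrings.

```python
def search_captions(records, term):
    """Return records whose caption contains the term, ignoring case."""
    needle = term.casefold()
    return [r for r in records if needle in r["caption"].casefold()]


# Hypothetical (url, caption) pairs; not taken from any real dataset.
toy_records = [
    {"url": "https://example.com/1.jpg", "caption": "Portrait of Liling at the studio"},
    {"url": "https://example.com/2.jpg", "caption": "A tan leather handbag"},
    {"url": "https://example.com/3.jpg", "caption": "liling county street view"},
]

hits = search_captions(toy_records, "Liling")
for record in hits:
    print(record["url"], "|", record["caption"])
```

Note how a search for “tan” on the same toy data would return the handbag, which mirrors the name-versus-color ambiguity mentioned earlier.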

Why did you bother to “diffuse <your_name>”?

To conclude the post, this exercise was purely out of curiosity. I wanted to know what the model would generate. But this exercise also highlights some sort of bias when using generative models. While “Liling Tan” isn’t stable enough to generate anything close to me or any other top-ranked “Liling Tan” search results, the model seems to be more stable in generating famous people (e.g. John Oliver).

To end this article, here’s a result from generating images with my online handle alvations as the prompt…

Thế à! (Interesting!) A magic card?! - Images generated with "alvations" as prompt

Hopefully, after reading, you will also try to “diffuse <your_name>”!