The Chosen One: Consistent Characters in Text-to-Image Diffusion Models: Societal Impact

by Gamifications, July 18th, 2024

Too Long; Didn't Read

In this study, researchers introduce an iterative procedure that, at each stage, identifies a coherent set of images sharing a similar identity.

Authors:

(1) Omri Avrahami, Google Research and The Hebrew University of Jerusalem;

(2) Amir Hertz, Google Research;

(3) Yael Vinker, Google Research and Tel Aviv University;

(4) Moab Arar, Google Research and Tel Aviv University;

(5) Shlomi Fruchter, Google Research;

(6) Ohad Fried, Reichman University;

(7) Daniel Cohen-Or, Google Research and Tel Aviv University;

(8) Dani Lischinski, Google Research and The Hebrew University of Jerusalem.

C. Societal Impact

We believe that the emergence of technology that facilitates the effortless creation of consistent characters holds exciting promise in a variety of creative and practical applications. It can empower storytellers and content creators to bring their narratives to life with vivid and unique characters, enhancing the immersive quality of their work. In addition, it may offer accessibility to those who may not possess traditional artistic skills, democratizing character design in the creative industry. Furthermore, it can reduce the cost of advertising, and open up new opportunities for small and underprivileged entrepreneurs, enabling them to reach a wider audience and compete in the market more effectively.


On the other hand, as with any other generative AI technology, it can be misused to create false and misleading visual content for deceptive purposes. Fake characters or personas can be used for online scams, disinformation campaigns, etc., making it challenging to distinguish genuine information from fabricated content. Such technologies underscore the vital importance of developing systems that detect generated content, which is a compelling research direction.


Figure 12. Qualitative comparison of ablations. We ablated the following components of our method: using a single iteration, removing the clustering stage, removing the LoRA trainable parameters, and using the same initial representation at every iteration. As can be seen, all of these ablated cases struggle to preserve the character’s consistency.


Figure 13. Consistent generation of non-character objects. Our approach is applicable to a wide range of objects, without the requirement for them to depict human characters or creatures.


Figure 14. Additional results. Our method is able to consistently generate different types and styles of characters, e.g., paintings, animations, stickers and vector art.


Figure 15. Life story. Given a text prompt describing a fictional character, “a photo of a man with short black hair”, we can generate a consistent life story for that character, demonstrating the applicability of our method for story generation.


Figure 16. Non-determinism. By running our method multiple times, given the same prompt “a photo of a 50 years old man with curly hair”, but using different initial seeds, we obtain different consistent characters corresponding to the text prompt.


Figure 17. Non-determinism. By running our method multiple times, given the same prompt “a Plasticine of a cute baby cat with big eyes”, but using different initial seeds, we obtain different consistent characters corresponding to the text prompt.


Figure 18. Qualitative comparison to naïve baselines. We tested two additional naïve baselines against our method: TI [20] and LoRA DB [71] that were trained on a small dataset of 5 images generated from the same prompt. The baselines are referred to as TI multi (left column) and LoRA DB multi (middle column). As can be seen, both of these baselines fail to extract a consistent identity.


Figure 19. Comparison to naïve baselines. We tested two additional naïve baselines against our method: TI [20] and LoRA DB [71] that were trained on a small dataset of 5 images generated from the same prompt. The baselines are referred to as TI multi and LoRA DB multi. Our automatic testing procedure, described in Section 4.1, measures identity consistency and prompt similarity. As can be seen, both of these baselines fail to achieve high identity consistency.


Figure 20. Comparison of feature extractors. We tested two additional feature extractors in our method: DINOv1 [14] and CLIP [61]. Our automatic testing procedure, described in Section 4.1, measures identity consistency and prompt similarity. As can be seen, DINOv1 produces higher identity consistency by sacrificing prompt similarity, while CLIP results in higher prompt similarity at the expense of lower identity consistency. In practice, however, the DINOv1 results are similar to those obtained with DINOv2 features in terms of prompt adherence (see Figure 21).


Figure 21. Comparison of feature extractors. We experimented with two additional feature extractors in our method: DINOv1 [14] and CLIP [61]. As can be seen, DINOv1 results are qualitatively similar to DINOv2, whereas CLIP produces results with a slightly lower identity consistency.


Figure 22. Clustering visualization. We visualize the clustering of images generated with the prompt “a purple astronaut, digital art, smooth, sharp focus, vector art”. In the initial iteration (top three rows), our algorithm divides the generated images into three clusters: (1) emphasizing the astronaut’s head, (2) an astronaut without a face, and (3) a full-body astronaut. Cluster 1 (top row) is the most cohesive cluster, and it is chosen for the identity extraction phase. In the subsequent iteration (bottom three rows), all images adopt the same extracted identity, and the clusters mainly differ from each other in the pose of the character.
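The selection step described in this caption can be illustrated with a minimal sketch. The caption does not say how cohesion is measured, so the sketch assumes it is the mean squared Euclidean distance of a cluster's members to their centroid, with the smallest value counting as the most cohesive. The function names, the cluster labels, and the toy 2-D vectors are all hypothetical stand-ins for real image embeddings.

```python
from typing import Dict, List, Sequence

Vector = Sequence[float]


def centroid(points: List[Vector]) -> List[float]:
    """Component-wise mean of a non-empty list of equal-length vectors."""
    dim = len(points[0])
    return [sum(p[i] for p in points) / len(points) for i in range(dim)]


def mean_sq_dist_to_centroid(points: List[Vector]) -> float:
    """Average squared Euclidean distance from each point to the centroid.

    Assumption: this is the cohesion measure; lower means more cohesive.
    """
    c = centroid(points)
    total = sum(sum((x - m) ** 2 for x, m in zip(p, c)) for p in points)
    return total / len(points)


def most_cohesive_cluster(clusters: Dict[str, List[Vector]]) -> str:
    """Return the label of the cluster with the smallest spread."""
    return min(clusters, key=lambda k: mean_sq_dist_to_centroid(clusters[k]))


# Toy 2-D "embeddings" for three hypothetical clusters.
clusters = {
    "head": [(0.0, 0.0), (0.1, 0.0), (0.0, 0.1)],
    "faceless": [(1.0, 1.0), (2.0, 1.0), (1.0, 3.0)],
    "full_body": [(5.0, 5.0), (5.5, 5.0), (5.0, 6.0)],
}
chosen = most_cohesive_cluster(clusters)
```

Here `chosen` is the label of the tightest toy cluster. In a real pipeline the vectors would come from an image feature extractor and the clusters from a clustering algorithm, neither of which is shown here.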


Figure 23. Dataset non-memorization. We found the top 5 nearest neighbors in the LAION-5B dataset [73], in terms of CLIP [61] image similarity, for a few representative characters from our paper, using an open-source solution [68]. As can be seen, our method does not simply memorize images from the LAION-5B dataset.
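The nearest-neighbor lookup mentioned in this caption can be sketched as a brute-force search by cosine similarity. This is only an illustration under stated assumptions: the caption says an open-source solution was used over LAION-5B, whereas the sketch below scans a tiny in-memory list. The function names and toy vectors are hypothetical, and all vectors are assumed to be non-zero.

```python
import math
from typing import List, Sequence, Tuple

Vector = Sequence[float]


def cosine_similarity(a: Vector, b: Vector) -> float:
    """Cosine of the angle between two non-zero vectors of equal length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def top_k_neighbors(
    query: Vector, dataset: List[Vector], k: int = 5
) -> List[Tuple[int, float]]:
    """Return (index, similarity) pairs for the k most similar dataset items."""
    scored = [(i, cosine_similarity(query, e)) for i, e in enumerate(dataset)]
    # Highest similarity first; ties keep their original dataset order.
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:k]


# Toy 2-D vectors standing in for image embeddings.
query = (1.0, 0.0)
dataset = [
    (1.0, 0.0),
    (0.0, 1.0),
    (1.0, 1.0),
    (-1.0, 0.0),
    (2.0, 0.1),
    (1.0, 3.0),
    (0.0, -1.0),
]
neighbors = top_k_neighbors(query, dataset, k=3)
```

A brute-force scan like this is only practical for small collections; a dataset on the scale of LAION-5B would need an approximate nearest-neighbor index, which this sketch does not attempt to model.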


Figure 24. Our method using a Stable Diffusion v2.1 backbone. We experimented with a version of our method that uses the Stable Diffusion v2.1 [69] model. As can be seen, our method can extract a consistent character. However, as expected, the results are of lower quality than those obtained with the SDXL [57] backbone that we use in the rest of this paper.


This paper is available on arXiv under the CC BY-NC-ND 4.0 DEED license.