(1) Yuxuan Yan, Tencent with Equal contributions and [email protected];
(2) Chi Zhang, Tencent with Equal contributions and Corresponding Author, [email protected];
(3) Rui Wang, Tencent and [email protected];
(4) Yichao Zhou, Tencent and [email protected];
(5) Gege Zhang, Tencent and [email protected];
(6) Pei Cheng, Tencent and [email protected];
(7) Bin Fu, Tencent and [email protected];
(8) Gang Yu, Tencent and [email protected].
3. Method and 3.1. Hybrid Guidance Strategy
3.2. Handling Multiple Identities
4. Experiments
This study investigates identity-preserving image synthesis, an intriguing task in image generation that seeks to maintain a subject’s identity while adding a personalized, stylistic touch. Traditional methods, such as Textual Inversion and DreamBooth, have made strides in custom image creation, but they come with significant drawbacks. These include the need for extensive resources and time for fine-tuning, as well as the requirement for multiple reference images. To overcome these challenges, our research introduces a novel approach to identity-preserving synthesis, with a particular focus on human images. Our model leverages a direct feed-forward mechanism, circumventing the need for intensive fine-tuning and thereby facilitating quick and efficient image generation. Central to our innovation is a hybrid guidance framework, which combines stylized images, facial images, and textual prompts to guide the image generation process. This unique combination enables our model to support a variety of applications, such as artistic portraits and identity-blended images. Our experimental results, including both qualitative and quantitative evaluations, demonstrate the superiority of our method over existing baseline models and previous works, particularly in its remarkable efficiency and ability to preserve the subject’s identity with high fidelity.
In recent years, artificial intelligence (AI) has driven significant progress in the domain of creativity, leading to transformative changes across various applications. In particular, text-to-image diffusion models [41, 42] have emerged as a notable development, capable of converting textual descriptions into visually appealing images in a wide range of styles. Such advancements have paved the way for numerous applications that were once considered beyond the realm of possibility.
However, despite these advancements, several challenges remain. One of the most prominent is the difficulty existing text-to-image diffusion models face in accurately capturing and describing an existing subject from textual descriptions alone. This limitation becomes even more evident when detailed nuances, such as human facial features, are the subject of generation. Consequently, there is rising interest in identity-preserving image synthesis, which relies on more than just textual cues. In comparison to standard text-to-image generation, it integrates reference images into the generative process, enhancing the ability of models to produce images tailored to individual preferences.
To meet this need, a number of identity-preserving synthesis methods have been proposed, with techniques such as DreamBooth [43] and Textual Inversion [14] leading the way. They primarily focus on adjusting pre-trained text-to-image models to align more closely with user-defined concepts using reference images. However, these methods come with their own limitations. The fine-tuning process, essential to these methods, is resource-intensive and time-consuming, often demanding significant computational power and human intervention. Moreover, the requirement for multiple reference images for accurate model fitting poses additional challenges.
In light of these constraints, our research introduces a novel approach focusing on identity-preserving image synthesis, especially for human images. Our model, in contrast to existing ones, adopts a direct feed-forward approach, eliminating cumbersome fine-tuning steps and offering rapid and efficient image generation. Central to our model is a hybrid guidance module, which guides the image generation of the latent diffusion model. This module not only considers textual prompts as conditions for image synthesis but also integrates additional information from the style image and the identity image. By employing this hybrid-guided strategy, our framework places additional emphasis on the identity details of a given human image during generation. To effectively manage images with multiple identities, we develop a multi-identity cross-attention mechanism, which enables the model to associate guidance details from different identities with specific human regions within an image.
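To make the multi-identity idea concrete, below is a minimal sketch of a masked cross-attention layer in PyTorch. It is our own illustrative assumption of how per-identity guidance could be routed, not the authors' released implementation: the module name, tensor shapes, and the hard region masks are all hypothetical.

```python
import torch
import torch.nn as nn


class MultiIdentityCrossAttention(nn.Module):
    """Cross-attention in which each identity's tokens only affect its own region (illustrative sketch)."""

    def __init__(self, latent_dim: int, id_dim: int):
        super().__init__()
        self.scale = latent_dim ** -0.5
        self.to_q = nn.Linear(latent_dim, latent_dim, bias=False)
        self.to_k = nn.Linear(id_dim, latent_dim, bias=False)
        self.to_v = nn.Linear(id_dim, latent_dim, bias=False)
        self.to_out = nn.Linear(latent_dim, latent_dim)

    def forward(self, latents, id_tokens, region_masks):
        # latents:      (B, N, latent_dim)  flattened U-Net feature map
        # id_tokens:    (B, I, T, id_dim)   T guidance tokens for each of I identities
        # region_masks: (B, I, N)           1 where spatial position n belongs to identity i
        B, N, _ = latents.shape
        _, I, T, _ = id_tokens.shape

        q = self.to_q(latents)                               # (B, N, D)
        k = self.to_k(id_tokens.flatten(1, 2))                # (B, I*T, D)
        v = self.to_v(id_tokens.flatten(1, 2))                # (B, I*T, D)

        # Similarity between every spatial position and every identity token.
        attn = torch.einsum("bnd,bmd->bnm", q, k) * self.scale    # (B, N, I*T)

        # Restrict identity i's tokens to the spatial positions inside region i.
        mask = region_masks.unsqueeze(-1).expand(B, I, N, T)      # (B, I, N, T)
        mask = mask.permute(0, 2, 1, 3).reshape(B, N, I * T)      # (B, N, I*T)
        attn = attn.masked_fill(mask == 0, float("-inf"))

        attn = torch.nan_to_num(attn.softmax(dim=-1))  # rows with no identity become all zeros
        out = torch.einsum("bnm,bmd->bnd", attn, v)
        return self.to_out(out)
```

In this sketch, spatial positions covered by no identity receive an all-zero output from the identity branch, so they would be shaped only by the usual text conditioning of the diffusion model.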
Our training method is intuitive yet effective: the model can be trained directly on human image datasets. By using images with the facial region masked out as the style input and the extracted face as the identity input, our model learns to reconstruct human images while highlighting identity features in the guidance. After training, our model synthesizes human images that retain the subject’s identity with high fidelity, without any further adjustment. A unique aspect of our method is its ability to superimpose a user’s facial features onto any stylistic image, such as a cartoon, enabling users to visualize themselves in diverse styles without compromising their identity. Additionally, our model excels at generating images that blend multiple identities when supplied with the respective reference photos. Fig. 1 shows some applications of our model.
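The training-pair construction described above can be sketched as follows. This is an illustrative outline under stated assumptions, not the authors' pipeline: `detect_faces` stands in for any off-the-shelf face detector, and blanking the face with black pixels is just one possible way to mask it.

```python
from PIL import Image


def build_training_pair(image_path: str, detect_faces):
    """Construct (style input, identity input, reconstruction target) from one human photo."""
    image = Image.open(image_path).convert("RGB")
    boxes = detect_faces(image)             # assumed: list of (left, top, right, bottom) boxes
    if not boxes:
        return None                         # skip photos with no detectable face

    left, top, right, bottom = boxes[0]

    # Identity input: the extracted face crop.
    identity_input = image.crop((left, top, right, bottom))

    # Style input: the same photo with the face region blanked out, so identity
    # information can only reach the model through the identity branch.
    style_input = image.copy()
    style_input.paste((0, 0, 0), (left, top, right, bottom))

    # Reconstruction target: the unmodified photo.
    return style_input, identity_input, image
```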
This paper’s contributions can be briefly summarized as follows:
• We present a tuning-free hybrid-guidance image generation framework capable of preserving human identities under various image styles.
• We develop a multi-identity cross-attention mechanism, which exhibits a distinct ability to map guidance details from multiple identities to specific human segments in an image.
• We provide comprehensive experimental results, both qualitative and quantitative, to highlight the superiority of our method over baseline models and existing works, especially in its unmatched efficiency.
This paper is available on arXiv under the CC0 1.0 DEED license.