Authors:
(1) Yuxuan Yan, Tencent with equal contributions and yuxuanyan@tencent.com;
(2) Chi Zhang, Tencent with equal contributions and corresponding author, johnczhang@tencent.com;
(3) Rui Wang, Tencent and raywwang@tencent.com;
(4) Yichao Zhou, Tencent and yichaozhou@tencent.com;
(5) Gege Zhang, Tencent and gretazhang@tencent.com;
(6) Pei Cheng, Tencent and peicheng@tencent.com;
(7) Bin Fu, Tencent and brianfu@tencent.com;
(8) Gang Yu, Tencent and skicyyu@tencent.com.

Table of Links
Abstract and 1 Introduction
2. Related Work
3. Method and 3.1. Hybrid Guidance Strategy
3.2. Handling Multiple Identities
3.3. Training
4. Experiments
4.1. Implementation details.
4.2. Results
5. Conclusion and References

4. Experiments

4.1. Implementation details.

The vision encoder in the image-conditioned branch of our model combines three CLIP [40] variants with different backbones: CLIP-ViT-L/14, CLIP-RN101, and CLIP-ViT-B/32. The outputs of the individual models are concatenated to produce the final output of the vision encoder.

For training, our approach primarily follows the DDPM configuration [20] as described in StableDiffusion [42], with a total of 1,000 denoising steps. For inference, we use the EulerA sampler [2] with 25 timesteps.

To support classifier-free guidance [19], we randomly drop the conditional embeddings for both style images and face images during training, with drop probabilities of 0.64 for style images and 0.1 for face images.

The primary training dataset is FFHQ [25], a face image dataset of 70,000 images. We also include a subset of the LAION dataset [46] so that the model retains the ability to generate generic, non-human images during finetuning. When a non-human image is sampled for training, the face embedding in the conditional branch is set to zero.

We train with a learning rate of 1e-6 on 8 A100 GPUs, with a batch size of 256, for 100,000 steps.

This paper is available on arXiv under the CC0 1.0 DEED license.
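The concatenation of the three CLIP outputs described in the implementation details can be illustrated with a minimal sketch. It is not the authors' code: the stub encoders, the helper names, and the embedding widths (768, 512, 512) are assumptions made for illustration, since the text does not state which CLIP feature layer is used or how wide it is.

```python
from typing import Callable, Dict, List

Encoder = Callable[[object], List[float]]

# Assumed output widths (not stated in the text). They follow the projected
# image-embedding sizes commonly reported for these CLIP backbones.
ASSUMED_DIMS: Dict[str, int] = {
    "CLIP-ViT-L/14": 768,
    "CLIP-RN101": 512,
    "CLIP-ViT-B/32": 512,
}


def make_stub_encoder(dim: int, fill: float) -> Encoder:
    """Hypothetical stand-in for a real CLIP image encoder."""
    def encode(image: object) -> List[float]:
        # A real encoder would compute features from the image.
        return [fill] * dim
    return encode


def encode_image(image: object, encoders: Dict[str, Encoder]) -> List[float]:
    """Concatenate the per-backbone embeddings into one feature vector."""
    combined: List[float] = []
    for name in ASSUMED_DIMS:  # fixed backbone order keeps the layout stable
        combined.extend(encoders[name](image))
    return combined


encoders = {
    name: make_stub_encoder(dim, fill=float(i))
    for i, (name, dim) in enumerate(ASSUMED_DIMS.items())
}
feature = encode_image(image=None, encoders=encoders)
print(len(feature))  # 768 + 512 + 512 = 1792 under the assumed widths
```

Under these assumed widths, the combined feature has 1,792 dimensions. With real encoders the same pattern would apply to tensors, concatenated along the feature axis.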
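The condition dropping used for classifier-free guidance, together with the zeroed face embedding for non-human samples, might look roughly like the sketch below. The drop probabilities (0.64 and 0.1) come from the text. Everything else is an assumption: the function names are hypothetical, "dropping" is modeled as replacing an embedding with zeros, and the two drops are drawn independently.

```python
import random
from typing import List, Tuple

P_DROP_STYLE = 0.64  # from the text: drop probability for style embeddings
P_DROP_FACE = 0.1    # from the text: drop probability for face embeddings


def zeros_like(vec: List[float]) -> List[float]:
    return [0.0] * len(vec)


def apply_condition_dropout(
    style_emb: List[float],
    face_emb: List[float],
    is_human: bool,
    rng: random.Random,
) -> Tuple[List[float], List[float]]:
    """Randomly drop conditions for one training sample.

    Assumption: a dropped embedding is replaced by zeros, and the style
    and face drops are sampled independently.
    """
    # Non-human samples always get a zero face embedding (as in the text).
    if not is_human or rng.random() < P_DROP_FACE:
        face_emb = zeros_like(face_emb)
    if rng.random() < P_DROP_STYLE:
        style_emb = zeros_like(style_emb)
    return style_emb, face_emb


# Empirical check of the drop rates on dummy embeddings.
rng = random.Random(0)
style, face = [1.0] * 4, [1.0] * 4
trials = 20_000
results = [apply_condition_dropout(style, face, True, rng) for _ in range(trials)]
style_rate = sum(s == zeros_like(style) for s, _ in results) / trials
face_rate = sum(f == zeros_like(face) for _, f in results) / trials
print(round(style_rate, 2), round(face_rate, 2))
```

Over many samples the empirical rates should land close to the configured probabilities. A real training loop would apply the same logic per sample inside a batch.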

FaceStudio: Put Your Face Everywhere in Seconds: Implementation Details.
