A Unified Approach to In-The-Wild Realistic Human Image Generation

Authors:
(1) Xian Liu, Snap Inc., CUHK (work done during an internship at Snap Inc.);
(2) Jian Ren, Snap Inc. (corresponding author: jren@snapchat.com);
(3) Aliaksandr Siarohin, Snap Inc.;
(4) Ivan Skorokhodov, Snap Inc.;
(5) Yanyu Li, Snap Inc.;
(6) Dahua Lin, CUHK;
(7) Xihui Liu, HKU;
(8) Ziwei Liu, NTU;
(9) Sergey Tulyakov, Snap Inc.

Table of Links
Abstract and 1 Introduction
2 Related Work
3 Our Approach and 3.1 Preliminaries and Problem Setting
3.2 Latent Structural Diffusion Model
3.3 Structure-Guided Refiner
4 Human Verse Dataset
5 Experiments
5.1 Main Results
5.2 Ablation Study
6 Discussion and References
A Appendix and A.1 Additional Quantitative Results
A.2 More Implementation Details and A.3 More Ablation Study Results
A.4 More User Study Details
A.5 Impact of Random Seed and Model Robustness and A.6 Broader Impact and Ethical Consideration
A.7 More Comparison Results and A.8 Additional Qualitative Results
A.9 Licenses

3 OUR APPROACH

We present HyperHuman, which generates in-the-wild human images of high realism and diverse layouts. The overall framework is illustrated in Fig. 2. To keep the content self-contained and the narration clear, we first introduce some prerequisites of diffusion models and the problem setting in Sec. 3.1. Then, we present the Latent Structural Diffusion Model, which simultaneously denoises the depth and surface-normal along with the RGB image. The explicit appearance and latent structure are thus jointly learned in a unified model (Sec. 3.2). Finally, we elaborate on the Structure-Guided Refiner, which composes the predicted conditions for detailed generation at higher resolution, in Sec. 3.3.

3.1 PRELIMINARIES AND PROBLEM SETTING

During inference, only the text prompt and body skeleton are needed to synthesize a well-aligned RGB image, depth, and surface-normal.
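The inference flow described in this section can be sketched as a two-stage data pipeline. The sketch below is only an illustration of inputs and outputs, not the authors' implementation: the names `Map`, `latent_structural_diffusion`, `structure_guided_refiner`, and `generate` are hypothetical, the models are replaced by stubs, the upscaling factor is an arbitrary placeholder, and passing the skeleton to the second stage is an assumption.

```python
from dataclasses import dataclass
from typing import Dict, Optional, Tuple


@dataclass(frozen=True)
class Map:
    """Stand-in for an image-like tensor; only metadata is tracked."""
    kind: str               # "rgb", "depth", "normal", or "skeleton"
    size: Tuple[int, int]   # (height, width)
    source: str             # "predicted" or "user"


def latent_structural_diffusion(prompt: str, skeleton: Map) -> Tuple[Map, Map, Map]:
    """Hypothetical first stage: jointly yields RGB, depth, and surface-normal."""
    # A real model would run joint denoising here; this stub mirrors the I/O only.
    size = skeleton.size
    return (
        Map("rgb", size, "predicted"),
        Map("depth", size, "predicted"),
        Map("normal", size, "predicted"),
    )


def structure_guided_refiner(prompt: str, skeleton: Map, depth: Map,
                             normal: Map, scale: int = 2) -> Map:
    """Hypothetical second stage (G2): composes the conditions into a
    higher-resolution RGB image. `scale` is an arbitrary placeholder."""
    height, width = skeleton.size
    return Map("rgb", (height * scale, width * scale), "predicted")


def generate(prompt: str, skeleton: Map,
             user_depth: Optional[Map] = None,
             user_normal: Optional[Map] = None) -> Dict[str, Map]:
    """Only the prompt and skeleton are required; depth and surface-normal
    come from the first stage unless the caller supplies its own."""
    _coarse_rgb, depth, normal = latent_structural_diffusion(prompt, skeleton)
    depth = user_depth if user_depth is not None else depth
    normal = user_normal if user_normal is not None else normal
    refined = structure_guided_refiner(prompt, skeleton, depth, normal)
    return {"rgb": refined, "depth": depth, "normal": normal}
```

Calling `generate("a person hiking", Map("skeleton", (64, 64), "user"))` exercises the default path, where both structural maps are predicted by the first stage; passing `user_depth` or `user_normal` swaps in a caller-provided condition before the second stage runs.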
Note that users are free to substitute their own depth and surface-normal conditions to G2 if applicable, enabling more flexible and controllable generation.

This paper is available on arxiv under CC BY 4.0 DEED license.