Authors:
(1) Xian Liu, Snap Inc., CUHK with Work done during an internship at Snap Inc.;
(2) Jian Ren, Snap Inc. with Corresponding author: jren@snapchat.com;
(3) Aliaksandr Siarohin, Snap Inc.;
(4) Ivan Skorokhodov, Snap Inc.;
(5) Yanyu Li, Snap Inc.;
(6) Dahua Lin, CUHK;
(7) Xihui Liu, HKU;
(8) Ziwei Liu, NTU;
(9) Sergey Tulyakov, Snap Inc. Table of Links Abstract and 1 Introduction 2 Related Work 3 Our Approach and 3.1 Preliminaries and Problem Setting 3.2 Latent Structural Diffusion Model 3.3 Structure-Guided Refiner 4 Human Verse Dataset 5 Experiments 5.1 Main Results 5.2 Ablation Study 6 Discussion and References A Appendix and A.1 Additional Quantitative Results A.2 More Implementation Details and A.3 More Ablation Study Results A.4 More User Study Details A.5 Impact of Random Seed and Model Robustness and A.6 Boarder Impact and Ethical Consideration A.7 More Comparison Results and A.8 Additional Qualitative Results A.9 Licenses A.2 MORE IMPLEMENTATION DETAILS We report implementation details like training hyper-parameters, and model architecture in Tab. 6. A.3 MORE ABLATION STUDY RESULTS We implement additional ablation study experiments on the second stage Structure-Guided Refiner. Note that due to the training resource limit and the resolution discrepancy between MS-COCO real images (512 × 512) and high-quality renderings (1024 × 1024), we conduct several toy ablation experiments in the lightweight 512 × 512 variant of our model: 1) w/o random dropout, where the all the input conditions are not dropout or masked out during the conditional training stage. 2) Only Text, where not any structural prediction is input to the model and we only use the text prompt as condition. 3) Condition on p, where we only use human pose skeleton p as input condition to the refiner network. 4) Condition on d that uses depth map d as input condition. 5) Condition on n that uses surface-normal n as input condition. And their combinations of 6) Condition on p, d; 7) Condition on p, n; 8) Condition on d, n, to verify the impact of each condition and the necessity of using such multi-level hierarchical structural guidance for fine-grained generation. The results are reported in Tab. 7. We can see that the random dropout conditioning scheme is crucial for more robust training with better image quality, especially in the two-stage generation pipeline. Besides, the structural map/guidance contains geometry and spatial relationship information, which are beneficial to image generation of higher quality. Another interesting phenomenon is that only conditioned on surface-normal n is better than conditioned on both the pose skeleton p and depth map d, which aligns with our intuition that surface-normal conveys rich structural information that mostly cover coarse-level skeleton and depth map, except for the keypoint location and foregroundbackground relationship. Overall, we can conclude from ablation results that: 1) Each condition (i.e., pose skeleton, depth map, and surface-normal) is important for higher-quality and more aligned generation, which validates the necessity of our first-stage Latent Structural Diffusion Model to jointly capture them. 2) The random dropout scheme for robust conditioning can essentially bridge the train-test error accumulation in two-stage pipeline, leading to better image results. This paper is available on arxiv under CC BY 4.0 DEED license. Authors: (1) Xian Liu, Snap Inc., CUHK with Work done during an internship at Snap Inc.; (2) Jian Ren, Snap Inc. with Corresponding author: jren@snapchat.com; (3) Aliaksandr Siarohin, Snap Inc.; (4) Ivan Skorokhodov, Snap Inc.; (5) Yanyu Li, Snap Inc.; (6) Dahua Lin, CUHK; (7) Xihui Liu, HKU; (8) Ziwei Liu, NTU; (9) Sergey Tulyakov, Snap Inc. Authors: Authors: (1) Xian Liu, Snap Inc., CUHK with Work done during an internship at Snap Inc.; (2) Jian Ren, Snap Inc. with Corresponding author: jren@snapchat.com; (3) Aliaksandr Siarohin, Snap Inc.; (4) Ivan Skorokhodov, Snap Inc.; (5) Yanyu Li, Snap Inc.; (6) Dahua Lin, CUHK; (7) Xihui Liu, HKU; (8) Ziwei Liu, NTU; (9) Sergey Tulyakov, Snap Inc. Table of Links Abstract and 1 Introduction Abstract and 1 Introduction 2 Related Work 2 Related Work 3 Our Approach and 3.1 Preliminaries and Problem Setting 3 Our Approach and 3.1 Preliminaries and Problem Setting 3.2 Latent Structural Diffusion Model 3.2 Latent Structural Diffusion Model 3.3 Structure-Guided Refiner 3.3 Structure-Guided Refiner 4 Human Verse Dataset 4 Human Verse Dataset 5 Experiments 5 Experiments 5.1 Main Results 5.1 Main Results 5.2 Ablation Study 5.2 Ablation Study 6 Discussion and References 6 Discussion and References A Appendix and A.1 Additional Quantitative Results A Appendix and A.1 Additional Quantitative Results A.2 More Implementation Details and A.3 More Ablation Study Results A.2 More Implementation Details and A.3 More Ablation Study Results A.4 More User Study Details A.4 More User Study Details A.5 Impact of Random Seed and Model Robustness and A.6 Boarder Impact and Ethical Consideration A.5 Impact of Random Seed and Model Robustness and A.6 Boarder Impact and Ethical Consideration A.7 More Comparison Results and A.8 Additional Qualitative Results A.7 More Comparison Results and A.8 Additional Qualitative Results A.9 Licenses A.9 Licenses A.2 MORE IMPLEMENTATION DETAILS We report implementation details like training hyper-parameters, and model architecture in Tab. 6. A.3 MORE ABLATION STUDY RESULTS We implement additional ablation study experiments on the second stage Structure-Guided Refiner. Note that due to the training resource limit and the resolution discrepancy between MS-COCO real images (512 × 512) and high-quality renderings (1024 × 1024), we conduct several toy ablation experiments in the lightweight 512 × 512 variant of our model: 1) w/o random dropout, where the all the input conditions are not dropout or masked out during the conditional training stage. 2) Only Text, where not any structural prediction is input to the model and we only use the text prompt as condition. 3) Condition on p, where we only use human pose skeleton p as input condition to the refiner network. 4) Condition on d that uses depth map d as input condition. 5) Condition on n that uses surface-normal n as input condition. And their combinations of 6) Condition on p, d; 7) Condition on p, n; 8) Condition on d, n, to verify the impact of each condition and the necessity of using such multi-level hierarchical structural guidance for fine-grained generation. The results are reported in Tab. 7. We can see that the random dropout conditioning scheme is crucial for more robust training with better image quality, especially in the two-stage generation pipeline. Besides, the structural map/guidance contains geometry and spatial relationship information, which are beneficial to image generation of higher quality. Another interesting phenomenon is that only conditioned on surface-normal n is better than conditioned on both the pose skeleton p and depth map d, which aligns with our intuition that surface-normal conveys rich structural information that mostly cover coarse-level skeleton and depth map, except for the keypoint location and foregroundbackground relationship. Overall, we can conclude from ablation results that: 1) Each condition (i.e., pose skeleton, depth map, and surface-normal) is important for higher-quality and more aligned generation, which validates the necessity of our first-stage Latent Structural Diffusion Model to jointly capture them. 2) The random dropout scheme for robust conditioning can essentially bridge the train-test error accumulation in two-stage pipeline, leading to better image results. This paper is available on arxiv under CC BY 4.0 DEED license. This paper is available on arxiv under CC BY 4.0 DEED license. available on arxiv

Part of HackerNoon's growing list of open-source research papers, promoting free access to academic material.

How Pose, Depth, and Surface-Normal Impact HyperHuman’s Image Quality

About Author

Comments

TOPICS

THIS ARTICLE WAS FEATURED IN

Related Stories

A Unified Approach to In-The-Wild Realistic Human Image Generation

The Noonification: Subjectivity and the Evolution of AI Philosophy (11/22/2023)

The Noonification: The State of Webhooks in 2023 (10/28/2023)

The Noonification: A Game-Changing Leap in Voice AI Technology (10/22/2023)

The Noonification: Go and Protocol Buffers (Quick Tutorial) (10/15/2023)

The Noonification: Migrating from WebGL to WebGPU (12/20/2023)

A Unified Approach to In-The-Wild Realistic Human Image Generation

The Noonification: Subjectivity and the Evolution of AI Philosophy (11/22/2023)

The Noonification: The State of Webhooks in 2023 (10/28/2023)

The Noonification: A Game-Changing Leap in Voice AI Technology (10/22/2023)

The Noonification: Go and Protocol Buffers (Quick Tutorial) (10/15/2023)

The Noonification: Migrating from WebGL to WebGPU (12/20/2023)

Light-Mode

Classic

Newspaper

Minty

Dark-Mode

Neon Noir

Minty

HN StartUps