Authors:
(1) Xian Liu, Snap Inc., CUHK with Work done during an internship at Snap Inc.;
(2) Jian Ren, Snap Inc. with Corresponding author: [email protected];
(3) Aliaksandr Siarohin, Snap Inc.;
(4) Ivan Skorokhodov, Snap Inc.;
(5) Yanyu Li, Snap Inc.;
(6) Dahua Lin, CUHK;
(7) Xihui Liu, HKU;
(8) Ziwei Liu, NTU;
(9) Sergey Tulyakov, Snap Inc.
3 Our Approach and 3.1 Preliminaries and Problem Setting
3.2 Latent Structural Diffusion Model
A Appendix and A.1 Additional Quantitative Results
A.2 More Implementation Details and A.3 More Ablation Study Results
A.5 Impact of Random Seed and Model Robustness and A.6 Boarder Impact and Ethical Consideration
A.7 More Comparison Results and A.8 Additional Qualitative Results
We report implementation details like training hyper-parameters, and model architecture in Tab. 6.
We implement additional ablation study experiments on the second stage Structure-Guided Refiner. Note that due to the training resource limit and the resolution discrepancy between MS-COCO real images (512 × 512) and high-quality renderings (1024 × 1024), we conduct several toy ablation experiments in the lightweight 512 × 512 variant of our model: 1) w/o random dropout, where the all the input conditions are not dropout or masked out during the conditional training stage. 2) Only Text, where not any structural prediction is input to the model and we only use the text prompt as condition. 3) Condition on p, where we only use human pose skeleton p as input condition to the refiner network. 4) Condition on d that uses depth map d as input condition. 5) Condition on n that uses surface-normal n as input condition. And their combinations of 6) Condition on p, d; 7) Condition on p, n; 8) Condition on d, n, to verify the impact of each condition and the necessity of using such multi-level hierarchical structural guidance for fine-grained generation. The results are reported in Tab. 7. We can see that the random dropout conditioning scheme is crucial for more robust training with better image quality, especially in the two-stage generation pipeline. Besides, the structural map/guidance contains geometry and spatial relationship information, which are beneficial to image generation of higher quality. Another interesting phenomenon is that only conditioned on surface-normal n is better than conditioned on both the pose skeleton p and depth map d, which aligns with our intuition that surface-normal conveys rich structural information that mostly cover coarse-level skeleton and depth map, except for the keypoint location and foregroundbackground relationship. Overall, we can conclude from ablation results that: 1) Each condition (i.e., pose skeleton, depth map, and surface-normal) is important for higher-quality and more aligned generation, which validates the necessity of our first-stage Latent Structural Diffusion Model to jointly capture them. 2) The random dropout scheme for robust conditioning can essentially bridge the train-test error accumulation in two-stage pipeline, leading to better image results.
This paper is available on arxiv under CC BY 4.0 DEED license.