Authors:
(1) Luyang Zhu, University of Washington and Google Research, and work done while the author was an intern at Google;
(2) Dawei Yang, Google Research;
(3) Tyler Zhu, Google Research;
(4) Fitsum Reda, Google Research;
(5) William Chan, Google Research;
(6) Chitwan Saharia, Google Research;
(7) Mohammad Norouzi, Google Research;
(8) Ira Kemelmacher-Shlizerman, University of Washington and Google Research.

Table of Links

Abstract and 1. Introduction
2. Related Work
3. Method
3.1. Cascaded Diffusion Models for Try-On
3.2. Parallel-UNet
4. Experiments
5. Summary and Future Work and References

Appendix
A. Implementation Details
B. Additional Results

B. Additional Results

In Fig. 9 and 10, we provide a qualitative comparison to state-of-the-art methods on challenging cases. We select input pairs from our 6K test dataset with heavy occlusions and extreme differences in body pose and shape. Our method generates more realistic results than the baselines.

In Fig. 11 and 12, we provide a qualitative comparison to state-of-the-art methods on simple cases. We select input pairs from our 6K test dataset with minimal garment warp and simple texture patterns. Baseline methods perform better on simple cases than on challenging cases. However, our method is still better at preserving garment detail and at blending the person and the garment.

In Fig. 13, we provide more qualitative results on the VITON-HD unpaired test dataset.

For a fair comparison, we run a new user study comparing SDAFN [2] with our method at SDAFN's 256 × 256 resolution. To generate a 256 × 256 image with our method, we run inference only on the first two stages of our cascaded diffusion models and skip the 256×256→1024×1024 SR diffusion. Table 3 shows results consistent with the user study reported in the paper.

We also compare to HR-VITON [25] using its released checkpoints. Note that the original HR-VITON is trained on frontal garment images, so we select input garments that satisfy this constraint to avoid an unfair comparison. Fig. 16 shows that our method is still better than HR-VITON even in the cases that are optimal for it.

Table 4 reports quantitative results for the ablation studies. Fig. 14 visualizes more examples for the ablation study of combining warp and blend versus sequencing the tasks. Fig. 15 provides more qualitative comparisons between concatenation and cross attention for implicit warping.

We further investigate the effect of the training dataset size. We retrained our method from scratch on 10K and 100K random pairs from our 4M set and report quantitative results (FID and KID) on two different test sets in Table 5. Fig. 17 also shows visual results for our models trained on different dataset sizes.

In Fig. 6 of the main paper, we provide failure cases due to erroneous garment segmentation and garment leaks in the clothing-agnostic RGB image. In Fig. 18, we provide more failure cases of our method. The main problem lies in the clothing-agnostic RGB image. Specifically, it removes part of the identity information from the target person, e.g., tattoos (row one), muscle structure (row two), fine hair on the skin (row two) and accessories (row three). To better visualize the difference in person identity, Fig. 19 provides try-on results on paired unseen test samples, where ground truth is available.

Fig. 20 shows try-on results for a challenging case, where the input person wears a garment with no folds and the input garment has folds. Our method generates realistic folds according to the person's pose instead of copying folds from the garment input. Fig. 21 and 22 show TryOnDiffusion results on a variety of people and garments for both men and women. Finally, Fig. 23 to 28 provide zoom-in visualizations for Fig. 1 of the main paper, demonstrating the high-quality results of our method.

This paper is available on arXiv under the CC BY-NC-ND 4.0 DEED license.
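As a note on the KID metric reported in Table 5: the appendix does not say how KID is computed, so the following is only a minimal sketch of the commonly used definition, an unbiased squared-MMD estimate with a cubic polynomial kernel. The function names and the random stand-in features are assumptions for illustration; in practice the inputs would be image features from a pretrained network, and implementations often average the estimate over several subsets.

```python
import numpy as np


def polynomial_kernel(a, b):
    """Cubic polynomial kernel k(x, y) = (x.y / d + 1)^3, with d the feature size."""
    d = a.shape[1]
    return (a @ b.T / d + 1.0) ** 3


def kid_unbiased(real_feats, fake_feats):
    """Unbiased squared-MMD estimate between two feature sets (rows are samples)."""
    m, n = len(real_feats), len(fake_feats)
    k_rr = polynomial_kernel(real_feats, real_feats)
    k_ff = polynomial_kernel(fake_feats, fake_feats)
    k_rf = polynomial_kernel(real_feats, fake_feats)
    # Within-set terms exclude the diagonal (a sample paired with itself).
    term_rr = (k_rr.sum() - np.trace(k_rr)) / (m * (m - 1))
    term_ff = (k_ff.sum() - np.trace(k_ff)) / (n * (n - 1))
    # The cross term averages over all real/fake pairs.
    term_rf = k_rf.mean()
    return term_rr + term_ff - 2.0 * term_rf


# Hypothetical stand-in features; real usage would extract them from images.
rng = np.random.default_rng(0)
feats_real = rng.normal(size=(200, 16))
feats_close = rng.normal(size=(200, 16))          # same distribution as feats_real
feats_far = rng.normal(loc=1.0, size=(200, 16))   # shifted distribution

print("KID, same distribution:   ", kid_unbiased(feats_real, feats_close))
print("KID, shifted distribution:", kid_unbiased(feats_real, feats_far))
```

Lower values indicate that the two feature distributions are closer. Because the estimator is unbiased, it can come out slightly negative when the two sets are drawn from the same distribution.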
