Authors:
(1) Xian Liu, Snap Inc., CUHK (work done during an internship at Snap Inc.);
(2) Jian Ren, Snap Inc. (corresponding author: jren@snapchat.com);
(3) Aliaksandr Siarohin, Snap Inc.;
(4) Ivan Skorokhodov, Snap Inc.;
(5) Yanyu Li, Snap Inc.;
(6) Dahua Lin, CUHK;
(7) Xihui Liu, HKU;
(8) Ziwei Liu, NTU;
(9) Sergey Tulyakov, Snap Inc.

Table of Links

Abstract and 1 Introduction
2 Related Work
3 Our Approach and 3.1 Preliminaries and Problem Setting
3.2 Latent Structural Diffusion Model
3.3 Structure-Guided Refiner
4 HumanVerse Dataset
5 Experiments
5.1 Main Results
5.2 Ablation Study
6 Discussion and References
A Appendix and A.1 Additional Quantitative Results
A.2 More Implementation Details and A.3 More Ablation Study Results
A.4 More User Study Details
A.5 Impact of Random Seed and Model Robustness and A.6 Broader Impact and Ethical Consideration
A.7 More Comparison Results and A.8 Additional Qualitative Results
A.9 Licenses

4 HUMANVERSE DATASET

Large-scale datasets with high-quality samples, rich annotations, and diverse distribution are crucial for image generation tasks (Schuhmann et al., 2022; Podell et al., 2023), especially in the human domain (Liu et al., 2016; Jiang et al., 2023). To facilitate high-fidelity controllable human generation, we establish a comprehensive human dataset with extensive annotations named HumanVerse. Please refer to Appendix A.9 for more details about the dataset and annotation resources we use.

Dataset Preprocessing. We curate from two principal datasets: LAION-2B-en (Schuhmann et al., 2022) and COYO-700M (Byeon et al., 2022). To isolate human images, we employ YOLOS (Fang et al., 2021) for human detection. Specifically, only images containing 1 to 3 human bounding boxes are retained, and the people must be visible with an area ratio exceeding 15%. We further rule out samples with poor aesthetics (score < 4.5) or low resolution (< 200 × 200). This yields a high-quality subset by eliminating images with blurry or overly small humans. Unlike existing models that mostly train on full-body humans in simple contexts (Zhang & Agrawala, 2023), our dataset encompasses a wider spectrum, including various backgrounds and partial human regions such as clothing and limbs.

2D Human Poses. 2D human poses (skeletons of joint keypoints), which serve as one of the most flexible and most easily obtainable coarse-level condition signals, are widely used in controllable human generation studies (Ju et al., 2023b). To achieve accurate keypoint annotations, we use MMPose (Contributors, 2020) as the inference interface and choose ViTPose-H (Xu et al., 2022) as the backbone, which performs best across several pose estimation benchmarks. In particular, the per-instance bounding box, keypoint coordinates, and confidence are labeled, including the whole-body skeleton (133 keypoints), body skeleton (17 keypoints), hand (21 keypoints), and facial landmarks (68 keypoints).

Depth and Surface-Normal Maps are fine-grained structures that reflect the spatial geometry of images (Wu et al., 2022) and are commonly used in conditional generation (Mou et al., 2023). We apply Omnidata (Eftekhar et al., 2021) for monocular depth and normal estimation. MiDaS depth (Ranftl et al., 2022) is additionally annotated, following recent depth-to-image pipelines (Rombach et al., 2022).

Outpaint for Accurate Annotations. Diffusion models have shown promising results on image inpainting and outpainting, where the appearance and structure of unseen regions can be hallucinated based on the visible parts. Motivated by this, we propose to outpaint each image for a more holistic view, given that most off-the-shelf structure estimators are trained on "complete" image views. Although the outpainted region may contain artifacts, it can help complete a more comprehensive human structure. To this end, we use SD-Inpaint to outpaint the surrounding areas of the original canvas. These images are then processed by off-the-shelf estimators, and we only use the labels within the original image region for more accurate annotations.

Overall Statistics. In summary, the COYO subset contains 90,948,474 (91M) images and the LAION-2B subset contains 248,396,109 (248M) images, which are 18.12% and 20.77% of the full sets, respectively. The whole annotation process takes two weeks on 640 NVIDIA V100 GPUs (16 GB/32 GB) running in parallel.

This paper is available on arXiv under the CC BY 4.0 DEED license.
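The filtering rules described under Dataset Preprocessing can be illustrated with a minimal sketch. The thresholds (1 to 3 human boxes, area ratio above 15%, aesthetic score of at least 4.5, resolution of at least 200 × 200) come from the text; everything else is an assumption made for illustration. In particular, the `Sample` class and the `keep_sample` function are hypothetical names, the area ratio is assumed to be the summed box area divided by the image area, and the resolution check is assumed to apply to both sides. The detection boxes and aesthetic scores are taken as already computed.

```python
from dataclasses import dataclass
from typing import List, Tuple

# (x_min, y_min, x_max, y_max) in pixels; a hypothetical box format.
Box = Tuple[float, float, float, float]


@dataclass
class Sample:
    """Hypothetical per-image record holding precomputed detector outputs."""
    width: int
    height: int
    human_boxes: List[Box]
    aesthetic_score: float


def human_area_ratio(sample: Sample) -> float:
    """Summed human box area over image area (assumed definition of the ratio)."""
    image_area = sample.width * sample.height
    box_area = sum(
        max(0.0, x1 - x0) * max(0.0, y1 - y0)
        for x0, y0, x1, y1 in sample.human_boxes
    )
    return box_area / image_area


def keep_sample(sample: Sample) -> bool:
    """Apply the four filtering criteria stated in the text."""
    if not 1 <= len(sample.human_boxes) <= 3:
        return False  # keep only images with 1 to 3 detected humans
    if human_area_ratio(sample) <= 0.15:
        return False  # humans must cover more than 15% of the image
    if sample.aesthetic_score < 4.5:
        return False  # drop samples with poor aesthetics
    if sample.width < 200 or sample.height < 200:
        return False  # drop low-resolution images (assumed: either side)
    return True
```

A sample such as `Sample(512, 512, [(0, 0, 256, 256)], 5.0)` passes all four checks under these assumptions, while the same image with four detected boxes would be rejected.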
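The keypoint counts listed under 2D Human Poses are consistent with the COCO-WholeBody layout, in which 17 body, 6 foot, 68 face, and 2 × 21 hand keypoints add up to 133. The sketch below splits a whole-body skeleton into those groups. It assumes the annotations follow the COCO-WholeBody index order, which the text does not state explicitly; `WHOLEBODY_GROUPS` and `split_wholebody` are hypothetical names introduced here.

```python
from typing import Dict, List, Sequence, Tuple

# One keypoint as (x, y, confidence); a hypothetical storage format.
Keypoint = Tuple[float, float, float]

# Half-open index ranges, assuming the COCO-WholeBody keypoint order.
WHOLEBODY_GROUPS: Dict[str, Tuple[int, int]] = {
    "body": (0, 17),          # 17 body keypoints
    "feet": (17, 23),         # 6 foot keypoints
    "face": (23, 91),         # 68 facial landmarks
    "left_hand": (91, 112),   # 21 keypoints
    "right_hand": (112, 133), # 21 keypoints
}


def split_wholebody(keypoints: Sequence[Keypoint]) -> Dict[str, List[Keypoint]]:
    """Split a 133-keypoint whole-body skeleton into named groups."""
    if len(keypoints) != 133:
        raise ValueError(f"expected 133 keypoints, got {len(keypoints)}")
    return {
        name: list(keypoints[start:end])
        for name, (start, end) in WHOLEBODY_GROUPS.items()
    }
```

Storing the groups as index ranges keeps a single whole-body array as the source of truth, so the body-only or hand-only skeletons can be derived on demand rather than annotated separately.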
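The outpaint-then-annotate procedure can be sketched as three steps: enlarge the canvas, run the estimator on the completed image, and keep only the labels inside the original region. This is a simplified illustration, not the authors' pipeline. Images are plain nested lists, the symmetric padding width `pad` is an assumed parameter, and `outpaint` and `estimate` are hypothetical callables standing in for SD-Inpaint and an off-the-shelf structure estimator.

```python
from typing import Callable, List, Tuple

Grid = List[List[float]]  # a single-channel image or annotation map


def pad_canvas(image: Grid, pad: int, fill: float = 0.0) -> Tuple[Grid, Grid]:
    """Place the image at the center of a larger canvas.

    Returns the canvas and a mask where 1 marks pixels to be outpainted
    and 0 marks original pixels.
    """
    h, w = len(image), len(image[0])
    new_h, new_w = h + 2 * pad, w + 2 * pad
    canvas = [[fill] * new_w for _ in range(new_h)]
    mask = [[1.0] * new_w for _ in range(new_h)]
    for y in range(h):
        for x in range(w):
            canvas[y + pad][x + pad] = image[y][x]
            mask[y + pad][x + pad] = 0.0
    return canvas, mask


def crop_to_original(annotation: Grid, pad: int) -> Grid:
    """Keep only the labels that fall inside the original image region."""
    if pad == 0:
        return [row[:] for row in annotation]
    return [row[pad:-pad] for row in annotation[pad:-pad]]


def annotate_with_outpainting(
    image: Grid,
    pad: int,
    outpaint: Callable[[Grid, Grid], Grid],  # hypothetical outpainting model
    estimate: Callable[[Grid], Grid],        # hypothetical structure estimator
) -> Grid:
    """Outpaint, estimate on the full view, then crop back to the original."""
    canvas, mask = pad_canvas(image, pad)
    completed = outpaint(canvas, mask)
    full_annotation = estimate(completed)
    return crop_to_original(full_annotation, pad)
```

Cropping after estimation means any artifacts in the outpainted border only influence the labels indirectly, through the extra context they give the estimator.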
HumanVerse: A Richly Annotated Dataset for Realistic, Controllable Human Image Synthesis
