Table of Links

- Related Work
- SAGE: Stratified Avatar Generation and 3.1. Problem Statement and Notation
- 3.2. Disentangled Motion Representation
- Experiments and 4.1. Dataset and Evaluation Metrics
Supplementary Material
B. Implementation Details
B.1 Disentangled VQ-VAE
B.2 Stratified Diffusion
In our transformer-based models for upper-body and lower-body diffusion, we integrate additional DiT blocks as described in [29]. Each model features 12 DiT blocks, each with 8 attention heads and an input embedding dimension of 512. The full-body decoder is structured with 6 transformer layers.
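For concreteness, the sketch below shows one way this configuration could look in PyTorch. The class names (`DiTBlock`, `BodyDiffusionBackbone`) are hypothetical, the timestep/condition pathways are omitted, and the adaptive-norm conditioning of the actual DiT blocks [29] is reduced to plain pre-norm self-attention; only the depth, head count, and embedding dimension follow the text.

```python
import torch
import torch.nn as nn


class DiTBlock(nn.Module):
    """Transformer block in the spirit of DiT [29].

    The adaptive conditioning of the original DiT block is simplified
    to plain pre-norm self-attention; only the sizes match the paper.
    """

    def __init__(self, dim: int = 512, num_heads: int = 8, mlp_ratio: float = 4.0):
        super().__init__()
        self.norm1 = nn.LayerNorm(dim)
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.norm2 = nn.LayerNorm(dim)
        self.mlp = nn.Sequential(
            nn.Linear(dim, int(dim * mlp_ratio)),
            nn.GELU(),
            nn.Linear(int(dim * mlp_ratio), dim),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # Self-attention and MLP sub-layers, each with a residual connection.
        h = self.norm1(x)
        x = x + self.attn(h, h, h, need_weights=False)[0]
        return x + self.mlp(self.norm2(x))


class BodyDiffusionBackbone(nn.Module):
    """Stack of 12 DiT-style blocks (embedding dim 512, 8 heads each).

    One such backbone would be instantiated for the upper-body diffusion
    model and one for the lower-body diffusion model.
    """

    def __init__(self, dim: int = 512, depth: int = 12, num_heads: int = 8):
        super().__init__()
        self.blocks = nn.ModuleList([DiTBlock(dim, num_heads) for _ in range(depth)])

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        for block in self.blocks:
            x = block(x)
        return x


if __name__ == "__main__":
    tokens = torch.randn(2, 64, 512)  # (batch, motion tokens, embedding dim)
    print(BodyDiffusionBackbone()(tokens).shape)  # torch.Size([2, 64, 512])
```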
B.3 Refiner
The complete loss for training the refiner is a weighted combination of four terms with weights α, β, γ, and δ.
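A minimal sketch of such a four-term weighted loss, assuming a reconstruction term, a smoothness term, and two auxiliary terms (these subscripts are placeholders, not the paper's notation):

```latex
\[
\mathcal{L}_{\text{refiner}}
  = \alpha\,\mathcal{L}_{\text{rec}}
  + \beta\,\mathcal{L}_{\text{smooth}}
  + \gamma\,\mathcal{L}_{\text{aux}_1}
  + \delta\,\mathcal{L}_{\text{aux}_2}
\]
```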
We set α, β, γ, and δ to 0.01, 10, 0.05, and 0.01, respectively, to force the refiner to focus more on motion smoothness during training.
All experiments can be carried out on a single NVIDIA GeForce RTX 3090 GPU using the PyTorch framework.
Authors:
(1) Han Feng, Wuhan University (equal contribution, authors ordered alphabetically);
(2) Wenchao Ma, Pennsylvania State University (equal contribution, authors ordered alphabetically);
(3) Quankai Gao, University of Southern California;
(4) Xianwei Zheng, Wuhan University;
(5) Nan Xue, Ant Group ([email protected]);
(6) Huijuan Xu, Pennsylvania State University.
This paper is available on arXiv.