Table of Links
Abstract and 1. Introduction
Related Work
2.1. Motion Reconstruction from Sparse Input
2.2. Human Motion Generation
SAGE: Stratified Avatar Generation and 3.1. Problem Statement and Notation
3.2. Disentangled Motion Representation
3.3. Stratified Motion Diffusion
3.4. Implementation Details
Experiments
4.1. Dataset and Evaluation Metrics
4.2. Quantitative and Qualitative Results
4.3. Ablation Study
Conclusion and References
Supplementary Material
A. Extra Ablation Studies
B. Implementation Details

2. Related Work

2.1. Motion Reconstruction from Sparse Input

The task of reconstructing full human body motion from sparse observations has received significant attention from the research community in recent decades [1, 3, 5, 7, 10, 11, 16, 18, 19, 46, 47, 49–51, 54]. For instance, recent works [16, 19, 46, 50, 51] focus on reconstructing full-body motion from six inertial measurement units (IMUs). SIP [46] employs heuristic methods, while DIP [16] pioneers the use of deep neural networks for this task. PIP [51] and TIP [19] further improve performance by incorporating physics constraints. With the rise of VR/AR applications, researchers have turned their attention to reconstructing full-body motion from VR/AR devices, such as head-mounted devices (HMDs), which provide information only about the user's head and hands and therefore pose additional challenges. LoBSTr [49], AvatarPoser [18], and AvatarJLM [54] approach this task as a regression problem, using a GRU [49] and Transformer networks [18, 54] to predict the full-body pose from the sparse HMD observations. Another line of methods employs generative models [5, 7, 10, 11]. For example, VAEHMD [10] and FLAG [5] use a Variational AutoEncoder (VAE) [20] and normalizing flows [35], respectively. Recent works [7, 11] leverage diffusion models [15, 38] for motion generation, yielding promising results thanks to their strong ability to model the conditional probability distribution of full-body motion.
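To make that last point concrete, the sketch below shows how a conditional diffusion model can, in principle, sample a full-body pose given sparse HMD observations. It is a minimal, self-contained illustration only, not the implementation of [7, 11] or of SAGE; the `MotionDenoiser` architecture, the feature dimensions (132-D pose, 54-D sparse condition), and the noise schedule are all hypothetical assumptions.

```python
# Conceptual sketch: DDPM-style ancestral sampling of p(full-body pose | sparse HMD input).
# All names, shapes, and hyperparameters are illustrative assumptions.
import torch
import torch.nn as nn


class MotionDenoiser(nn.Module):
    """Predicts the noise added to a full-body pose, conditioned on sparse
    head/hand observations (hypothetical architecture)."""

    def __init__(self, pose_dim=132, cond_dim=54, hidden=256):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(pose_dim + cond_dim + 1, hidden),
            nn.SiLU(),
            nn.Linear(hidden, hidden),
            nn.SiLU(),
            nn.Linear(hidden, pose_dim),
        )

    def forward(self, x_t, cond, t):
        # x_t: (B, pose_dim) noisy pose, cond: (B, cond_dim) sparse observation, t: (B,) timestep
        t_emb = t.float().unsqueeze(-1) / 1000.0  # crude timestep embedding
        return self.net(torch.cat([x_t, cond, t_emb], dim=-1))


@torch.no_grad()
def sample(model, cond, steps=1000, pose_dim=132):
    """Iteratively denoise Gaussian noise into a full-body pose, conditioned on `cond`."""
    betas = torch.linspace(1e-4, 0.02, steps)
    alphas = 1.0 - betas
    alpha_bar = torch.cumprod(alphas, dim=0)

    x = torch.randn(cond.shape[0], pose_dim)  # start from pure noise
    for t in reversed(range(steps)):
        eps = model(x, cond, torch.full((cond.shape[0],), t))
        coef = betas[t] / torch.sqrt(1.0 - alpha_bar[t])
        mean = (x - coef * eps) / torch.sqrt(alphas[t])
        noise = torch.randn_like(x) if t > 0 else torch.zeros_like(x)
        x = mean + torch.sqrt(betas[t]) * noise
    return x  # one plausible full-body pose consistent with the sparse input


# Usage: sample 4 poses conditioned on a batch of (hypothetical) sparse HMD features.
poses = sample(MotionDenoiser(), cond=torch.randn(4, 54), steps=50)
```

Because sampling starts from noise rather than regressing a single answer, such models can produce multiple plausible full-body poses for the same sparse observation, which is the key appeal of the generative line of work cited above.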
In contrast to previous methods that model full-body motion within a single unified framework, our approach acknowledges the burden such formulations place on deep learning models, particularly in capturing the intricate kinematics of human motion. We therefore propose a stratified approach that decouples the conventional full-body avatar reconstruction pipeline: the upper body is reconstructed first, and the lower body is then reconstructed conditioned on the upper body.

Authors:
(1) Han Feng, Wuhan University (equal contribution, authors ordered alphabetically);
(2) Wenchao Ma, Pennsylvania State University (equal contribution, authors ordered alphabetically);
(3) Quankai Gao, University of Southern California;
(4) Xianwei Zheng, Wuhan University;
(5) Nan Xue, Ant Group (xuenan@ieee.org);
(6) Huijuan Xu, Pennsylvania State University.

This paper is available on arxiv under CC BY 4.0 DEED license.