Temporal Refinement in Stratified Motion Diffusion: Utilizing GRU for Smoothed Full-Body Prediction

Written by zaddy | Published 2025/10/22
Tech Story Tags: ar | 3d-avatar-generation | sparse-observations | head-mounted-devices-(hmds) | full-body-reconstruction | kinematic-tree | smpl-model | latent-diffusion-model

TL;DR: This article describes the online inference stage of the model, which predicts 3D avatars frame by frame using a sliding-window approach.

Abstract and 1. Introduction

  2. Related Work

    2.1. Motion Reconstruction from Sparse Input

    2.2. Human Motion Generation

  3. SAGE: Stratified Avatar Generation and 3.1. Problem Statement and Notation

    3.2. Disentangled Motion Representation

    3.3. Stratified Motion Diffusion

    3.4. Implementation Details

  4. Experiments and Evaluation Metrics

    4.1. Dataset and Evaluation Metrics

    4.2. Quantitative and Qualitative Results

    4.3. Ablation Study

  5. Conclusion and References

Supplementary Material

A. Extra Ablation Studies

B. Implementation Details

3.4. Implementation Details

For the inference stage, we evaluate our model in an online manner. Specifically, we fix the sequence length at 20 for both the input and the output of our model, and we retain only the last pose of each output motion sequence. Given a sparse observation sequence, we apply our model with a sliding window. For the first 20 poses, we pad the sparse observation sequence x at the beginning with the first available observation. We make this choice because online inference reflects how the model would be deployed in real-world applications, and it allows motion sequences to be predicted frame by frame.
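As a minimal sketch of this sliding-window scheme, the following PyTorch snippet shows one way to implement it; model, WINDOW, and the tensor shapes are hypothetical placeholders rather than the paper's actual interface.

import torch

WINDOW = 20  # fixed input/output sequence length

@torch.no_grad()
def infer_online(model, sparse_obs: torch.Tensor) -> torch.Tensor:
    # sparse_obs: (T, D) sequence of sparse HMD observations.
    # Returns a (T, P) tensor of full-body poses, predicted frame by frame.
    T = sparse_obs.shape[0]
    # Pad the beginning with the first available observation so that the
    # first WINDOW frames still see a full-length input window.
    pad = sparse_obs[:1].expand(WINDOW - 1, -1)
    padded = torch.cat([pad, sparse_obs], dim=0)  # (T + WINDOW - 1, D)
    poses = []
    for t in range(T):
        window = padded[t : t + WINDOW].unsqueeze(0)  # (1, WINDOW, D)
        pred = model(window)                          # (1, WINDOW, P)
        poses.append(pred[0, -1])                     # keep only the last pose
    return torch.stack(poses, dim=0)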

In addition, we employ a simple two-layer GRU [9] on top of the full-body decoder as a temporal memory that smooths the predicted output sequence at minimal computational expense; we term this module the Refiner. To train the Refiner, we use the same velocity loss as [54]. Our model takes 0.74 ms to infer one frame on a single NVIDIA RTX 3090 GPU.
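A minimal sketch of such a Refiner is shown below, assuming a PyTorch setting; the residual connection and the L1 form of the velocity loss are our assumptions for illustration, and the exact formulation in [54] may differ.

import torch
import torch.nn as nn

class Refiner(nn.Module):
    # Two-layer GRU applied on top of the full-body decoder output,
    # acting as a light-weight temporal memory that smooths predictions.
    def __init__(self, pose_dim: int, hidden: int = 512):
        super().__init__()
        self.gru = nn.GRU(pose_dim, hidden, num_layers=2, batch_first=True)
        self.out = nn.Linear(hidden, pose_dim)

    def forward(self, poses: torch.Tensor) -> torch.Tensor:
        # poses: (B, T, pose_dim) raw decoder output; the residual
        # connection (an assumption) keeps refinement a small correction.
        h, _ = self.gru(poses)
        return poses + self.out(h)

def velocity_loss(pred: torch.Tensor, gt: torch.Tensor) -> torch.Tensor:
    # Penalize the gap between predicted and ground-truth frame-to-frame
    # velocities, encouraging temporally smooth output.
    pred_vel = pred[:, 1:] - pred[:, :-1]
    gt_vel = gt[:, 1:] - gt[:, :-1]
    return (pred_vel - gt_vel).abs().mean()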

Authors:

(1) Han Feng, Wuhan University (equal contribution, authors ordered alphabetically);

(2) Wenchao Ma, Pennsylvania State University (equal contribution, authors ordered alphabetically);

(3) Quankai Gao, University of Southern California;

(4) Xianwei Zheng, Wuhan University;

(5) Nan Xue, Ant Group ([email protected]);

(6) Huijuan Xu, Pennsylvania State University.


This paper is available on arXiv under the CC BY 4.0 DEED license.

