The Case for Bigger Models in Human Motion AI

Written by aimodels44 | Published 2026/01/27
Tech Story Tags: artificial-intelligence | software-architecture | data-science | technology | testing | hy-motion-1.0 | flow-matching-models | text-to-motion

TL;DR: An analysis of HY-Motion and the scaling hypothesis, showing why larger models with better data unlock instruction-following motion generation.

This is a Plain English Papers summary of a research paper called HY-Motion 1.0: Scaling Flow Matching Models for Text-To-Motion Generation. If you like these kinds of analyses, join AIModels.fyi or follow us on Twitter.

The motion generation problem

For years, generating realistic human motion from text descriptions has felt stuck. Current models either fail to understand what you're asking for or produce movement that looks jerky and unnatural. Ask for an "angry walk toward a door," and the model might generate walking that's roughly the right speed but misses the emotional quality. Ask for something specific like "athletic jump with both arms extended," and it often collapses entirely. The fundamental challenge is that motion has temporal structure, physical constraints, and an almost infinite solution space. Unlike generating a static image where pixels either look right or wrong, motion requires the model to understand not just the shape of movement, but how emotion deforms it, how intention curves trajectories, and how multiple text concepts combine into a single coherent sequence.

This is why every model released so far has struggled with instruction-following. They catch maybe 70% of what you asked for and miss the nuance. The problem isn't that researchers don't understand the algorithms well enough. The bottleneck is something deeper: models trained at small scale simply don't develop the ability to understand and follow detailed instructions the way language models or image generators do.

The scaling hypothesis

The past five years of AI progress have been driven almost entirely by scaling. GPT-2 at 1.5 billion parameters could barely write coherent paragraphs. Increase that scale tenfold and something shifts. The model doesn't just do the same thing slightly better. It develops new capabilities. It reasons about edge cases it never encountered. It understands nuance and context in ways that feel qualitatively different from smaller versions.

The question for motion generation is straightforward: does this pattern hold? Or is something fundamentally different about this problem that makes scaling unhelpful?

HY-Motion answers that question by testing the hypothesis directly. Build a billion-parameter motion generation model and train it properly, and it develops instruction-following capabilities that smaller models never achieve. A small model learns to generate common motions competently. A billion-parameter model learns to listen to instructions, to combine concepts flexibly, to handle rare motion combinations and specific constraints. The research reveals that motion generation follows the same scaling laws as language and image generation, but only under one crucial condition: you need the right training data and the right training strategy.

Building the right foundation

Scaling only works if you have high-quality data to scale on. This is the unsexy part of the paper, the part many researchers skip, but it's actually where much of the breakthrough lives.

The fundamental problem is that motion datasets are messy. Raw motion capture contains jitter and artifacts from the recording process. Text descriptions are often vague or incorrect. Without cleaning, any model trained on this noise learns garbled patterns. HY-Motion treats data as a first-class problem.

Data processing pipeline overview

The data processing pipeline shows how raw motion capture data flows through cleaning, annotation, and quality control stages

The processing pipeline performs rigorous motion cleaning to remove artifacts and temporal inconsistencies. Careful captioning ensures the text actually describes the motion rather than being a generic label. The team then organized motions into a hierarchical taxonomy that gives the model rich conceptual structure to learn from.
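To make the cleaning step concrete, here is a minimal sketch of what one such pass might look like, assuming motion capture stored as per-frame joint positions. The filter settings and velocity threshold are illustrative assumptions, not values from the paper.

```python
import numpy as np
from scipy.signal import savgol_filter

def clean_motion(joints, fps=30.0, max_speed_m_s=12.0):
    """joints: (T, J, 3) array of joint positions in meters."""
    # Smooth each coordinate over time to suppress capture jitter.
    smoothed = savgol_filter(joints, window_length=9, polyorder=3, axis=0)

    # Flag temporal inconsistencies: frame-to-frame joint speeds that
    # exceed what a human body can plausibly produce.
    velocities = np.diff(smoothed, axis=0) * fps      # (T-1, J, 3)
    speeds = np.linalg.norm(velocities, axis=-1)      # (T-1, J)
    bad_frames = np.where(speeds.max(axis=1) > max_speed_m_s)[0]

    return smoothed, bad_frames
```

Real pipelines layer many such filters and human review on top, but the shape of the problem is the same: smooth out recording noise, then detect and discard physically implausible segments.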

Motion category hierarchy spanning 200+ categories across 6 major classes

The hierarchy shows how motions are organized: 6 major classes branch into 200+ specific categories, giving the model granular conceptual structure

This hierarchical organization isn't arbitrary. It reflects how humans naturally structure motion in their understanding. The model learns not just individual motions, but the relationships between them. How does walking differ from running? How does emotion modulate both? The cleaned dataset spans over 3,000 hours of motion data, and another 400 hours is reserved for high-quality fine-tuning. This foundation is what makes scaling meaningful. Without it, you'd be training a billion-parameter model on garbage.
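As a toy illustration of what such a taxonomy can look like in code, here is a hypothetical sketch. The class and category names below are invented for illustration; the paper's actual 6 classes and 200+ categories differ.

```python
# Hypothetical two-level motion taxonomy: major classes -> categories.
MOTION_TAXONOMY = {
    "locomotion":  ["walk", "run", "jump", "crawl"],
    "upper_body":  ["reach", "wave", "throw", "point"],
    "sports":      ["kick", "swing", "dribble", "serve"],
    "dance":       ["spin", "step_sequence", "pose"],
    "interaction": ["sit", "open_door", "pick_up"],
    "expression":  ["angry_gesture", "celebrate", "shrug"],
}

def label_path(major_class: str, category: str) -> str:
    """Produce a hierarchical label like 'locomotion/walk' for annotation."""
    assert category in MOTION_TAXONOMY[major_class]
    return f"{major_class}/{category}"
```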

The three-stage training recipe

Large-scale models are brittle without the right training strategy. HY-Motion's real innovation is the systematic approach that stabilizes and aligns them. This recipe mirrors how state-of-the-art language models are trained, bringing proven techniques from another domain into motion generation.

Pretraining on large-scale unsupervised data is where the model learns the grammar of motion. Show it 3,000 hours of diverse human movement, and it learns that walking has rhythm, that reaching has velocity constraints, that expressions change how bodies move. The model isn't trying to follow instructions yet. It's learning what plausible motion looks like, building rich internal representations of movement dynamics.

High-quality fine-tuning comes next. After pretraining, take that much smaller reserve of 400 hours of exceptionally well-curated motion-text pairs. The model has already learned how bodies move. Now it learns to listen to what's being asked. This stage is where instruction-following gets tight. You want perfect alignment between text and motion here, so the model learns what precision feels like. This is the domain-expert phase, where a smaller dataset of near-perfect examples shapes the model's behavior.
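As a rough sketch of this stage, assuming a PyTorch setup: fine-tuning reuses the same training objective as pretraining (a flow matching loss, sketched in the architecture section below) on the curated pairs, typically at a lower learning rate. Every name here is ours, not the paper's.

```python
import torch

def finetune(model, curated_loader, flow_matching_loss, epochs=3):
    # Lower LR than pretraining so the curated pairs refine, rather than
    # overwrite, what the model already learned about motion dynamics.
    opt = torch.optim.AdamW(model.parameters(), lr=1e-5)
    for _ in range(epochs):
        for motion, text_emb in curated_loader:
            loss = flow_matching_loss(model, motion, text_emb)
            opt.zero_grad()
            loss.backward()
            opt.step()
```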

Reinforcement learning from both human feedback and reward models completes the loop. Rather than assuming fine-tuning labels are perfect, the paper adds a feedback mechanism. Humans and learned reward models evaluate whether generated motions actually match text descriptions. The model then learns to optimize against this feedback. This is where the model learns to correct its own mistakes and push past previous limitations.
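This summary doesn't spell out the exact RL algorithm, so here is one common pattern as a hedged sketch: sample several motions per prompt, score them with the reward model, and fine-tune on the winners (rejection-sampling fine-tuning). All names here, including `model.sample` and `reward_model`, are hypothetical.

```python
import torch

@torch.no_grad()
def collect_preferred(model, reward_model, prompts, n_samples=8):
    """For each text prompt, keep the highest-reward generated motion."""
    winners = []
    for text_emb in prompts:
        candidates = [model.sample(text_emb) for _ in range(n_samples)]
        scores = torch.stack([reward_model(m, text_emb) for m in candidates])
        best = scores.argmax().item()
        winners.append((candidates[best], text_emb))
    # Fine-tune on these pairs with the same supervised objective,
    # closing the feedback loop.
    return winners
```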

Overview of the HY-Motion framework showing all three training stages

The framework diagram shows how pretraining, fine-tuning, and RLHF stages feed into each other, creating a complete training pipeline

This three-stage approach is the insight that transfers beyond motion generation. Pretraining on massive unsupervised data creates general capability. Fine-tuning on curated data tightens behavior. Alignment via feedback ensures the model does what humans actually want. The same pattern has worked for language models and shows similar promise here.

The architecture that makes scaling work

All of this depends on an architecture that actually scales well. HY-Motion uses flow matching combined with a Transformer-based architecture called DiT (Diffusion Transformer).

Flow matching is a newer generative modeling approach that's simpler and more efficient than traditional diffusion. Instead of gradually adding noise and learning to reverse it, flow matching learns smooth paths from simple distributions directly to data. When combined with a Transformer architecture proven to work at billion-parameter scale in other domains, you get something that scales beautifully. Transformers handle sequential data naturally, and flow matching gives you cleaner gradients and better sample efficiency than alternatives.
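Here is what the core training objective might look like under the common linear-interpolation (rectified flow) formulation; the paper may use a variant, so treat this as a sketch. `model` stands in for the DiT, taking a noisy motion, a timestep, and a text embedding; generation integrates the learned velocity field with a simple Euler solver.

```python
import torch

def flow_matching_loss(model, x1, text_emb):
    """x1: (B, T, D) batch of real motion sequences."""
    b = x1.shape[0]
    x0 = torch.randn_like(x1)              # sample from the simple base distribution
    t = torch.rand(b, device=x1.device)    # random point along each path
    t_ = t.view(b, 1, 1)
    x_t = (1 - t_) * x0 + t_ * x1          # point on the straight noise-to-data path
    target_v = x1 - x0                     # constant velocity of that path
    pred_v = model(x_t, t, text_emb)
    return torch.mean((pred_v - target_v) ** 2)

@torch.no_grad()
def sample(model, text_emb, shape, steps=50):
    """Generate motion by Euler-integrating the learned velocity field."""
    x = torch.randn(shape)
    dt = 1.0 / steps
    for i in range(steps):
        t = torch.full((shape[0],), i * dt)
        x = x + dt * model(x, t, text_emb)
    return x
```

Note how little machinery this needs compared to a full diffusion schedule: one interpolation, one regression target, and a plain ODE step at inference time.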

Model architecture of HY-Motion DiT

The architecture diagram shows how transformer layers process motion tokens while the flow matching objective guides generation

The choice matters because it determines whether scaling actually helps or just makes things slower. Flow matching is the right algorithmic fit for motion, the way transformer architectures were the right fit for language.
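For intuition, here is a hedged sketch of a DiT-style block following the standard recipe from image DiTs: self-attention over motion tokens, with the timestep-and-text conditioning injected through adaptive layer norm (adaLN) modulation. Dimensions and details are illustrative, not the paper's exact design.

```python
import torch
import torch.nn as nn

class DiTBlock(nn.Module):
    def __init__(self, dim: int, heads: int):
        super().__init__()
        self.norm1 = nn.LayerNorm(dim, elementwise_affine=False)
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.norm2 = nn.LayerNorm(dim, elementwise_affine=False)
        self.mlp = nn.Sequential(nn.Linear(dim, 4 * dim), nn.GELU(),
                                 nn.Linear(4 * dim, dim))
        # Conditioning produces per-block scale/shift/gate parameters.
        # Zero-init so every block starts as an identity map, the
        # "adaLN-Zero" trick from the original DiT paper.
        self.ada = nn.Linear(dim, 6 * dim)
        nn.init.zeros_(self.ada.weight)
        nn.init.zeros_(self.ada.bias)

    def forward(self, x, cond):
        # x: (B, T, dim) motion tokens; cond: (B, dim) timestep+text embedding
        s1, b1, g1, s2, b2, g2 = self.ada(cond).unsqueeze(1).chunk(6, dim=-1)
        h = self.norm1(x) * (1 + s1) + b1
        x = x + g1 * self.attn(h, h, h, need_weights=False)[0]
        h = self.norm2(x) * (1 + s2) + b2
        return x + g2 * self.mlp(h)
```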

Breaking the ceiling

After building the data foundation, designing the training recipe, and choosing the right architecture, the question is simple: does it work?

Comparison of HY-Motion 1.0 to state-of-the-art models including DART, LoM, GoToZero, and MoMask

Top: Quantitative comparison shows HY-Motion outperforming current open-source benchmarks. Bottom: Visual examples of HY-Motion outputs demonstrate natural and precise motion generation

The difference between HY-Motion and previous models isn't incremental. It's qualitative. Previous models generate motion that vaguely resembles what you asked for, but specificity breaks them. Ask for an angry walk with both arms down, and they struggle. Ask for motion with fine-grained athletic requirements, and they often fail. HY-Motion actually listens.

Visual comparison examples showing HY-Motion versus state-of-the-art alternatives

Side-by-side examples show how HY-Motion produces more natural, anatomically correct motion that more precisely matches text descriptions

The model has developed instruction-following capabilities that smaller models never achieved. It understands motion at the level of detail that matters for real applications. The coverage is extensive too, spanning over 200 motion categories across 6 major classes. This breadth is what gives the model flexibility to handle novel combinations of concepts.

Implications for the future

HY-Motion demonstrates something important: motion generation follows the same scaling laws as language and image generation. This matters beyond motion specifically. It shows that when you pick a new domain and want to unlock new capabilities, the path is predictable. Gather massive amounts of data. Clean it meticulously. Train at scale. Refine with human feedback.

This also signals where motion generation is heading. For years it's been a research curiosity, interesting but not yet practical. HY-Motion represents a transition point where the technology is starting to move toward commercial maturity. That's not because of a fundamental algorithmic breakthrough, but because someone built the data pipeline and training recipe that makes scaling work.

For researchers working in other domains where scaling hasn't yet unlocked capabilities, the lesson is clear: look first at your data and training strategy before designing new architectures. The bottleneck is usually not the algorithm.

The open-source release of HY-Motion is significant too. Previous motion generation work lived in closed research codebases. Making this available accelerates the entire field. Other researchers can build on this foundation, test new ideas, and push the boundary forward faster than any single lab could alone.

The broader pattern here connects to work on text-driven motion generation that identified core challenges in the field, and research like GoToZero and OmnimoGen that explored different architectural approaches. HY-Motion doesn't replace those efforts. It builds on them, showing that the scaling hypothesis holds when you get the fundamentals right.

