OmniLottie Solves AI Animation’s Hardest Problem

Written by aimodels44 | Published 2026/03/11
Tech Story Tags: ai | omnilottie | vector-animations | parameterized-lottie-tokens | lottie-tokens | ai-animation | vector-animation | animation-generation

TLDR: OmniLottie tackles AI animation by turning Lottie JSON into learnable motion tokens, helping language models generate vector animations.

The animation generation gap

You can ask an AI to draw you a picture. But can you ask it to draw you a dancing picture, where every movement follows your instructions? That's the puzzle OmniLottie solves.

Text-to-image generation changed everything because we solved the encoding problem. Images are pixel arrays, and deep learning speaks fluently in arrays. But animations aren't just pixels: they're choreography, sequences, relationships between objects that move together. When you ask Midjourney for a picture, there's no temporal dimension to get wrong. When you ask an AI to animate a dancing character, you're asking it to specify thousands of decisions across time. There's no obvious way to reduce this to something a language model can handle.

This matters because animation remains a frontier. AI can generate still images impressively well, but the technical barriers for motion are higher. It's not a matter of compute or training data but of representation: how we ask machines to reason about motion itself. For years, the field has lacked an obvious abstraction layer between human intent ("make this character bounce") and machine capability (language models that excel at following instructions).

What makes Lottie difficult

Lottie is the industry standard for vector animations. Adobe uses it, Figma uses it, countless design tools rely on it. A Lottie file is JSON, which means it's theoretically readable by any computer system. It's brilliant for execution, for telling a renderer exactly what to draw and when. But it's terrible for learning.

A simple bouncing ball animation becomes a thousand-line JSON file filled with layer definitions, transform matrices, timing curves, and namespace declarations. Most of that is structural boilerplate, the equivalent of instructions for how to hold the paintbrush. The actual creative signal, the animation logic, is buried underneath.

This creates a fundamental problem: if you try to train a language model on raw Lottie JSON, treating it like code or any other text format, you're asking the model to learn from mostly noise. Standard tokenizers chop the JSON into fragments that mean nothing in isolation. The model wastes its capacity parsing structural metadata instead of understanding motion semantics. You could throw more data and compute at the problem, but you'd be fighting the representation rather than working with it.
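To see the boilerplate problem concretely, here is a heavily simplified Lottie-style fragment (illustrative only, not a complete or valid file) for a single layer with one animated property. Nearly all of the serialized text is scaffolding:

```python
import json

# A toy fragment of Lottie-style JSON (heavily simplified): one layer
# whose position animates between two keyframes.
layer = {
    "ty": 4, "ip": 0, "op": 60, "st": 0,   # layer type, in/out points
    "ks": {                                 # transform block
        "o": {"a": 0, "k": 100},            # opacity (static)
        "r": {"a": 0, "k": 0},              # rotation (static)
        "p": {"a": 1, "k": [                # position (animated)
            {"t": 0, "s": [0, 0]},
            {"t": 30, "s": [100, 100]},
        ]},
        "s": {"a": 0, "k": [100, 100]},     # scale (static)
    },
}

raw = json.dumps(layer)
print(len(raw))
# Roughly 200 characters of structure to encode one piece of creative
# signal: "move from (0, 0) to (100, 100) over 30 frames". A naive text
# tokenizer sees mostly the scaffolding, not the motion.
```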

Tokenization as translation

The core insight is simple: tokenization should be translation. Take Lottie JSON, written for renderers to read, and rewrite it in a language optimized for learning.

Instead of preserving every structural detail, you extract the semantic content. "Draw a rectangle here. Animate its position with this easing curve. Make it fade in over 500 milliseconds." Each piece becomes a token, a unit of meaning. The clever part is designing which details matter for learning animation and which are implementation noise.

OmniLottie's tokenizer does this systematically. It transforms Lottie files from "here's how to render this" into "here's what to animate." Rather than tokenizing JSON syntax, it tokenizes animation semantics: shapes, animation functions, and control parameters. This solves the representation problem by asking a specific question: what's the minimal set of commands and parameters a language model needs to understand and generate animations?

Once you have that, everything else becomes tractable. A rectangle becomes a single shape token with parameters (position, size, color). An animation curve becomes a function token with its control points and duration. Keyframe timing becomes explicit rather than buried in nested objects. The transformation is lossy on purpose, discarding everything a language model doesn't need to know.

This is where research on related frameworks becomes relevant. Work like OmniSVG and OmniTokenizer explores a similar space, asking how to tokenize visual content in ways that machines can learn from. OmniLottie applies these principles specifically to animation, recognizing that animation tokenization has different demands than still-image tokenization.

The transformation achieves something important: the resulting token sequence looks more like natural language instructions than code. "SHAPE rect [0, 0, 100, 100]" followed by "ANIMATE position [ease_out, 500ms, keyframes]" reads almost like a recipe. This structural similarity to language is why pretrained language models can handle it at all.
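A toy version of that translation step might look like the following (the token names and the intermediate layer format are illustrative; the paper defines its own parameterized vocabulary):

```python
# A toy semantic tokenizer (illustrative only; OmniLottie's actual
# vocabulary and parameterization are defined in the paper).
def tokenize_layer(layer: dict) -> list[str]:
    """Flatten one already-parsed layer into semantic token strings."""
    tokens = []
    shape = layer["shape"]
    tokens.append(f"SHAPE {shape['kind']} {shape['bbox']}")
    for anim in layer.get("animations", []):
        tokens.append(
            f"ANIMATE {anim['property']} "
            f"[{anim['easing']}, {anim['duration_ms']}ms, keyframes]"
        )
    return tokens


layer = {
    "shape": {"kind": "rect", "bbox": [0, 0, 100, 100]},
    "animations": [
        {"property": "position", "easing": "ease_out", "duration_ms": 500},
    ],
}
print(tokenize_layer(layer))
# → ['SHAPE rect [0, 0, 100, 100]',
#    'ANIMATE position [ease_out, 500ms, keyframes]']
```

The point is the output: flat, readable strings that a standard text tokenizer can split along meaningful boundaries instead of arbitrary JSON fragments.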

Building on language models

Once you have the right tokenization, something unexpected happens: pretrained vision-language models become capable of animation generation. These models were trained to follow instructions, to understand images and text in relation to each other, to generate coherent outputs conditioned on multimodal input. By converting animations into tokenized sequences that look more like instructions than like code, the researchers discovered that these models could be adapted to animation generation without extensive retraining from scratch.

The system takes multimodal input, your text description combined with reference images or sketches, and produces a sequence of animation tokens. The language model has already learned to reason about relationships, constraints, and composition from its pretraining on images and text. All it has to do is extend that reasoning to a new kind of output sequence. It's not learning animation from first principles; it's learning to follow a new instruction set that it's already structurally equipped to handle.
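To make "a new kind of output sequence" concrete, here is a minimal sketch of greedy conditional decoding over animation tokens. `EchoModel`, the `step` interface, and the token strings are hypothetical stand-ins, not the paper's API:

```python
class EchoModel:
    """Stand-in for a pretrained model: replays a scripted token sequence."""
    def __init__(self, script):
        self.script = list(script)

    def step(self, seq):
        # A real model would condition on `seq` (prompt + tokens so far)
        # and predict the next token; this stub just replays a script.
        return self.script.pop(0) if self.script else "<END>"


def generate_animation(model, prompt_tokens, max_len=256, eos="<END>"):
    """Greedy autoregressive decoding of animation tokens from a prompt."""
    out = []
    for _ in range(max_len):
        tok = model.step(list(prompt_tokens) + out)
        if tok == eos:
            break
        out.append(tok)
    return out


model = EchoModel(["SHAPE rect [0, 0, 100, 100]",
                   "ANIMATE position [ease_out, 500ms, keyframes]"])
print(generate_animation(model, ["<TEXT: a rectangle slides into place>"]))
```

Structurally this is the same loop a language model uses for text; only the token vocabulary has changed, which is the crux of the transfer argument.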

This is transfer learning in its purest form. The animation generation problem becomes a conditional generation problem, which is something language models do exceptionally well. The model doesn't need to reinvent the wheel on spatial relationships or temporal reasoning; it borrows those capabilities from its pretraining. What it learns new is the specific mapping from human instructions to animation tokens.

The architecture builds on existing pretrained models, which means the system benefits from all the scale and quality those models already possess. You get to stand on the shoulders of models trained on billions of images and trillions of tokens. The bottleneck shifts from "can the model learn animation" to "is our tokenization good enough that the model can map instruction to token effectively."

The dataset that made it possible

Even with the right representation and architecture, you need data. Professional vector animations are expensive and rare. They're usually locked behind proprietary design software, created by specialists, valued enough that companies don't release them casually.

The researchers curated MMLottie-2M, a dataset of 2 million professionally designed vector animations paired with text descriptions and visual references. This is the unsexy but essential part of AI research. Someone had to actually gather and annotate two million animations. The scale is remarkable, because animation data has been historically scarce in machine learning. Most animation research worked with thousands of examples, not millions.

Building a dataset at this scale required thoughtful engineering. The annotations needed to capture both what the animation looks like and what it does. Text descriptions had to be rich enough to capture the nuance of motion while remaining concise enough for practical training. Visual references had to show the key frames or context without just duplicating the animation itself.
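As a mental model only (the field names here are invented; the actual MMLottie-2M schema is the authors'), one training example pairs all three modalities:

```python
# Hypothetical shape of a single training example: a text description,
# visual references, and the target animation-token sequence.
example = {
    "text": "A red ball bounces twice, easing out as it settles.",
    "reference_frames": ["frame_000.png", "frame_030.png"],  # key frames
    "tokens": [
        "SHAPE ellipse [206, 40, 100, 100]",
        "ANIMATE position [bounce, 1200ms, keyframes]",
    ],
}

# Training would maximize the likelihood of `tokens` given the other fields.
print(sorted(example))
```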

Without this dataset, OmniLottie would be a nice idea with no way to validate it. The method only works because there's sufficient high-quality training data. The dataset itself becomes a contribution to the research community, a resource that future work on animation generation can build on.

How well does it actually work?

The real test is whether the system generates animations that feel intentional and alive. Can it take a text description like "a bouncing ball getting progressively faster" and produce an animation that matches? Can it understand a sketch and add motion to it? Can it handle multimodal input where you combine text, images, and visual references?

The results validate the approach. OmniLottie generates vivid and semantically aligned vector animations that adhere closely to human instructions. The tokenization strategy pays off, the transfer from pretrained models works, and the scale of training data supports learning complex animation patterns.

What's particularly interesting is what the system does well and where it still struggles. It excels at generating animations that follow explicit instructions, where the motion is directly specified in the prompt. It handles shape generation and animation curves reliably. It understands compound animations, multiple objects moving in relation to each other.

The limitations are instructive. Complex physics simulations remain challenging, because physics requires precise numerical simulation rather than learned patterns. Long sequences of animations where one motion depends on the result of the previous one can accumulate errors. Animations with highly specific timing constraints sometimes miss the mark.

These limitations aren't flaws in the approach; they're boundaries of what's learnable from data. They also suggest directions for future work. One could imagine hybrid systems that combine learned animation generation with physics solvers, or systems that iteratively refine animations conditioned on feedback.
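One way to picture the hybrid direction: a few lines of closed-form physics can produce exact bounce keyframes, which a system could then express in the same token format a learned model emits. The units and interface below are toy assumptions, not the paper's method:

```python
def bounce_keyframes(h0=300.0, restitution=0.7, fps=60, bounces=3):
    """Exact (frame, height) keyframes for a ball dropped from h0 (toy units)."""
    g = 980.0                       # gravity, units/s^2
    frames, t, h = [], 0.0, h0
    for _ in range(bounces):
        t += (2 * h / g) ** 0.5     # fall time to the ground
        frames.append((round(t * fps), 0.0))    # impact keyframe
        h *= restitution ** 2       # rebound height after energy loss
        t += (2 * h / g) ** 0.5     # rise time to the next apex
        frames.append((round(t * fps), h))      # apex keyframe
    return frames


print(bounce_keyframes())
```

A learned generator could consume keyframes like these as constraints, keeping the numerical precision a pattern-matching model tends to miss.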

The core contribution stands: by designing the right tokenization, the researchers unlocked the ability for language models to generate professional-quality vector animations. They solved the representation problem that was blocking this capability. Related work in multimodal generation like ShapeLLM and OmniGen2 suggests this approach extends to other domains. Whenever you have a format optimized for execution but opaque to learning, the same principle applies: design a tokenizer that exposes semantic structure.

The breakthrough isn't a new algorithm or a massive dataset alone; it's a new language. By translating Lottie into token sequences, the researchers gave pretrained language models the ability to think about animation the way they think about text. That's the insight that makes everything else work.


This is a Plain English Papers summary of a research paper called OmniLottie: Generating Vector Animations via Parameterized Lottie Tokens. If you like this kind of analysis, join AIModels.fyi or follow us on Twitter.


Written by aimodels44 | Among other things, launching AIModels.fyi ... Find the right AI model for your project - https://aimodels.fyi
Published by HackerNoon on 2026/03/11