
Adapting Motion Patterns Efficiently with MotionLoRA in AnimateDiff

by Model Tuning · November 18th, 2024

Too Long; Didn't Read

MotionLoRA in AnimateDiff offers an efficient way to personalize motion patterns like zooming and panning with low-cost fine-tuning, leveraging LoRA layers and minimal data requirements.

Authors:

(1) Yuwei Guo, The Chinese University of Hong Kong;

(2) Ceyuan Yang, Shanghai Artificial Intelligence Laboratory (Corresponding Author);

(3) Anyi Rao, Stanford University;

(4) Zhengyang Liang, Shanghai Artificial Intelligence Laboratory;

(5) Yaohui Wang, Shanghai Artificial Intelligence Laboratory;

(6) Yu Qiao, Shanghai Artificial Intelligence Laboratory;

(7) Maneesh Agrawala, Stanford University;

(8) Dahua Lin, Shanghai Artificial Intelligence Laboratory;

(9) Bo Dai, The Chinese University of Hong Kong.

Abstract and 1 Introduction

2 Related Work

3 Preliminary

4 AnimateDiff

4.1 Alleviate Negative Effects from Training Data with Domain Adapter

4.2 Learn Motion Priors with Motion Module

4.3 Adapt to New Motion Patterns with MotionLoRA

4.4 AnimateDiff in Practice

5 Experiments and 5.1 Qualitative Results

5.2 Quantitative Comparison

5.3 Ablative Study

5.4 Controllable Generation

6 Conclusion

7 Ethics Statement

8 Reproducibility Statement, Acknowledgement and References

4.3 ADAPT TO NEW MOTION PATTERNS WITH MOTIONLORA

While the pre-trained motion module captures general motion priors, a question remains: how can it be adapted efficiently to new motion patterns, such as camera zooming, panning, and rolling, using only a small number of reference videos and training iterations? Such efficiency matters for users who cannot afford the expensive pre-training but want to fine-tune the motion module for specific effects. This is the last stage of AnimateDiff, dubbed MotionLoRA (Fig. 3), an efficient fine-tuning approach for motion personalization. Considering the architecture of the motion module and the limited number of reference videos, we add LoRA layers to the self-attention layers of the motion module in the inflated model described in Sec. 4.2, then train these LoRA layers on reference videos of the new motion patterns.
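A minimal PyTorch-style sketch of this idea is shown below, assuming a generic wrapper around the attention projections; the `LoRALinear` class, the `TemporalSelfAttention` class name, and the `to_q`/`to_k`/`to_v`/`to_out` projection names are illustrative assumptions, not the authors' released code.

```python
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    """Wrap a frozen linear projection with a trainable low-rank update:
    y = W x + (alpha / r) * B(A(x)), where A: d_in -> r and B: r -> d_out."""
    def __init__(self, base: nn.Linear, rank: int = 32, alpha: float = 32.0):
        super().__init__()
        self.base = base
        self.base.requires_grad_(False)            # pre-trained weights stay frozen
        self.down = nn.Linear(base.in_features, rank, bias=False)
        self.up = nn.Linear(rank, base.out_features, bias=False)
        nn.init.zeros_(self.up.weight)             # start as a no-op update
        self.scale = alpha / rank

    def forward(self, x):
        return self.base(x) + self.scale * self.up(self.down(x))

def inject_motion_lora(motion_module: nn.Module, rank: int = 32):
    """Replace the q/k/v/out projections of every self-attention block in a
    (hypothetical) motion module with LoRA-wrapped versions."""
    for block in list(motion_module.modules()):
        if block.__class__.__name__ == "TemporalSelfAttention":   # assumed class name
            for name in ("to_q", "to_k", "to_v", "to_out"):
                proj = getattr(block, name, None)
                if isinstance(proj, nn.Linear):
                    setattr(block, name, LoRALinear(proj, rank=rank))
```

Because only the down/up matrices receive gradients, the artifact a user shares after fine-tuning is just these low-rank weights, which is what keeps the checkpoint small.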


We experiment with several shot types and obtain the reference videos via rule-based data augmentation. For instance, to obtain videos with zooming effects, we gradually reduce (zoom-in) or enlarge (zoom-out) the cropping area of the video frames along the temporal axis. We demonstrate that MotionLoRA achieves promising results with as few as 20 ∼ 50 reference videos, 2,000 training iterations (around 1 ∼ 2 hours), and about 30M of storage, enabling efficient model tuning and sharing among users. Benefiting from its low-rank property, MotionLoRA also has composition capability: individually trained MotionLoRA models can be combined at inference time to achieve composed motion effects. A sketch of the zoom augmentation follows.
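The zoom augmentation can be reproduced with a simple cropping schedule. The following is only an illustrative sketch under our own assumptions (function name, linear schedule, and bilinear resizing are not taken from the paper): the crop window shrinks or grows linearly over the frame axis and each crop is resized back to the original resolution, turning ordinary clips into pseudo zoom-in / zoom-out reference videos.

```python
import torch
import torch.nn.functional as F

def zoom_augment(frames: torch.Tensor, zoom_in: bool = True, min_scale: float = 0.6):
    """frames: (T, C, H, W) video tensor in [0, 1].
    Returns a clip of the same shape with a synthetic zoom, produced by a
    linearly shrinking (zoom-in) or growing (zoom-out) centered crop per frame."""
    T, C, H, W = frames.shape
    out = []
    for t in range(T):
        a = t / max(T - 1, 1)                     # progress 0 -> 1 along the temporal axis
        scale = 1.0 - a * (1.0 - min_scale) if zoom_in else min_scale + a * (1.0 - min_scale)
        h, w = int(H * scale), int(W * scale)
        top, left = (H - h) // 2, (W - w) // 2    # centered crop box
        crop = frames[t : t + 1, :, top : top + h, left : left + w]
        out.append(F.interpolate(crop, size=(H, W), mode="bilinear", align_corners=False))
    return torch.cat(out, dim=0)
```

Under the same low-rank view, composing motions amounts to summing the scaled low-rank updates of the individually trained MotionLoRA models on top of the shared base projections at inference time, for example a zoom-in LoRA and a panning LoRA applied together.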


This paper is available on arXiv under a CC BY 4.0 DEED license.