Authors:
(1) Yuwei Guo, The Chinese University of Hong Kong;
(2) Ceyuan Yang, Shanghai Artificial Intelligence Laboratory (Corresponding Author);
(3) Anyi Rao, Stanford University;
(4) Zhengyang Liang, Shanghai Artificial Intelligence Laboratory;
(5) Yaohui Wang, Shanghai Artificial Intelligence Laboratory;
(6) Yu Qiao, Shanghai Artificial Intelligence Laboratory;
(7) Maneesh Agrawala, Stanford University;
(8) Dahua Lin, Shanghai Artificial Intelligence Laboratory;
(9) Bo Dai, The Chinese University of Hong Kong.

Table of Links

Abstract and 1 Introduction
2 Related Work
3 Preliminary
4 AnimateDiff
4.1 Alleviate Negative Effects from Training Data with Domain Adapter
4.2 Learn Motion Priors with Motion Module
4.3 Adapt to New Motion Patterns with MotionLoRA
4.4 AnimateDiff in Practice
5 Experiments and 5.1 Qualitative Results
5.2 Qualitative Comparison
5.3 Ablative Study
5.4 Controllable Generation
6 Conclusion
7 Ethics Statement
8 Reproducibility Statement, Acknowledgement and References

2 RELATED WORK

Text-to-image diffusion models. Diffusion models (Ho et al., 2020; Dhariwal & Nichol, 2021; Song et al., 2020) for text-to-image (T2I) generation (Gu et al., 2022; Mokady et al., 2023; Podell et al., 2023; Ding et al., 2021; Zhou et al., 2022b; Ramesh et al., 2021; Li et al., 2022) have recently gained significant attention in both academic and non-academic communities. GLIDE (Nichol et al., 2021) introduced text conditions and demonstrated that incorporating classifier guidance leads to more pleasing results. DALL-E2 (Ramesh et al., 2022) improves text-image alignment by leveraging the CLIP (Radford et al., 2021) joint feature space. Imagen (Saharia et al., 2022) incorporates a large language model (Raffel et al., 2020) and a cascade architecture to achieve photorealistic results. Latent Diffusion Model (Rombach et al., 2022), also known as Stable Diffusion, moves the diffusion process to the latent space of an auto-encoder to enhance efficiency. eDiff-I (Balaji et al., 2022) employs an ensemble of diffusion models specialized for different generation stages.

Personalizing T2I models.
To facilitate content creation with pre-trained T2I models, many works focus on efficient model personalization (Shi et al., 2023; Lu et al., 2023; Dong et al., 2022; Kumari et al., 2023), i.e., introducing concepts or styles to the base T2I using reference images. The most straightforward approach is to fine-tune the complete model. Although this can significantly enhance overall quality, it can also lead to catastrophic forgetting (Kirkpatrick et al., 2017; French, 1999) when the reference image set is small. Instead, DreamBooth (Ruiz et al., 2023) fine-tunes the entire network with a preservation loss and uses only a few images. Textual Inversion (Gal et al., 2022) optimizes a token embedding for each new concept. Low-Rank Adaptation (LoRA) (Hu et al., 2021) facilitates the above fine-tuning process by introducing additional LoRA layers to the base T2I and optimizing only the weight residuals. There are also encoder-based approaches that address the personalization problem (Gal et al., 2023; Jia et al., 2023). In our work, we focus on tuning-based methods, including overall fine-tuning, DreamBooth (Ruiz et al., 2023), and LoRA (Hu et al., 2021), as they preserve the original feature space of the base T2I.

Animating personalized T2Is. Few existing works address animating personalized T2Is. Text2Cinemagraph (Mahapatra et al., 2023) proposed to generate cinemagraphs via flow prediction. In the field of video generation, it is common to extend a pre-trained T2I with temporal structures. Existing works (Esser et al., 2023; Zhou et al., 2022a; Singer et al., 2022; Ho et al., 2022b,a; Ruan et al., 2023; Luo et al., 2023; Yin et al., 2023b,a; Wang et al., 2023b; Hong et al., 2022) mostly update all parameters and modify the feature space of the original T2I, and are therefore not compatible with personalized ones.
Align-Your-Latents (Blattmann et al., 2023) shows that the frozen image layers in a general video generator can be personalized. Recently, some video generation approaches have shown promising results in animating a personalized T2I model. Tune-a-Video (Wu et al., 2023) fine-tunes a small number of parameters on a single video. Text2Video-Zero (Khachatryan et al., 2023) introduces a training-free method to animate a pre-trained T2I via latent warping based on a pre-defined affine matrix.
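
To make the LoRA description in the personalization paragraph above more concrete, the following is a minimal sketch of a linear layer with a frozen weight matrix and a trainable low-rank residual. It is an illustration under stated assumptions, not the implementation used by the authors or by any library: the class name `LoRALinear`, the helper `matvec`, the `alpha / rank` scaling, and the zero initialization of `B` are choices made here for clarity, and only the Python standard library is used.

```python
import random


def matvec(m, v):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(a * b for a, b in zip(row, v)) for row in m]


class LoRALinear:
    """Hypothetical frozen linear layer W plus a trainable residual B @ A."""

    def __init__(self, weight, rank, alpha=1.0, seed=0):
        rng = random.Random(seed)
        d_out, d_in = len(weight), len(weight[0])
        self.weight = weight  # frozen base weights, never updated
        # Assumption: A starts small and random, B starts at zero, so the
        # residual B @ A is exactly zero before any training step.
        self.A = [[rng.gauss(0.0, 0.01) for _ in range(d_in)] for _ in range(rank)]
        self.B = [[0.0] * rank for _ in range(d_out)]
        self.scale = alpha / rank

    def forward(self, x):
        base = matvec(self.weight, x)
        # Low-rank path: project down with A, then back up with B.
        residual = matvec(self.B, matvec(self.A, x))
        return [b + self.scale * r for b, r in zip(base, residual)]

    def trainable_params(self):
        # Only A and B would receive gradients; W stays fixed.
        return len(self.A) * len(self.A[0]) + len(self.B) * len(self.B[0])


# Usage: an untrained adapter leaves the base layer's output unchanged.
d_out, d_in, rank = 8, 16, 2
rng = random.Random(1)
W = [[rng.gauss(0.0, 1.0) for _ in range(d_in)] for _ in range(d_out)]
layer = LoRALinear(W, rank)
x = [1.0] * d_in
assert layer.forward(x) == matvec(W, x)
print(layer.trainable_params(), "trainable vs", d_out * d_in, "frozen")
```

In this toy configuration the adapter has 48 trainable values against 128 frozen ones. Because the base weights are never modified, the residual can be removed or swapped without touching the original layer, which is consistent with the point above that tuning-based methods preserve the feature space of the base T2I.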

This paper is available on arXiv under the CC BY 4.0 DEED license.