
What Is TokenFlow?


Too Long; Didn't Read

In this work, we present a framework that harnesses the power of a text-to-image diffusion model for the task of text-driven video editing.

Abstract and 1. Introduction

2 Related Work

3 Preliminaries

4 Method

4.1 Key Sample and Joint Editing

4.2 Edit Propagation Via TokenFlow

5 Results

5.1 Qualitative Evaluation and 5.2 Quantitative Evaluation

5.3 Ablation Study

6 Discussion

7 Acknowledgement and References

A Implementation Details

ABSTRACT

The generative AI revolution has recently expanded to videos. Nevertheless, current state-of-the-art video models are still lagging behind image models in terms of visual quality and user control over the generated content. In this work, we present a framework that harnesses the power of a text-to-image diffusion model for the task of text-driven video editing. Specifically, given a source video and a target text-prompt, our method generates a high-quality video that adheres to the target text, while preserving the spatial layout and motion of the input video. Our method is based on a key observation that consistency in the edited video can be obtained by enforcing consistency in the diffusion feature space. We achieve this by explicitly propagating diffusion features based on inter-frame correspondences, readily available in the model. Thus, our framework does not require any training or fine-tuning, and can work in conjunction with any off-the-shelf text-to-image editing method. We demonstrate state-of-the-art editing results on a variety of real-world videos.

1 INTRODUCTION

The evolution of text-to-image models has recently facilitated advances in image editing and content creation, allowing users to control various properties of both generated and real images. Nevertheless, extending this exciting progress to video still lags behind. A surge of large-scale text-to-video generative models has emerged, demonstrating impressive results in generating clips solely from textual descriptions. However, despite the progress made in this area, existing video models are still in their infancy, being limited in resolution, video length, or the complexity of video dynamics they can represent.

In this paper, we harness the power of a state-of-the-art pre-trained text-to-image model for the task of text-driven editing of natural videos. Specifically, our goal is to generate high-quality videos that adhere to the target edit expressed by an input text prompt, while preserving the spatial layout and motion of the original video.

The main challenge in leveraging an image diffusion model for video editing is to ensure that the edited content is consistent across all video frames – ideally, each physical point in the 3D world undergoes coherent modifications across time. Existing and concurrent video editing methods that are based on image diffusion models have demonstrated that global appearance coherency across the edited frames can be achieved by extending the self-attention module to include multiple frames (Wu et al., 2022; Khachatryan et al., 2023b; Ceylan et al., 2023; Qi et al., 2023). Nevertheless, this approach is insufficient for achieving the desired level of temporal consistency, as motion in the video is only implicitly preserved through the attention module. Consequently, professional or semi-professional users often resort to elaborate video editing pipelines that entail additional manual work.

In this work, we propose a framework that tackles this challenge by explicitly enforcing the original inter-frame correspondences on the edit. Intuitively, natural videos contain redundant information across frames, e.g., they depict similar appearance and shared visual elements. Our key observation is that the internal representation of the video in the diffusion model exhibits similar properties. That is, the level of redundancy and temporal consistency of the frames in the RGB space and in the diffusion feature space are tightly correlated. Based on this observation, the pillar of our approach is to achieve a consistent edit by ensuring that the features of the edited video are consistent across frames. Specifically, we enforce that the edited features convey the same inter-frame correspondences and redundancy as the original video features. To do so, we leverage the original inter-frame feature correspondences, which are readily available in the model. This leads to an effective method that directly propagates the edited diffusion features based on the original video dynamics. This approach allows us to harness the generative prior of a state-of-the-art image diffusion model without additional training or fine-tuning, and can work in conjunction with an off-the-shelf diffusion-based image editing method (e.g., Meng et al. (2022); Hertz et al. (2022); Zhang & Agrawala (2023); Tumanyan et al. (2023)).
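To make the propagation idea concrete, below is a minimal PyTorch-style sketch of correspondence-based feature propagation. The function name, tensor shapes, and the simple single-keyframe nearest-neighbor lookup are illustrative assumptions rather than the paper's exact implementation, which (per Section 4) samples and jointly edits a set of keyframes and then propagates their edited tokens to the remaining frames.

```python
import torch

def propagate_edited_features(frame_feats, key_feats, edited_key_feats):
    """
    Hypothetical sketch: propagate edited diffusion features to a frame
    using nearest-neighbor correspondences computed on the ORIGINAL features.

    frame_feats:      (N, D) feature tokens of one original frame
    key_feats:        (M, D) feature tokens of an original keyframe
    edited_key_feats: (M, D) feature tokens of the edited keyframe
    returns:          (N, D) propagated tokens for the edited frame
    """
    # Inter-frame correspondences: for each token of the frame, find its
    # nearest neighbor among the original keyframe tokens.
    dists = torch.cdist(frame_feats, key_feats)   # (N, M) pairwise distances
    nn_idx = dists.argmin(dim=1)                  # (N,) index of best match

    # Propagation: reuse those correspondences to pull tokens from the
    # edited keyframe, so the edit inherits the original video's redundancy.
    return edited_key_feats[nn_idx]
```

The point this toy version illustrates is the core design choice: the correspondences are computed once, on the original video features, and then applied to the edited features, so the original dynamics are explicitly carried over to the edit.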


To summarize, we make the following key contributions:


• A technique, dubbed TokenFlow, that enforces semantic correspondences of diffusion features across frames, significantly increasing temporal consistency in videos generated by a text-to-image diffusion model.


• A novel empirical analysis studying the properties of diffusion features across a video.


• State-of-the-art editing results on diverse videos depicting complex motions.


This paper is available on arxiv under the CC BY 4.0 DEED license.

Authors:

(1) Michal Geyer, Weizmann Institute of Science (equal contribution);

(2) Omer Bar-Tal, Weizmann Institute of Science (equal contribution);

(3) Shai Bagon, Weizmann Institute of Science;

(4) Tali Dekel, Weizmann Institute of Science.