We evaluate our method on DAVIS videos (Pont-Tuset et al., 2017) and on Internet videos depicting animals, food, humans, and various objects in motion. The spatial resolution of the videos is 384×672 or 512×512 pixels, and they consist of 40 to 200 frames. We use various text prompts on each video to obtain diverse editing results. Our evaluation dataset comprises 61 text-video pairs. We utilize PnP-Diffusion (Tumanyan et al., 2023) as the frame editing method, with the same hyper-parameters for all our results. PnP-Diffusion may fail to accurately preserve the structure of each frame due to inaccurate DDIM inversion (see Fig. 3, middle column, right frame: the dog's head is distorted). Our method is more robust to such failures, as multiple frames contribute to the generation of each frame in the video. Our framework can be combined with any diffusion-based image editing technique that accurately preserves the structure of the images; results with other image editing techniques (e.g., Meng et al. (2022); Zhang & Agrawala (2023)) are available in the SM.

Figs. 1 and 5 show sample frames from the edited videos. Our edits are temporally consistent and adhere to the edit prompt. The man's head is changed to Van Gogh or marble (top left); importantly, the man's identity and the scene's background are consistent throughout the video. The patterns of the polygonal wolf (bottom left) are the same across time: the body is consistently orange while the chest is blue. We refer the reader to the SM for implementation details and video results.
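To make the plug-and-play role of the frame editor concrete, the following is a minimal sketch (an assumption, not the authors' implementation) of how any structure-preserving, diffusion-based image editor could be slotted into a keyframe-edit-and-propagate pipeline. `edit_frame`, `edit_video_with_keyframes`, and the nearest-keyframe copy step are hypothetical placeholders; in the paper, propagation is performed with TokenFlow feature correspondences rather than the trivial stand-in shown here.

```python
# Minimal sketch (assumption, not the authors' code): the frame editor is an arbitrary
# structure-preserving, diffusion-based image editing function that can be swapped
# (e.g., PnP-Diffusion, SDEdit-style, or ControlNet-guided editing).
from typing import Callable, Dict, List
import torch

FrameEditor = Callable[[torch.Tensor, str], torch.Tensor]  # (frame, prompt) -> edited frame

def edit_video_with_keyframes(
    frames: List[torch.Tensor],        # RGB frames, e.g. 512x512 or 384x672
    prompt: str,
    edit_frame: FrameEditor,           # any structure-preserving image editor (swappable)
    keyframe_stride: int = 8,          # hypothetical keyframe sampling density
) -> List[torch.Tensor]:
    # 1) Edit a sparse set of keyframes with the chosen image editor.
    key_ids = list(range(0, len(frames), keyframe_stride))
    edited_keys: Dict[int, torch.Tensor] = {i: edit_frame(frames[i], prompt) for i in key_ids}

    # 2) Propagate the keyframe edits to every frame. The paper does this with
    #    TokenFlow feature correspondences; the nearest-keyframe copy below is only
    #    a trivial stand-in so the sketch runs end to end.
    edited = []
    for i in range(len(frames)):
        nearest = min(key_ids, key=lambda k: abs(k - i))
        edited.append(edited_keys[nearest])
    return edited
```

Swapping `edit_frame` for a different editor requires no other change to the pipeline, which mirrors the statement above that any structure-preserving, diffusion-based image editing technique can be plugged in.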
Baselines. We compare our method to state-of-the-art and concurrent works: (i) Fate-Zero (Qi et al., 2023) and (ii) Text2Video-Zero (Khachatryan et al., 2023b), which utilize a text-to-image model for video editing via self-attention inflation; (iii) Re-render a Video (Yang et al., 2023), which edits keyframes by adding optical-flow optimization to the self-attention inflation of an image model and then propagates the edit from the keyframes to the rest of the video using an off-the-shelf propagation method; (iv) Tune-a-Video (Wu et al., 2022), which fine-tunes the text-to-image model on the given test video; (v) Gen-1 (Esser et al., 2023), a video diffusion model trained on a large-scale image and video dataset; and (vi) a per-frame diffusion-based image editing baseline, PnP-Diffusion (Tumanyan et al., 2023). We additionally consider the following two baselines: (i) Text2LIVE (Bar-Tal et al., 2022), which utilizes a layered video representation (NLA) (Kasten et al., 2021) and performs test-time training using CLIP losses; note that NLA requires foreground/background separation masks and takes ∼10 hours to train. (ii) Applying PnP-Diffusion to a single keyframe and propagating the edit to the entire video using Jamriska et al. (2019); see the sketch after this paragraph.
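For concreteness, here is a similarly hedged sketch of the single-keyframe baseline (ii) above: edit one keyframe with a diffusion-based image editor and propagate that edit with an off-the-shelf example-based method (Jamriska et al., 2019). `edit_frame` and `propagate_exemplar` are hypothetical placeholders, not real APIs of those works.

```python
# Minimal sketch (assumption, not the baseline's actual code): single-keyframe editing
# followed by example-based propagation to the rest of the video.
from typing import Callable, List
import torch

def single_keyframe_baseline(
    frames: List[torch.Tensor],
    prompt: str,
    edit_frame: Callable[[torch.Tensor, str], torch.Tensor],          # e.g., PnP-Diffusion on one frame
    propagate_exemplar: Callable[[List[torch.Tensor], int, torch.Tensor], List[torch.Tensor]],
) -> List[torch.Tensor]:
    key = len(frames) // 2                        # pick a single keyframe (here, the middle frame)
    edited_key = edit_frame(frames[key], prompt)  # edit only that keyframe
    # Propagate the stylized keyframe to every frame, using the original video as
    # guidance (off-the-shelf propagation in the spirit of Jamriska et al., 2019).
    return propagate_exemplar(frames, key, edited_key)
```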
This paper is available on arXiv under the CC BY 4.0 DEED license.
Authors:
(1) Michal Geyer, Weizmann Institute of Science (equal contribution);
(2) Omer Bar-Tal, Weizmann Institute of Science (equal contribution);
(3) Shai Bagon, Weizmann Institute of Science;
(4) Tali Dekel, Weizmann Institute of Science.