We evaluate our method on DAVIS videos (Pont-Tuset et al., 2017) and on Internet videos depicting animals, food, humans, and various objects in motion. The spatial resolution of the videos is 384×672 or 512×512 pixels, and they consist of 40 to 200 frames. We use various text prompts on each video to obtain diverse editing results. Our evaluation dataset comprises 61 text-video pairs. We utilize PnP-Diffusion (Tumanyan et al., 2023) as the frame editing method, with the same hyper-parameters for all our results. PnP-Diffusion may fail to accurately preserve the structure of each frame due to inaccurate DDIM inversion (see Fig. 3, middle column, right frame: the dog's head is distorted). Our method is more robust to such failures, as multiple frames contribute to the generation of each frame in the video. Our framework can be combined with any diffusion-based image editing technique that accurately preserves the structure of the images; results with other image editing techniques (e.g., Meng et al. (2022); Zhang & Agrawala (2023)) are available in the SM.

Figs. 1 and 5 show sample frames from the edited videos. Our edits are temporally consistent and adhere to the edit prompt. The man's head is changed to Van Gogh or marble (top left); importantly, the man's identity and the scene's background are consistent throughout the video. The patterns of the polygonal wolf (bottom left) are the same across time: the body is consistently orange while the chest is blue. We refer the reader to the SM for implementation details and video results.
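To make the plug-and-play role of the frame editor concrete, the following is a minimal sketch (an assumption, not the authors' implementation) of how any structure-preserving, diffusion-based image editor could be slotted into a keyframe-edit-and-propagate pipeline. `edit_frame`, `edit_video_with_keyframes`, and the nearest-keyframe copy step are hypothetical placeholders; in the paper, propagation is performed with TokenFlow feature correspondences rather than the trivial stand-in shown here.

```python
# Minimal sketch (assumption, not the authors' code): the frame editor is an arbitrary
# structure-preserving, diffusion-based image editing function that can be swapped
# (e.g., PnP-Diffusion, SDEdit-style, or ControlNet-guided editing).
from typing import Callable, Dict, List
import torch

FrameEditor = Callable[[torch.Tensor, str], torch.Tensor]  # (frame, prompt) -> edited frame

def edit_video_with_keyframes(
    frames: List[torch.Tensor],        # RGB frames, e.g. 512x512 or 384x672
    prompt: str,
    edit_frame: FrameEditor,           # any structure-preserving image editor (swappable)
    keyframe_stride: int = 8,          # hypothetical keyframe sampling density
) -> List[torch.Tensor]:
    # 1) Edit a sparse set of keyframes with the chosen image editor.
    key_ids = list(range(0, len(frames), keyframe_stride))
    edited_keys: Dict[int, torch.Tensor] = {i: edit_frame(frames[i], prompt) for i in key_ids}

    # 2) Propagate the keyframe edits to every frame. The paper does this with
    #    TokenFlow feature correspondences; the nearest-keyframe copy below is only
    #    a trivial stand-in so the sketch runs end to end.
    edited = []
    for i in range(len(frames)):
        nearest = min(key_ids, key=lambda k: abs(k - i))
        edited.append(edited_keys[nearest])
    return edited
```

Swapping `edit_frame` for a different editor requires no other change to the pipeline, which mirrors the statement above that any structure-preserving, diffusion-based image editing technique can be plugged in.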
Baselines. We compare our method to state-of-the-art and concurrent works: (i) Fate-Zero (Qi et al., 2023) and (ii) Text2Video-Zero (Khachatryan et al., 2023b), which utilize a text-to-image model for video editing via self-attention inflation; (iii) Re-render a Video (Yang et al., 2023), which edits keyframes by adding optical-flow optimization to the self-attention inflation of an image model and then propagates the edit from the keyframes to the rest of the video using an off-the-shelf propagation method; (iv) Tune-a-Video (Wu et al., 2022), which fine-tunes the text-to-image model on the given test video; (v) Gen-1 (Esser et al., 2023), a video diffusion model trained on a large-scale image and video dataset; and (vi) a per-frame diffusion-based image editing baseline, PnP-Diffusion (Tumanyan et al., 2023). We additionally consider the following two baselines: (i) Text2LIVE (Bar-Tal et al., 2022), which utilizes a layered video representation (NLA) (Kasten et al., 2021) and performs test-time training using CLIP losses; note that NLA requires foreground/background separation masks and takes ∼10 hours to train. (ii) Applying PnP-Diffusion to a single keyframe and propagating the edit to the entire video using Jamriska et al. (2019); see the sketch after this paragraph.
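For concreteness, here is a similarly hedged sketch of the single-keyframe baseline (ii) above: edit one keyframe with a diffusion-based image editor and propagate that edit with an off-the-shelf example-based method (Jamriska et al., 2019). `edit_frame` and `propagate_exemplar` are hypothetical placeholders, not real APIs of those works.

```python
# Minimal sketch (assumption, not the baseline's actual code): single-keyframe editing
# followed by example-based propagation to the rest of the video.
from typing import Callable, List
import torch

def single_keyframe_baseline(
    frames: List[torch.Tensor],
    prompt: str,
    edit_frame: Callable[[torch.Tensor, str], torch.Tensor],          # e.g., PnP-Diffusion on one frame
    propagate_exemplar: Callable[[List[torch.Tensor], int, torch.Tensor], List[torch.Tensor]],
) -> List[torch.Tensor]:
    key = len(frames) // 2                        # pick a single keyframe (here, the middle frame)
    edited_key = edit_frame(frames[key], prompt)  # edit only that keyframe
    # Propagate the stylized keyframe to every frame, using the original video as
    # guidance (off-the-shelf propagation in the spirit of Jamriska et al., 2019).
    return propagate_exemplar(frames, key, edited_key)
```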
This paper is available on arXiv under the CC BY 4.0 DEED license.
Authors:
(1) Michal Geyer, Weizmann Institute of Science (equal contribution);
(2) Omer Bar-Tal, Weizmann Institute of Science (equal contribution);
(3) Shai Bagon, Weizmann Institute of Science;
(4) Tali Dekel, Weizmann Institute of Science.