5.1 Qualitative Evaluation
Fig. 6 provides a qualitative comparison of our method to prominent baselines; please refer to the SM for the full videos. Our method (bottom row) outputs videos that better adhere to the edit prompt while maintaining the temporal consistency of the edited video, whereas the other methods struggle to meet both goals. Tune-A-Video (second row) inflates the 2D image model into a video model and fine-tunes it to overfit the motion of the video; it is thus suitable for short clips. For long videos it struggles to capture the motion, resulting in meaningless edits, e.g., the shiny metal sculpture. Applying PnP to each frame independently (third row) produces exquisite edits that adhere to the edit prompt but, as expected, lack any temporal consistency. The results of Gen-1 (fourth row) also suffer from temporal inconsistencies (the beak of the origami stork changes color). Moreover, their frame quality is significantly worse than that of a text-to-image diffusion model. The edits of Text2Video-Zero and Fate-Zero (fifth and sixth rows) suffer from severe jittering, as these methods rely heavily on the extended attention mechanism to implicitly encourage consistency. The results of Rerender-a-Video exhibit notable long-range inconsistencies and artifacts, arising primarily from their reliance on optical flow estimation between distant frames (e.g., keyframes), which is known to be sub-optimal (see our video results in the SM; when the wolf turns its head, the nose color changes). We provide qualitative comparisons to Text2LIVE and to an RGB propagation baseline in the SM.
5.2 Quantitative Evaluation

We evaluate our method in terms of: (i) edit fidelity, measured by computing the average similarity between the CLIP embedding (Radford et al., 2021) of each edited frame and the target text prompt; and (ii) temporal consistency. Following Ceylan et al. (2023) and Lai et al. (2018a), temporal consistency is measured by (a) computing the optical flow of the original video using Teed & Deng (2020), warping the edited frames according to it, and measuring the warping error, and (b) a user study. We adopt the Two-alternative Forced Choice (2AFC) protocol suggested in Kolkin et al. (2019) and Park et al. (2020), where participants are shown the input video, our result, and a baseline result, and are asked to determine which video is more temporally consistent and better preserves the motion of the original video. The survey consists of 2000-3000 judgments per baseline, obtained using Amazon Mechanical Turk. We note that the warping error could not be measured for Gen-1, since their product platform does not output the same number of frames as the input. Table 1 compares our method to the baselines. Our method achieves the highest CLIP score, showing a good fit between the edited video and the guidance prompt. Furthermore, our method has a low warping error, indicating temporally consistent results. We note that Rerender-a-Video optimizes for the warping error and uses optical flow to propagate the edit, and hence has the lowest warping error; however, this reliance on optical flow often creates artifacts and long-range inconsistencies which are not reflected in the warping error. Nonetheless, they are apparent in the user study, which shows that users significantly favoured our method over all baselines in terms of temporal consistency. Additionally, we consider the reference baseline of passing the original video through the LDM auto-encoder without performing any editing (LDM recon.). This baseline provides an upper bound on the temporal consistency achievable with the LDM auto-encoder. As expected, the CLIP similarity of this baseline is poor, as it does not involve any editing. However, this baseline does not achieve zero warping error either, due to the imperfect reconstruction of the LDM auto-encoder, which hallucinates high-frequency information.
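To make the automatic metrics concrete, the sketch below shows one way to compute the CLIP similarity and the warping error. It assumes frames given as PyTorch float tensors of shape (T, 3, H, W) in [0, 1], the Hugging Face `transformers` checkpoint `openai/clip-vit-base-patch32`, and torchvision's RAFT as the optical-flow estimator (Teed & Deng, 2020). These model, preprocessing, and error-definition choices are our assumptions for illustration, not the paper's exact evaluation code.

```python
# Minimal sketch of the two automatic metrics (edit fidelity and warping error).
# Assumptions: frames are float tensors (T, 3, H, W) in [0, 1]; H and W are divisible
# by 8 (a RAFT requirement); no occlusion masking is applied.
import torch
import torch.nn.functional as F
from torchvision.models.optical_flow import raft_large, Raft_Large_Weights
from torchvision.transforms.functional import to_pil_image
from transformers import CLIPModel, CLIPProcessor

device = "cuda" if torch.cuda.is_available() else "cpu"


def clip_score(edited_frames, prompt):
    """Average cosine similarity between each edited frame and the target text prompt."""
    model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32").to(device).eval()
    proc = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")
    images = [to_pil_image(f.cpu()) for f in edited_frames]
    inputs = proc(text=[prompt], images=images, return_tensors="pt", padding=True).to(device)
    with torch.no_grad():
        out = model(**inputs)
    img = F.normalize(out.image_embeds, dim=-1)   # (T, D)
    txt = F.normalize(out.text_embeds, dim=-1)    # (1, D)
    return (img @ txt.T).mean().item()


def warping_error(original_frames, edited_frames):
    """Warp edited frame t into frame t+1 using flow computed on the ORIGINAL video."""
    raft = raft_large(weights=Raft_Large_Weights.DEFAULT).to(device).eval()
    errs = []
    for t in range(len(original_frames) - 1):
        src = original_frames[t : t + 1].to(device)      # original frame t
        tgt = original_frames[t + 1 : t + 2].to(device)  # original frame t+1
        with torch.no_grad():
            # RAFT expects inputs in [-1, 1]; this flow maps pixels of frame t+1 to frame t
            flow = raft(tgt * 2 - 1, src * 2 - 1)[-1]    # (1, 2, H, W)
        _, _, H, W = flow.shape
        ys, xs = torch.meshgrid(torch.arange(H, device=device),
                                torch.arange(W, device=device), indexing="ij")
        grid = torch.stack((xs, ys), dim=-1)[None].float() + flow.permute(0, 2, 3, 1)
        grid[..., 0] = 2 * grid[..., 0] / (W - 1) - 1    # normalize to [-1, 1] for grid_sample
        grid[..., 1] = 2 * grid[..., 1] / (H - 1) - 1
        warped = F.grid_sample(edited_frames[t : t + 1].to(device), grid, align_corners=True)
        errs.append((warped - edited_frames[t + 1 : t + 2].to(device)).abs().mean().item())
    return sum(errs) / len(errs)
```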
We further evaluate our correspondences and video representation by measuring the accuracy of video reconstruction using TokenFlow. Specifically, we reconstruct the video using the same pipeline as our editing method, only removing the keyframe editing step. Table 2 reports the PSNR and LPIPS distance of this reconstruction, compared to vanilla DDIM reconstruction. As seen, TokenFlow reconstruction slightly improves upon DDIM inversion, demonstrating a robust frame representation. This improvement can be attributed to the keyframe randomization: it increases robustness to challenging frames, since each frame is reconstructed from multiple other frames during generation. Notably, our evaluation focuses on the accuracy of correspondences within the feature space during generation, rather than on evaluating RGB frame correspondences, which are not essential to our method.
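For reference, the per-frame reconstruction metrics reported in Table 2 can be computed along the lines of the following sketch. It assumes the `lpips` package with an AlexNet backbone (a common default, not necessarily the configuration used here) and frames given as float tensors in [0, 1].

```python
# Hypothetical per-frame PSNR / LPIPS computation for comparing a TokenFlow (or DDIM)
# reconstruction against the original frames. Assumes the `lpips` pip package.
import torch
import lpips

lpips_fn = lpips.LPIPS(net="alex")  # perceptual distance; backbone choice is an assumption


def psnr(a, b, eps=1e-8):
    """PSNR in dB for images in [0, 1]."""
    mse = torch.mean((a - b) ** 2)
    return (10 * torch.log10(1.0 / (mse + eps))).item()


def reconstruction_metrics(original_frames, reconstructed_frames):
    """Both inputs: float tensors (T, 3, H, W) in [0, 1]; returns mean PSNR and LPIPS."""
    psnrs, dists = [], []
    for orig, rec in zip(original_frames, reconstructed_frames):
        psnrs.append(psnr(orig, rec))
        with torch.no_grad():
            # lpips expects inputs scaled to [-1, 1]
            d = lpips_fn(orig[None] * 2 - 1, rec[None] * 2 - 1)
        dists.append(d.item())
    return sum(psnrs) / len(psnrs), sum(dists) / len(dists)
```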
This paper is available on arXiv under the CC BY 4.0 DEED license.
Authors:
(1) Michal Geyer, Weizmann Institute of Science (equal contribution);
(2) Omer Bar-Tal, Weizmann Institute of Science (equal contribution);
(3) Shai Bagon, Weizmann Institute of Science;
(4) Tali Dekel, Weizmann Institute of Science.