Discussing TokenFlow: A Clear and Simple Explanation by @kinetograph



Too Long; Didn't Read

We study the internal representation of a video in the diffusion feature space, and demonstrate that consistent video editing can be achieved via consistent diffusion feature representation during the generation.

Abstract and 1. Introduction

2 Related Work

3 Preliminaries

4 Method

4.1 Keyframe Sampling and Joint Editing

4.2 Edit Propagation Via TokenFlow

5 Results

5.1 Qualitative Evaluation and 5.2 Quantitative Evaluation

5.3 Ablation Study

6 Discussion

7 Acknowledgement and References

A Implementation Details

6 DISCUSSION

We presented a new framework for text-driven video editing using an image diffusion model. We study the internal representation of a video in the diffusion feature space, and demonstrate that consistent video editing can be achieved via consistent diffusion feature representation during the generation. Our method outperforms existing baselines, demonstrating a significant improvement in temporal consistency.

As for limitations, our method is tailored to preserve the motion of the original video, and as such it cannot handle edits that require structural changes (Fig. 7). Moreover, our method is built upon a diffusion-based image editing technique to preserve the structure of the original frames. When the image-editing technique fails to preserve the structure, our method enforces correspondences that are meaningless in the edited frames, resulting in visual artifacts. Lastly, the LDM decoder introduces some high-frequency flickering (Blattmann et al., 2023). A possible solution would be to combine our framework with an improved decoder (e.g., Blattmann et al. (2023); Zhu et al. (2023)). We note that this minor level of flickering can be easily eliminated with existing post-process deflickering (see SM).

Our work sheds new light on the internal representation of natural videos in the space of diffusion models (e.g., temporal redundancies), and on how these properties can be leveraged for enhancing video synthesis. We believe it can inspire future research in harnessing image models for video tasks and in the design of text-to-video models.
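To make the idea of a shared diffusion feature representation more concrete, here is a minimal sketch of the propagation step described above. This is not the authors' implementation; the tensor shapes, the helper names (nearest_neighbor_indices, propagate_edited_tokens), and the cosine-similarity matching are illustrative assumptions. The sketch replaces every frame's tokens with edited keyframe tokens, selected via nearest-neighbor correspondences computed on the original video's features.

```python
# A minimal, self-contained sketch (not the authors' code) of editing through a
# shared diffusion feature representation: tokens of every frame are replaced by
# edited keyframe tokens, chosen via nearest-neighbor correspondences computed
# on the ORIGINAL (unedited) video's features. Shapes and names are assumptions.
import torch
import torch.nn.functional as F


def nearest_neighbor_indices(frame_tokens: torch.Tensor,
                             key_tokens: torch.Tensor) -> torch.Tensor:
    """For each of the N frame tokens [N, D], return the index of its most
    similar (cosine similarity) token among the M keyframe tokens [M, D]."""
    frame = F.normalize(frame_tokens, dim=-1)
    key = F.normalize(key_tokens, dim=-1)
    return (frame @ key.T).argmax(dim=-1)  # [N]


def propagate_edited_tokens(orig_frame_tokens: list[torch.Tensor],
                            orig_key_tokens: torch.Tensor,
                            edited_key_tokens: torch.Tensor) -> list[torch.Tensor]:
    """Give every frame the edited features of its corresponding keyframe
    tokens, so all edited frames share one consistent representation."""
    edited_frames = []
    for tokens in orig_frame_tokens:
        idx = nearest_neighbor_indices(tokens, orig_key_tokens)  # correspondences
        edited_frames.append(edited_key_tokens[idx])             # pull edited features
    return edited_frames
```

Because the correspondences come from the original video, the sketch also makes the failure mode above concrete: if the image-editing step changes the structure of the keyframes, the pulled features no longer match the frames they are pasted into, which is where visual artifacts appear.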


This paper is available on arXiv under the CC BY 4.0 DEED license.

Authors:

(1) Michal Geyer, Weizmann Institute of Science (equal contribution);

(2) Omer Bar-Tal, Weizmann Institute of Science (equal contribution);

(3) Shai Bagon, Weizmann Institute of Science;

(4) Tali Dekel, Weizmann Institute of Science.