4.1 Keyframe Sampling and Joint Editing
4.2 Edit Propagation Via TokenFlow
5.1 Qualitative Evaluation and 5.2 Quantitative Evaluation
7 Acknowledgement and References
StableDiffusion. We use Stable Diffusion as our pre-trained text-to-image model; specifically, we use the Stable Diffusion v2-1 checkpoint provided on the official Hugging Face page.
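For reference, below is a minimal sketch of loading this checkpoint with the Hugging Face diffusers library. The model id "stabilityai/stable-diffusion-2-1" and the use of diffusers are assumptions on our part; the paper does not specify its loading code.

```python
import torch
from diffusers import StableDiffusionPipeline, DDIMScheduler

# Assumed Hugging Face Hub id for the Stable Diffusion v2-1 checkpoint.
pipe = StableDiffusionPipeline.from_pretrained(
    "stabilityai/stable-diffusion-2-1",
    torch_dtype=torch.float16,
)

# Swap in a DDIM scheduler, since the paper uses deterministic DDIM sampling.
pipe.scheduler = DDIMScheduler.from_config(pipe.scheduler.config)
pipe = pipe.to("cuda")
```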
DDIM inversion. In all of our experiments, we use DDIM deterministic sampling with 50 steps. For inverting the video, we follow Tumanyan et al. (2023) and use DDIM inversion with a classifier-free guidance scale of 1 and 1000 forward steps; we extract the self-attention input tokens from this process, similarly to Qi et al. (2023).
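The sketch below illustrates such a deterministic DDIM inversion loop using diffusers' DDIMInverseScheduler. The helper name ddim_invert and the returned trajectory are hypothetical and shown only to illustrate the procedure; with a guidance scale of 1, a single conditional pass through the UNet suffices, and the self-attention input tokens mentioned above would be collected with hooks during this loop.

```python
import torch
from diffusers import DDIMInverseScheduler

@torch.no_grad()
def ddim_invert(pipe, latents, prompt_embeds, num_steps=1000):
    """Map clean frame latents back to noise with deterministic DDIM (hypothetical helper)."""
    inverse_scheduler = DDIMInverseScheduler.from_config(pipe.scheduler.config)
    inverse_scheduler.set_timesteps(num_steps)
    trajectory = []  # intermediate latents; self-attention inputs can be hooked per step
    for t in inverse_scheduler.timesteps:
        # Guidance scale 1: only the conditional noise prediction is needed.
        noise_pred = pipe.unet(latents, t, encoder_hidden_states=prompt_embeds).sample
        latents = inverse_scheduler.step(noise_pred, t, latents).prev_sample
        trajectory.append(latents)
    return latents, trajectory
```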
Runtime. Since we do not compute the attention module on most video frames (i.e., we compute the self-attention output only on the keyframes), our method is efficient at run-time: sampling the video takes 20% less time than per-frame editing. The inversion process with 1000 steps is the main run-time bottleneck of our method, and in many cases a significantly smaller number of steps (e.g., 50) is sufficient. Table 3 reports runtime comparisons using 50 steps for all methods. Notably, our sampling time is indeed faster than that of per-frame editing (PnP).
Baselines. For the Tune-A-Video baseline (Wu et al., 2022), we used the official repository. For Gen-1 (Esser et al., 2023), we used the platform on the Runway website. This platform outputs a video whose length and frame rate differ from those of the input video; therefore, we could not compute the warping error on their results. For Text2Video-Zero (Khachatryan et al., 2023b), we used the official repository with the depth-conditioning configuration. For FateZero (Qi et al., 2023), we used the official repository and verified the run configurations with the authors.
This paper is available on arXiv under a CC BY 4.0 DEED license.
Authors:
(1) Michal Geyer, Weizmann Institute of Science (equal contribution);

(2) Omer Bar-Tal, Weizmann Institute of Science (equal contribution);
(3) Shai Bagon, Weizmann Institute of Science;
(4) Tali Dekel, Weizmann Institute of Science.