TokenFlow's Implementation Details: Everything That We Used

Written by kinetograph | Published 2024/12/18
Tech Story Tags: stable-diffusion | diffusion-models | ddim-deterministic-sampling | ddim-inversion | tune-a-video | fate-zero | tokenflow | what-is-tokenflow

TL;DR: We use Stable Diffusion as our pre-trained text-to-image model; specifically, the Stable Diffusion v2.1 checkpoint provided on the official Hugging Face page.

Table of Links

Abstract and 1. Introduction

2 Related Work

3 Preliminaries

4 Method

4.1 Key Sample and Joint Editing

4.2 Edit Propagation Via TokenFlow

5 Results

5.1 Qualitative Evaluation and 5.2 Quantitative Evaluation

5.3 Ablation Study

6 Discussion

7 Acknowledgement and References

A Implementation Details

A IMPLEMENTATION DETAILS

Stable Diffusion. We use Stable Diffusion as our pre-trained text-to-image model; specifically, we use the Stable Diffusion v2.1 checkpoint provided on the official Hugging Face page.
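For reference, loading this checkpoint with the Hugging Face diffusers library might look like the following. This is an illustrative sketch, not the authors' code; the model id `stabilityai/stable-diffusion-2-1` is the public Hub checkpoint, and the device/dtype choices are assumptions.

```python
# Illustrative sketch (not the authors' code): load the public
# Stable Diffusion 2.1 checkpoint from the Hugging Face Hub.
import torch
from diffusers import StableDiffusionPipeline, DDIMScheduler

model_id = "stabilityai/stable-diffusion-2-1"
pipe = StableDiffusionPipeline.from_pretrained(model_id, torch_dtype=torch.float16)
# The paper uses DDIM deterministic sampling, so swap in a DDIM scheduler.
pipe.scheduler = DDIMScheduler.from_config(pipe.scheduler.config)
pipe = pipe.to("cuda")
```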

DDIM inversion. In all of our experiments, we use DDIM deterministic sampling with 50 steps. For inverting the video, we follow Tumanyan et al. (2023) and use DDIM inversion with a classifier-free guidance scale of 1 and 1000 forward steps, and we extract the self-attention input tokens from this process similarly to Qi et al. (2023).
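To make this setup concrete, here is a hedged sketch of deterministic DDIM inversion over the 1000 forward steps, using only the conditional noise prediction (equivalent to a classifier-free guidance scale of 1). It assumes the `pipe` object from the snippet above and a VAE-encoded frame latent `z0`; it is a simplified reimplementation, not the authors' code, and it omits the extraction of the self-attention input tokens.

```python
# Hedged sketch of deterministic DDIM inversion (clean latent z0 -> noise z_T).
# Assumes `pipe` from the previous snippet; not the authors' implementation.
import torch

@torch.no_grad()
def ddim_invert(pipe, z0, prompt, num_steps=1000):
    device = z0.device
    # Encode the prompt once; no unconditional branch is used, which
    # corresponds to a classifier-free guidance scale of 1.
    tokens = pipe.tokenizer(prompt, padding="max_length",
                            max_length=pipe.tokenizer.model_max_length,
                            truncation=True, return_tensors="pt")
    text_emb = pipe.text_encoder(tokens.input_ids.to(device))[0]

    alphas = pipe.scheduler.alphas_cumprod.to(device, dtype=z0.dtype)
    T = pipe.scheduler.config.num_train_timesteps        # 1000 for SD 2.1
    step = T // num_steps                                 # 1 when num_steps == 1000
    ts = list(range(0, T, step))                          # ascending noise levels

    z = z0.clone()
    for i, t in enumerate(ts):
        eps = pipe.unet(z, t, encoder_hidden_states=text_emb).sample
        a_t = alphas[t]
        a_next = alphas[ts[i + 1]] if i + 1 < len(ts) else alphas[-1]
        # Reverse of the deterministic DDIM update:
        # predict x0 at level t, then re-noise it to the next (higher) level.
        x0_pred = (z - (1 - a_t).sqrt() * eps) / a_t.sqrt()
        z = a_next.sqrt() * x0_pred + (1 - a_next).sqrt() * eps
    return z   # inverted noise latent used as the starting point for editing
```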

Runtime. Since we do not compute the attention module on most video frames (i.e., we only compute the self-attention output on the keyframes), our method is efficient at run-time: sampling the video reduces the per-frame editing time by 20%. The inversion process with 1000 steps is the main run-time bottleneck of our method, and in many cases a significantly smaller number of steps is sufficient (e.g., 50). Table 3 reports runtime comparisons using 50 steps for all methods. Notably, our sampling time is indeed faster than that of per-frame editing (PnP).
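As a rough illustration of why this saves compute (a toy sketch under our own assumptions, not the authors' implementation): self-attention is evaluated only on the keyframes, and every other frame gathers its attention output from the keyframe tokens via nearest-neighbour correspondences computed once from the original video features. All names below are hypothetical.

```python
# Toy sketch of keyframe-only attention with propagation to the other frames.
# Illustrative only; not the authors' code.
import torch

def propagate_keyframe_attention(tokens, keyframe_ids, attention, nn_index):
    """
    tokens:       [F, N, C] per-frame token features entering the attention block
    keyframe_ids: indices of the keyframes (e.g., every k-th frame)
    attention:    the (expensive) self-attention module, applied to keyframes only
    nn_index:     [F, N] index of each token's nearest keyframe token,
                  computed once from the original (unedited) video features
    """
    key_out = attention(tokens[keyframe_ids])        # [K, N, C], the heavy compute
    flat = key_out.reshape(-1, key_out.shape[-1])    # [K*N, C]
    # Every frame gathers its attention output from the keyframe tokens.
    return flat[nn_index]                            # [F, N, C]
```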

Baselines. For the Tune-A-Video baseline (Wu et al., 2022), we used their official repository. For Gen-1 (Esser et al., 2023), we used their platform on the Runway website; this platform outputs a video that does not match the length and frame rate of the input video, so we could not compute the warping error on their results. For Text2Video-Zero (Khachatryan et al., 2023b), we used their official repository with their depth-conditioning configuration. For FateZero (Qi et al., 2023), we used their official repository and verified the run configurations with the authors.

This paper is available on arXiv under a CC BY 4.0 DEED license.

Authors:

(1) Michal Geyer, Weizmann Institute of Science (equal contribution);

(2) Omer Bar-Tal, Weizmann Institute of Science (equal contribution);

(3) Shai Bagon, Weizmann Institute of Science;

(4) Tali Dekel, Weizmann Institute of Science.

