(1) Feng Liang, The University of Texas at Austin (work partially done during an internship at Meta GenAI) (Email: [email protected]);
(2) Bichen Wu, Meta GenAI (corresponding author);
(3) Jialiang Wang, Meta GenAI;
(4) Licheng Yu, Meta GenAI;
(5) Kunpeng Li, Meta GenAI;
(6) Yinan Zhao, Meta GenAI;
(7) Ishan Misra, Meta GenAI;
(8) Jia-Bin Huang, Meta GenAI;
(9) Peizhao Zhang, Meta GenAI (Email: [email protected]);
(10) Peter Vajda, Meta GenAI (Email: [email protected]);
(11) Diana Marculescu, The University of Texas at Austin (Email: [email protected]).
Benefiting from large-scale pre-trained text-to-image (T2I) diffusion models [2, 11, 39, 40], progress has been made in text-based image-to-image (I2I) generation [10, 14, 24, 30, 31, 33, 43, 51]. Beginning with image editing methods, Prompt-to-Prompt [14] and PnP [43] manipulate the attention maps in the diffusion process to edit images according to target prompts. InstructPix2Pix [4] goes a step further by training an I2I model that directly interprets and follows human instructions. More recently, I2I methods have extended user control by allowing reference images to precisely define the target image composition. Notably, ControlNet, T2I-Adapter [31], and Composer [20] introduce spatial conditions, such as depth maps, enabling generated images to replicate the structure of the reference. Our method falls into this category, as we aim to generate a new video while preserving the spatial composition of the original one. However, simply applying these I2I methods to individual video frames yields unsatisfactory results due to the inherent challenge of maintaining consistency across independently generated frames (per-frame results can be found in Section 5.2).
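The attention manipulation used by Prompt-to-Prompt can be illustrated with a minimal sketch: for a word-swap edit, the cross-attention map computed with the source prompt's keys is reused, while the values come from the target prompt's token embeddings. The NumPy sketch below is a simplified single-head, single-step illustration under assumed shapes; the function and variable names are ours, not the paper's implementation.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def attention_probs(q, k):
    # standard scaled dot-product attention weights
    return softmax(q @ k.T / np.sqrt(q.shape[-1]), axis=-1)

def prompt_to_prompt_step(q, k_src, v_tgt):
    # word-swap edit (illustrative): keep the SOURCE prompt's attention
    # map, which fixes the spatial layout, but read VALUES from the
    # target prompt's tokens (prompts assumed to have equal length)
    a_src = attention_probs(q, k_src)   # (n_pixels, n_tokens)
    return a_src @ v_tgt                # edited features, (n_pixels, d)

rng = np.random.default_rng(0)
q = rng.standard_normal((16, 8))       # 16 image tokens, dim 8
k_src = rng.standard_normal((5, 8))    # 5 source-prompt tokens
v_tgt = rng.standard_normal((5, 8))    # 5 target-prompt tokens
out = prompt_to_prompt_step(q, k_src, v_tgt)
```

Because the attention map is taken from the source branch, the edited image keeps the source layout while the new token values change the content; this is the intuition behind attention-based editing.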
To jointly generate multiple coherent frames, it is now standard practice to inflate image models to video by replacing spatial-only attention with spatial-temporal attention. For instance, Tune-A-Video [46], vid2vid-zero [44], Text2Video-Zero [25], Pix2Video [6], and FateZero [35] perform cross-frame attention, attending each frame to anchor frames (usually the first and the previous frame) to preserve appearance consistency. TokenFlow [13] further explicitly enforces semantic correspondences of diffusion features across frames to improve consistency. Furthermore, more recent works add spatial controls, e.g., depth maps, to constrain the generation. Zhang's ControlVideo [50] proposes to extend image-based ControlNet to the video domain with full cross-frame attention. Gen-1 [12], VideoComposer [45], Control-A-Video [7], and Zhao's ControlVideo [52] train V2V models on paired spatial controls and video data. Our method falls into the same category, but it additionally incorporates imperfect temporal flow information into the training process alongside spatial controls, which enhances its overall robustness and adaptability.
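The cross-frame attention described above can be sketched as follows: queries come from the current frame, while keys and values are gathered from anchor frames (e.g., the first and the previous frame), so every frame is rendered against a shared appearance reference. This NumPy sketch is a simplified single-head illustration under assumed shapes, not any specific paper's implementation.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def cross_frame_attention(q_cur, anchor_feats):
    # q_cur: (n, d) queries from the current frame's features
    # anchor_feats: list of (n, d) feature maps from anchor frames,
    # e.g. [first_frame, previous_frame]
    kv = np.concatenate(anchor_feats, axis=0)                # (m, d)
    attn = softmax(q_cur @ kv.T / np.sqrt(q_cur.shape[-1]))  # (n, m)
    return attn @ kv                                         # (n, d)

rng = np.random.default_rng(0)
first, prev, cur = (rng.standard_normal((16, 8)) for _ in range(3))
out = cross_frame_attention(cur, [first, prev])
```

Restricting keys and values to anchor frames (instead of full spatial-temporal attention over all frames) keeps the cost close to per-frame attention while still tying each frame's appearance to the same references.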
Another line of work represents a video as 2D images, as seen in methods like Layered Neural Atlases [23], Text2LIVE [3], shape-aware editing [26], and CoDeF [32]. However, these methods typically require per-video optimization, and their performance degrades under large motion, which makes constructing such image representations difficult.
The use of optical flow to propagate edits across frames predates diffusion models, as demonstrated by the well-known EbSynth [22] approach. In the era of diffusion models, Chu's Video ControlNet [9] employs the ground-truth (GT) optical flow from synthetic videos to enforce temporal consistency among corresponding pixels across frames. However, ground-truth flow is typically unavailable for real-world videos, where flow must instead be estimated with pretrained models [42, 47, 48]. Recent methods such as Rerender [49], MeDM [8], and Hu's VideoControlNet [19] use estimated flow to generate occlusion masks for in-painting; in other words, they "force" the overlapping regions to remain consistent based on flow estimates. Similarly, CoDeF [32] utilizes flow to guide the generation of canonical images. These approaches all treat flow as an accurate supervision signal that must be strictly adhered to. In contrast, our FlowVid recognizes the imperfections inherent in flow estimation and presents an approach that leverages its potential without imposing rigid constraints.
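The occlusion masks used by these flow-based methods are commonly derived from a forward-backward consistency check: a pixel is flagged as occluded when the forward flow and the backward flow sampled at its warped location do not cancel out. The NumPy sketch below uses nearest-neighbor warping and a hypothetical threshold; it illustrates the general idea rather than reproducing any of the cited methods.

```python
import numpy as np

def warp(img, flow):
    # backward-warp img with per-pixel flow (H, W, 2) holding (dx, dy),
    # using nearest-neighbor sampling and edge clamping
    H, W = img.shape[:2]
    ys, xs = np.mgrid[0:H, 0:W]
    src_x = np.clip(np.round(xs + flow[..., 0]).astype(int), 0, W - 1)
    src_y = np.clip(np.round(ys + flow[..., 1]).astype(int), 0, H - 1)
    return img[src_y, src_x]

def occlusion_mask(flow_fwd, flow_bwd, thresh=0.5):
    # forward-backward consistency: where the forward flow and the
    # backward flow at the warped location fail to cancel, mark occluded
    bwd_at_fwd = warp(flow_bwd, flow_fwd)
    err = np.linalg.norm(flow_fwd + bwd_at_fwd, axis=-1)
    return err > thresh

# consistent flows: constant +1 px horizontal forward, -1 px backward
fwd = np.zeros((4, 4, 2)); fwd[..., 0] = 1.0
bwd = -fwd
mask = occlusion_mask(fwd, bwd)   # no pixel flagged as occluded
```

Methods like Rerender then keep the unmasked (flow-consistent) regions fixed and in-paint only the masked ones, which is exactly the "flow as hard constraint" behavior the paragraph above contrasts with FlowVid.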
This paper is available on arxiv under CC 4.0 license.