(1) Feng Liang, The University of Texas at Austin and Work partially done during an internship at Meta GenAI (Email: jeffliang@utexas.edu);
(2) Bichen Wu, Meta GenAI and Corresponding author;
(3) Jialiang Wang, Meta GenAI;
(4) Licheng Yu, Meta GenAI;
(5) Kunpeng Li, Meta GenAI;
(6) Yinan Zhao, Meta GenAI;
(7) Ishan Misra, Meta GenAI;
(8) Jia-Bin Huang, Meta GenAI;
(9) Peizhao Zhang, Meta GenAI (Email: stzpz@meta.com);
(10) Peter Vajda, Meta GenAI (Email: vajdap@meta.com);
(11) Diana Marculescu, The University of Texas at Austin (Email: dianam@utexas.edu). Table of Links Abstract and Introduction
2. Related Work
3. Preliminary
4. FlowVid
4.1. Inflating image U-Net to accommodate video
4.2. Training with joint spatial-temporal conditions
4.3. Generation: edit the first frame then propagate


Experiments


5.1. Settings
5.2. Qualitative results
5.3. Quantitative results
5.4. Ablation study and 5.5. Limitations
Conclusion, Acknowledgments and References
A. Webpage Demo and B. Quantitative comparisons 5.4. Ablation study Condition combinations We study the four types of conditions in Figure 6(a): (I) Spatial controls: such as depth maps [38]. (II) Flow warped video: frames warped from the first frame using optical flow. (III) Flow occlusion: masks indicate which parts are occluded (marked as white). (IV) First frame. We evaluate combinations of these conditions in Figure 6(b), assessing their effectiveness by their winning rate against our full model which contains all four conditions. The spatial-only condition achieved a 9% winning rate, limited by its lack of temporal information. Including flow warped video significantly improved the winning rate to 38%, underscoring the importance of temporal guidance. We use gray pixels to indicate occluded areas, which might blend in with the original gray colors in the images. To avoid potential confusion, we further include a binary flow occlusion mask, which better helps the model to tell which part is occluded or not. The winning rate is further improved to 42%. Finally, we added the first frame condition to provide better texture guidance, particularly useful when the occlusion mask is large and few original pixels remain. Different control type: edge and depth We study two types of spatial conditions in our FlowVid: canny edge [5] and depth map [38]. Given an input frame as shown in Figure 7(a), the canny edge retains more details than the depth map, as seen from the eyes and mouth of the panda. The strength of spatial control would, in turn, affect the video editing. For style transfer prompt ’A Chinese ink painting of a panda eating bamboo’, as shown in Figure 7(c), the output of canny condition could keep the mouth of the panda in the right position while the depth condition would guess where the mouth is and result in an open mouth. The flexibility of the depth map, however, would be beneficial if we are doing object swap with prompt ’A koala eating bamboo’, as shown in Figure 7(d); the canny edge would put a pair of panda eyes on the face of the koala due to the strong control, while depth map would result in a better koala edit. During our evaluation, we found canny edge works better when we want to keep the structure of the input video as much as possible, such as stylization. The depth map works better if we have a larger scene change, such as an object swap, which requires more considerable editing flexibility. v**-prediction and** ϵ**-prediction** While ϵ-prediction is commonly used for parameterization in diffusion models, we found it may suffer from unnatural global color shifts across frames, as shown in Figure 8. Even though all these two methods use the same flow warped video, the ϵ-prediction introduces an unnatural grayer color. This phenomenon is also found in Imagen-Video [17]. 5.5. Limitations Although our FlowVid achieves significant performance, it does have some limitations. First, our FlowVid heavily relies on the first frame generation, which should be structurally aligned with the input frame. As shown in Figure 9(a), the edited first frame identifies the hind legs of the elephant as the front nose. The erroneous nose would propagate to the following frame and result in an unsatisfactory final prediction. The other challenge is when the camera or the object moves so fast that large occlusions occur. In this case, our model would guess, sometimes hallucinate, the missing blank regions. As shown in Figure 9(b), when the ballerina turns her body and head, the entire body part is masked out. Our model manages to handle the clothes but turns the back of the head into the front face, which would be confusing if displayed in a video. This paper is available on arxiv under CC 4.0 license. (1) Feng Liang, The University of Texas at Austin and Work partially done during an internship at Meta GenAI (Email: jeffliang@utexas.edu); (2) Bichen Wu, Meta GenAI and Corresponding author; (3) Jialiang Wang, Meta GenAI; (4) Licheng Yu, Meta GenAI; (5) Kunpeng Li, Meta GenAI; (6) Yinan Zhao, Meta GenAI; (7) Ishan Misra, Meta GenAI; (8) Jia-Bin Huang, Meta GenAI; (9) Peizhao Zhang, Meta GenAI (Email: stzpz@meta.com); (10) Peter Vajda, Meta GenAI (Email: vajdap@meta.com); (11) Diana Marculescu, The University of Texas at Austin (Email: dianam@utexas.edu). (1) Feng Liang, The University of Texas at Austin and Work partially done during an internship at Meta GenAI (Email: jeffliang@utexas.edu); (2) Bichen Wu, Meta GenAI and Corresponding author; (3) Jialiang Wang, Meta GenAI; (4) Licheng Yu, Meta GenAI; (5) Kunpeng Li, Meta GenAI; (6) Yinan Zhao, Meta GenAI; (7) Ishan Misra, Meta GenAI; (8) Jia-Bin Huang, Meta GenAI; (9) Peizhao Zhang, Meta GenAI (Email: stzpz@meta.com); (10) Peter Vajda, Meta GenAI (Email: vajdap@meta.com); (11) Diana Marculescu, The University of Texas at Austin (Email: dianam@utexas.edu). Table of Links Abstract and Introduction 2. Related Work 3. Preliminary 4. FlowVid 4.1. Inflating image U-Net to accommodate video 4.2. Training with joint spatial-temporal conditions 4.3. Generation: edit the first frame then propagate Experiments 5.1. Settings 5.2. Qualitative results 5.3. Quantitative results 5.4. Ablation study and 5.5. Limitations Conclusion, Acknowledgments and References A. Webpage Demo and B. Quantitative comparisons Abstract and Introduction Abstract and Introduction 2. Related Work 2. Related Work 3. Preliminary 3. Preliminary 4. FlowVid 4. FlowVid 4 .1. Inflating image U-Net to accommodate video .1. Inflating image U-Net to accommodate video 4.2. Training with joint spatial-temporal conditions 4.2. Training with joint spatial-temporal conditions 4.3. Generation: edit the first frame then propagate 4.3. Generation: edit the first frame then propagate Experiments Experiments Experiments 5.1. Settings 5.1. Settings 5.2. Qualitative results 5.2. Qualitative results 5.3. Quantitative results 5.3. Quantitative results 5.4. Ablation study and 5.5. Limitations 5.4. Ablation study and 5.5. Limitations Conclusion, Acknowledgments and References Conclusion, Acknowledgments and References A. Webpage Demo and B. Quantitative comparisons A. Webpage Demo and B. Quantitative comparisons 5.4. Ablation study Condition combinations We study the four types of conditions in Figure 6(a): (I) Spatial controls: such as depth maps [38]. (II) Flow warped video: frames warped from the first frame using optical flow. (III) Flow occlusion: masks indicate which parts are occluded (marked as white). (IV) First frame. We evaluate combinations of these conditions in Figure 6(b), assessing their effectiveness by their winning rate against our full model which contains all four conditions. The spatial-only condition achieved a 9% winning rate, limited by its lack of temporal information. Including flow warped video significantly improved the winning rate to 38%, underscoring the importance of temporal guidance. We use gray pixels to indicate occluded areas, which might blend in with the original gray colors in the images. To avoid potential confusion, we further include a binary flow occlusion mask, which better helps the model to tell which part is occluded or not. The winning rate is further improved to 42%. Finally, we added the first frame condition to provide better texture guidance, particularly useful when the occlusion mask is large and few original pixels remain. Condition combinations Different control type: edge and depth We study two types of spatial conditions in our FlowVid: canny edge [5] and depth map [38]. Given an input frame as shown in Figure 7(a), the canny edge retains more details than the depth map, as seen from the eyes and mouth of the panda. The strength of spatial control would, in turn, affect the video editing. For style transfer prompt ’A Chinese ink painting of a panda eating bamboo’, as shown in Figure 7(c), the output of canny condition could keep the mouth of the panda in the right position while the depth condition would guess where the mouth is and result in an open mouth. The flexibility of the depth map, however, would be beneficial if we are doing object swap with prompt ’A koala eating bamboo’, as shown in Figure 7(d); the canny edge would put a pair of panda eyes on the face of the koala due to the strong control, while depth map would result in a better koala edit. During our evaluation, we found canny edge works better when we want to keep the structure of the input video as much as possible, such as stylization. The depth map works better if we have a larger scene change, such as an object swap, which requires more considerable editing flexibility. Different control type: edge and depth v**-prediction and** ϵ**-prediction** While ϵ-prediction is commonly used for parameterization in diffusion models, we found it may suffer from unnatural global color shifts across frames, as shown in Figure 8. Even though all these two methods use the same flow warped video, the ϵ-prediction introduces an unnatural grayer color. This phenomenon is also found in Imagen-Video [17]. 5.5. Limitations Although our FlowVid achieves significant performance, it does have some limitations. First, our FlowVid heavily relies on the first frame generation, which should be structurally aligned with the input frame. As shown in Figure 9(a), the edited first frame identifies the hind legs of the elephant as the front nose. The erroneous nose would propagate to the following frame and result in an unsatisfactory final prediction. The other challenge is when the camera or the object moves so fast that large occlusions occur. In this case, our model would guess, sometimes hallucinate, the missing blank regions. As shown in Figure 9(b), when the ballerina turns her body and head, the entire body part is masked out. Our model manages to handle the clothes but turns the back of the head into the front face, which would be confusing if displayed in a video. This paper is available on arxiv under CC 4.0 license. This paper is available on arxiv under CC 4.0 license. available on arxiv

Part of HackerNoon's growing list of open-source research papers, promoting free access to academic material.

FlowVid: Taming Imperfect Optical Flows: Conclusion, Acknowledgments and References

FlowVid: Taming Imperfect Optical Flows: Ablation Study and Limitations

About Author

Comments

TOPICS

THIS ARTICLE WAS FEATURED IN

Related Stories

A Reference List to Learn More About Image Editing, Video Editing, and Diffusion Models

Finding AI-Generated Faces in the Wild: Model

Finding AI-Generated Faces in the Wild: Data sets

Finding AI-Generated Faces in the Wild: Results

Finding AI-Generated Faces in the Wild: Discussion, Acknowledgements, and References

Finding AI-Generated Faces in the Wild: Abstract and Intro

A Reference List to Learn More About Image Editing, Video Editing, and Diffusion Models

Finding AI-Generated Faces in the Wild: Model

Finding AI-Generated Faces in the Wild: Data sets

Finding AI-Generated Faces in the Wild: Results

Finding AI-Generated Faces in the Wild: Discussion, Acknowledgements, and References

Finding AI-Generated Faces in the Wild: Abstract and Intro

Light-Mode

Classic

Newspaper

Minty

Dark-Mode

Neon Noir

Minty

HN StartUps