paint-brush
FlowVid: Taming Imperfect Optical Flows: Ablation Study and Limitationsby@kinetograph

FlowVid: Taming Imperfect Optical Flows: Ablation Study and Limitations

by Kinetograph: The Video Editing Technology Publication
Kinetograph: The Video Editing Technology Publication HackerNoon profile picture

Kinetograph: The Video Editing Technology Publication

@kinetograph

The Kinetograph's the 1st motion-picture camera. At Kinetograph.Tech, we cover...

October 9th, 2024
Read on Terminal Reader
Read this story in a terminal
Print this story
Read this story w/o Javascript
Read this story w/o Javascript
tldt arrow

Too Long; Didn't Read

This paper proposes a consistent V2V synthesis framework by jointly leveraging spatial conditions and temporal optical flow clues within the source video.
featured image - FlowVid: Taming Imperfect Optical Flows: Ablation Study and Limitations
1x
Read by Dr. One voice-avatar

Listen to this story

Kinetograph: The Video Editing Technology Publication HackerNoon profile picture
Kinetograph: The Video Editing Technology Publication

Kinetograph: The Video Editing Technology Publication

@kinetograph

The Kinetograph's the 1st motion-picture camera. At Kinetograph.Tech, we cover cutting edge tech for video editing.

Learn More
LEARN MORE ABOUT @KINETOGRAPH'S
EXPERTISE AND PLACE ON THE INTERNET.
0-item

STORY’S CREDIBILITY

Academic Research Paper

Academic Research Paper

Part of HackerNoon's growing list of open-source research papers, promoting free access to academic material.

(1) Feng Liang, The University of Texas at Austin and Work partially done during an internship at Meta GenAI (Email: jeffliang@utexas.edu);

(2) Bichen Wu, Meta GenAI and Corresponding author;

(3) Jialiang Wang, Meta GenAI;

(4) Licheng Yu, Meta GenAI;

(5) Kunpeng Li, Meta GenAI;

(6) Yinan Zhao, Meta GenAI;

(7) Ishan Misra, Meta GenAI;

(8) Jia-Bin Huang, Meta GenAI;

(9) Peizhao Zhang, Meta GenAI (Email: stzpz@meta.com);

(10) Peter Vajda, Meta GenAI (Email: vajdap@meta.com);

(11) Diana Marculescu, The University of Texas at Austin (Email: dianam@utexas.edu).

5.4. Ablation study

Condition combinations We study the four types of conditions in Figure 6(a): (I) Spatial controls: such as depth maps [38]. (II) Flow warped video: frames warped from the first frame using optical flow. (III) Flow occlusion: masks indicate which parts are occluded (marked as white). (IV) First frame. We evaluate combinations of these conditions in Figure 6(b), assessing their effectiveness by their winning rate against our full model which contains all four conditions. The spatial-only condition achieved a 9% winning rate, limited by its lack of temporal information. Including flow warped video significantly improved the winning rate to 38%, underscoring the importance of temporal guidance. We use gray pixels to indicate occluded areas, which might blend in with the original gray colors in the images. To avoid potential confusion, we further include a binary flow occlusion mask, which better helps the model to tell which part is occluded or not. The winning rate is further improved to 42%. Finally, we added the first frame condition to provide better texture guidance, particularly useful when the occlusion mask is large and few original pixels remain.


image

image


Different control type: edge and depth We study two types of spatial conditions in our FlowVid: canny edge [5] and depth map [38]. Given an input frame as shown in Figure 7(a), the canny edge retains more details than the depth map, as seen from the eyes and mouth of the panda. The strength of spatial control would, in turn, affect the video editing. For style transfer prompt ’A Chinese ink painting of a panda eating bamboo’, as shown in Figure 7(c), the output of canny condition could keep the mouth of the panda in the right position while the depth condition would guess where the mouth is and result in an open mouth. The flexibility of the depth map, however, would be beneficial if we are doing object swap with prompt ’A koala eating bamboo’, as shown in Figure 7(d); the canny edge would put a pair of panda eyes on the face of the koala due to the strong control, while depth map would result in a better koala edit. During our evaluation, we found canny edge works better when we want to keep the structure of the input video as much as possible, such as stylization. The depth map works better if we have a larger scene change, such as an object swap, which requires more considerable editing flexibility.


image


v**-prediction and** ϵ**-prediction** While ϵ-prediction is commonly used for parameterization in diffusion models, we found it may suffer from unnatural global color shifts across frames, as shown in Figure 8. Even though all these two methods use the same flow warped video, the ϵ-prediction introduces an unnatural grayer color. This phenomenon is also found in Imagen-Video [17].

5.5. Limitations

Although our FlowVid achieves significant performance, it does have some limitations. First, our FlowVid heavily relies on the first frame generation, which should be structurally aligned with the input frame. As shown in Figure 9(a), the edited first frame identifies the hind legs of the elephant as the front nose. The erroneous nose would propagate to the following frame and result in an unsatisfactory final prediction. The other challenge is when the camera or the object moves so fast that large occlusions occur. In this case, our model would guess, sometimes hallucinate, the missing blank regions. As shown in Figure 9(b), when the ballerina turns her body and head, the entire body part is masked out. Our model manages to handle the clothes but turns the back of the head into the front face, which would be confusing if displayed in a video.



This paper is available on arxiv under CC 4.0 license.


L O A D I N G
. . . comments & more!

About Author

Kinetograph: The Video Editing Technology Publication HackerNoon profile picture
Kinetograph: The Video Editing Technology Publication@kinetograph
The Kinetograph's the 1st motion-picture camera. At Kinetograph.Tech, we cover cutting edge tech for video editing.

TOPICS

THIS ARTICLE WAS FEATURED IN...

Arweave
Read on Terminal Reader
Read this story in a terminal
 Terminal
Read this story w/o Javascript
Read this story w/o Javascript
 Lite
X
Kinetograph
X REMOVE AD