
Why It’s Very Difficult to Create AI-Based Slow Motion

by [email protected], June 11th, 2020



Over the last few years a number of open source machine learning
projects have emerged that are capable of raising the frame rate of
source video to 60 frames per second and beyond, producing a smoothed,
'hyper-real' look.

Footage from early 20th-century New York, upscaled with ESRGAN (now
MMSR), colored by DeOldify and raised to 60fps by DAIN (Depth-Aware
Video Frame Interpolation) [1]

An AI-based frame interpolation algorithm works by studying two existing consecutive frames of footage and then calculating an intermediary frame to place between them. The algorithm is powered by a model trained at length on many images, in order to learn how to create convincing transition frames.

The DAIN workflow (credit: Site of founder Wenbo Bao)

The new interstitial frame that's created is an estimated 'halfway point' between the state of the two original frames, and will use its own reconstructions as key frames where necessary:

Above, the original available source frames from a scene in Casablanca (1942). Below are the two original frames, plus the extra frames interpolated by the machine learning algorithm (DAIN).

These extra frames can either be used for higher frame rate playback, or to create a non-jerky slow-motion effect that the original footage does not otherwise support.

The process is repeated across the footage until the number of frames is doubled, tripled or even quadrupled as necessary.
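To make the 'halfway point' idea concrete, here is a deliberately naive sketch in Python. It is not how DAIN works (DAIN uses a trained, depth-aware model); it simply averages two frames, which is about the crudest interpolation possible, and then repeats the pass to multiply the frame count. The one-row 'frames' and the function names are invented for illustration.

```python
def blend_midpoint(frame_a, frame_b):
    """Naive in-between frame: the per-pixel average of two frames."""
    return [(a + b) / 2 for a, b in zip(frame_a, frame_b)]


def double_frames(frames):
    """Insert one blended frame between every consecutive pair of frames."""
    out = []
    for current, following in zip(frames, frames[1:]):
        out.append(current)
        out.append(blend_midpoint(current, following))
    out.append(frames[-1])
    return out


# A bright pixel (255) that moves two places to the right between frames.
frame_1 = [0, 255, 0, 0, 0]
frame_2 = [0, 0, 0, 255, 0]

clip_2x = double_frames([frame_1, frame_2])  # 3 frames
clip_4x = double_frames(clip_2x)             # 5 frames

print(clip_2x[1])  # [0.0, 127.5, 0.0, 127.5, 0.0]
```

The averaged frame contains two half-bright 'ghosts' rather than a single pixel at the midpoint position, which is why a learned, motion-aware method is needed for convincing results. Note also that each pass turns n frames into 2n - 1, so the 'doubling' is only approximate on very short clips.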

In this way a 29.97fps movie can be reinterpreted at roughly 60fps (2x), 120fps (4x), or any multiple that a playback device is capable of running and that the software is capable of generating from limited frame information.

Using AI Frame Interpolation For Slow Motion
Frame rate upscaling is a relatively new application for media synthesis in machine learning. It's currently being used in hobbyist circles to improve the appearance of jerky animation [2], and to breathe new life into historical archive footage that was shot at the very low frame rates of early cinema (among other experiments). We can look at some more examples of these pursuits later.

However, we are under no obligation to actually run this new 60fps footage at 60fps. If we play the extended footage back at its original frame rate, we get an AI-generated 'slow-motion' effect instead:

Four seconds of Casablanca (1942) stretched out to 15 seconds of smooth slow-motion footage with the DAIN machine learning image synthesis algorithm.

The above clip was generated with DAIN-app [3], the Windows app version
of the DAIN (Depth-Aware Video Frame Interpolation) open source repository at GitHub [4].

If you choose the right clip, the effect can be dazzling; if you don't, you'll quickly discover the limitations of the process, most of which have nothing to do with the quality of the AI software, but rather with the nature of the challenge.

AI-Generated Slow Motion Has Trouble With Native Motion Blur In Original Footage
Most of the glitches that occur in the interpolation process pass too quickly to notice when played at 60fps. During slow motion playback, such defects are usually much more apparent.

In the following AI-slomo derived from Goodfellas (1990), all is well at the start; not much is moving near the camera's point of view, and the slow-motion effect is quite smooth, as if it had actually been shot at a high frame rate.

Then, however, the motion blur from De Niro's sudden movements reminds us that this is artificial rather than native slow-motion footage:

When De Niro leans down to pick up the cord, the movement is too quick and too long to be captured in detail at 24fps [5] with a standard shutter angle and ordinary lighting. Therefore each individual frame of his head moving is blurred, like a time-lapse photo of car headlights at night [6].

It's not possible to automatically replace the detail that's lost to motion blur in original footage, and the machine learning algorithm will inevitably create a blurry interstitial frame between two frames that are already blurred in this way; it has no other information to work with.

Some sections in the Goodfellas clip have less movement, and act more like high quality slow motion in the style of Oxford Scientific Films (see video below). But these areas don't match up with the blur that De Niro's movements are generating, and so the illusion is compromised.

Conversely, it's difficult to shoot at high frame rates and then downsample the footage to anything resembling natural motion, because the lack of motion blur is jarring and incongruous.

For example, this footage, a show reel for Oxford Scientific Films, features a dog drinking in extreme slow-motion, and was shot at a very high frame rate:

If you remove enough frames to bring the footage nearer to real-time, the water droplets appear brittle and unnatural:

Image: Oxford Scientific Films - https://www.youtube.com/watch?v=jxQR0zyldYc

Thus the conditions for shooting natural and high speed slow-motion footage are mutually exclusive in most ways, to the extent that adding motion blur to real-life footage via third-party plugins and dedicated tools is an active sector in video editing and VFX.

Ironically, it is even more expensive to add motion blur to CGI [8][9][10][11], since the models must be rendered multiple times (in slightly different positions every time) for each frame of footage — a process known as sampling.

The more times you sample each frame, the richer and more convincing the motion blur will be.
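As a rough illustration of sampling, the sketch below 'renders' a one-pixel dot on a single row of pixels at several instants within one frame's exposure and averages the results. It is a toy model under simple assumptions (a linear move, evenly spaced samples, invented function names), not a description of any particular renderer.

```python
def render_dot(position, width):
    """Render one sharp sub-frame: a single lit pixel on a dark row."""
    return [1.0 if x == round(position) else 0.0 for x in range(width)]


def render_with_motion_blur(start, end, samples, width):
    """Average several sub-frame renders taken as the dot moves from start to end."""
    accumulated = [0.0] * width
    for i in range(samples):
        # Evenly spaced instants across the exposure, including both ends.
        t = i / (samples - 1) if samples > 1 else 0.0
        sub_frame = render_dot(start + t * (end - start), width)
        accumulated = [acc + px for acc, px in zip(accumulated, sub_frame)]
    return [acc / samples for acc in accumulated]


sharp = render_with_motion_blur(1, 5, samples=1, width=8)
blurred = render_with_motion_blur(1, 5, samples=5, width=8)

print(sharp)    # one fully lit pixel, no blur
print(blurred)  # the same energy smeared along the path of motion
```

With one sample the dot is frozen in place; with five, its brightness is spread along the path it travels. Each extra sample is another full render of the frame, which is where the cost comes from.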

Regarding AI-based frame interpolation, this leads us back to the problem we had with Robert De Niro in the Goodfellas clip: in this short AI-slomo from the conclusion of Jurassic Park, there is so much movement from T-Rex and his attackers that the slow-motion looks quite jerky, as if the effect had been achieved simply by slowing down the film:

The Growth of Motion Blur
Motion blur was a relative rarity in movies until the late 1960s, when the new wave of impoverished auteur film-makers were forced onto the streets and into natural lighting conditions, with limited access to expensive studio time.

Since high ASA stock was expensive and grainy at the time, the only way to get enough light onto the film was to lengthen each frame's exposure and accept increased motion blur as one of the consequences.

Shooting in color in these circumstances was yet more difficult, since color film was even less receptive to light than the black and white stock that was out of fashion by the 1970s. But in the end, film-makers actually succeeded in fetishizing technical limitations such as motion blur and lens flare [12].

Classic films like Casablanca often yield better results for AI frame interpolation than later movies, because the crisp movement, limited location shooting and strong studio-based lighting usually produce sharper reference key-frames from which to fabricate intermediary frames.

The 'Skinny Shutter' Fashion
One newer cinematic fetish has bucked the 'motion blur' trend. When Steven Spielberg adopted a range of 'skinny shutter' angles to give a gritty and staccato look to the battle sequences in Saving Private Ryan (1998) [13], he created a fashion for action sequences that has yet to abate [14][15].

The narrower the shutter angle is in a movie camera, the more it 'freezes' a moving subject in each frame:

Illustration of how a 'Skinny Shutter' (fig.3) causes the kind of jerkiness seen in Saving Private Ryan (1998). Image is by Brendan H. Banks at the Red Forum.
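The relationship between shutter angle and exposure is simple arithmetic: the shutter is open for (angle / 360) of each frame interval. The short sketch below works this out for a few illustrative angles at 24 fps; the function name and the chosen angles are mine, not taken from any production notes.

```python
from fractions import Fraction


def exposure_time(shutter_angle_degrees, fps):
    """Exposure per frame, in seconds, for a rotary shutter of the given angle."""
    return Fraction(shutter_angle_degrees, 360) / fps


for angle in (180, 90, 45):
    print(f"{angle} degrees at 24 fps -> {exposure_time(angle, 24)} s")
```

A conventional 180-degree shutter at 24 fps exposes each frame for 1/48 of a second, while a 45-degree shutter exposes it for only 1/192 of a second, leaving far less time for a moving subject to smear across the frame.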

However, all that jerky combat footage is incredibly suitable for AI frame interpolation, because it's practically free of motion blur:

The DAIN repository and the Windows DAIN-app are quite recent offerings. At the moment most of the user content being generated at the official Discord group [16] centers around up-scaling anime, gamer icons and other flat forms of animation.

Modest renders like that are popular because DAIN is such a resource hog; it requires the complete attention of a well-specced NVIDIA graphics card, and processing footage can take many hours, even for short clips. Running a full-length movie through DAIN is a distant prospect for most users.

In any case, AI frame interpolation is well-suited to 'staccato' source footage, including traditional cel animation…

…and for granting a little extra smoothness to the much loved output of legendary stop-motion animator Ray Harryhausen:

AI Slow-motion Can Have Trouble With Particles & Fluids, Fire, Hair, Explosions And Transitions
There are other inherent challenges for AI frame interpolation besides motion blur. The random nature of smoke, explosions, fire, fluids and fast-moving long hair can also cause some notable visual glitches in interpolated frames.

There is no default object recognition system in a frame interpolation AI that could help to distinguish these elements from any other collection of pixels, and the only remedy would seem to be to address the issues at the model training stage instead.

However, it is difficult to generate a model that's compact, well-optimized and generalized, yet still versatile enough to contend with unpredictable particle behavior.

In the next clip, we can see the waving hat of the boy disappearing into Robert De Niro's head or even disappearing entirely at certain points. The algorithm also has great trouble understanding how to handle the physics of the fast-moving fireworks in the background:

In this slomo of the 1961 eco sci-fi thriller The Day The Earth Caught Fire, the AI does its best to depict the flying water as expertly as it handles the many street revelers; but the action of the water is too fast, with too few frames:

When footage cuts abruptly from one shot to another, by default DAIN
treats the event as just another movement to interpret, leading to some
strange artifacts:

DAIN-app's GUI does offer access to a rudimentary scene detection feature, but it's buggy and unreliable at the moment (this is, after all, still alpha software). For now, it's easier to either process individual shots as discrete clips, or else edit out the strange transitions afterwards.
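If you want to split a clip into shots yourself before interpolating, one common and very simple approach is to flag a cut wherever two consecutive frames differ by more than some threshold. The sketch below shows the idea on tiny made-up grayscale 'frames'. It is an assumption-laden illustration, not DAIN-app's scene detection; the function names and the threshold of 60 are arbitrary.

```python
def mean_abs_difference(frame_a, frame_b):
    """Average absolute per-pixel difference between two equally sized frames."""
    return sum(abs(a - b) for a, b in zip(frame_a, frame_b)) / len(frame_a)


def find_cuts(frames, threshold=60):
    """Return each index i at which frames[i] appears to start a new shot."""
    return [
        i for i in range(1, len(frames))
        if mean_abs_difference(frames[i - 1], frames[i]) > threshold
    ]


def split_into_shots(frames, threshold=60):
    """Group frames into shots so that each shot can be interpolated on its own."""
    bounds = [0] + find_cuts(frames, threshold) + [len(frames)]
    return [frames[start:end] for start, end in zip(bounds, bounds[1:])]


frames = [
    [10, 12, 11, 10],      # dark shot
    [11, 12, 12, 10],
    [200, 190, 210, 205],  # abrupt cut to a bright shot
    [201, 191, 209, 204],
]

cuts = find_cuts(frames)
shots = split_into_shots(frames)
print(cuts)                        # [2]
print([len(shot) for shot in shots])  # [2, 2]
```

A fixed threshold like this will misfire on flashes, explosions and fast pans, so treat it as a starting point for manual checking rather than a reliable detector.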

Tackling Occlusion In Frame Interpolation
Occlusion is an ongoing Druidic mystery [17][18] for researchers into image recognition and image synthesis. It's a strange, brief and annoyingly unpredictable event that is different nearly every time, and thus very difficult to cater for in a model's training goals.

Besides the obvious occlusion artifacts in the Godfather Part II clip, you may already have noticed that Robert De Niro's face becomes strangely 'detached' when it's framed by his arms and the electrical lead in the Goodfellas clip, and that in the Casablanca clip, the waiter's head appears on both sides of Bogart's head, briefly. Such 'hiding' glitches are a relatively frequent occurrence that can slip by on first viewing.

Occlusion errors when faces and objects become isolated. Bogart's waiter has a very deep head, while De Niro has his face distorted in two movies.

Difficulty With Large, Flat Areas Of Color
It's been reported [19] in the DAIN community that the software can
produce colored artifacts in a large expanse of sky quite randomly, in
much the same way that video codecs sometimes have difficulty [20]
compressing broad areas of color cleanly. I can confirm this:

Applications For Machine Learning Frame Interpolation
As industry observers have noted [21], shooting genuine high-speed footage is the best route to effective and controllable slow motion effects.

Nonetheless, besides the obvious novelty of being able to do this kind
of thing with open source software, there are industry applications that can benefit from AI-based tweening, including in the animation [23] and games [24] industries, and for very high resolution slow motion videography in mobile devices, as well as for the reconstruction of damaged media [25].

Besides these possible outlets, frame interpolation is a core challenge [26] in the computer vision field, with collateral benefits for other sectors in media synthesis.

Using DAIN
DAIN can be downloaded from GitHub as a CLI implementation, with accompanying instructions, or as DAIN-app, a Windows executable (currently in alpha) that provides a GUI and an output console.

Running the installed DAIN-app calls up the console, which then launches the GUI. However, once a render is started, the GUI freezes, and can't be interacted with any longer (this is currently due to be fixed in the next version).

So it's the console that provides feedback on progress:

In the last line in the console image above, we can see the updates:

8%  | 	9/110  [03:42<38:18, 22.76s/it, file=0000000000008.png]

8% = Workload completion to date
9/110 = Current frame/total frames to render
03:42 = Time elapsed since start
38:18 = Estimated time to completion (101 remaining frames at 22.76 seconds each)
22.76s/it = How long, in seconds, one iterative pass is taking
file=0000000000008.png = Name of current frame being processed
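If you would rather log progress than watch the console, a line like this is easy to pick apart. The sketch below parses the sample line and recomputes the remaining time from the frame counts and the per-frame rate. The regular expression is written against this single sample, not against any documented DAIN-app output format, so treat the pattern and the function name as assumptions.

```python
import re

# Pattern fitted to the one sample line shown in this article (an assumption).
PROGRESS = re.compile(
    r"(?P<pct>\d+)%\s*\|\s*(?P<done>\d+)/(?P<total>\d+)\s*"
    r"\[(?P<elapsed>[\d:]+)<(?P<eta>[\d:.]+),\s*(?P<rate>[\d.]+)s/it"
)


def parse_progress(line):
    """Extract the counters from one console progress line."""
    match = PROGRESS.search(line)
    if match is None:
        raise ValueError("not a progress line")
    done, total = int(match["done"]), int(match["total"])
    rate = float(match["rate"])
    return {
        "percent": int(match["pct"]),
        "done": done,
        "total": total,
        "seconds_per_frame": rate,
        # Remaining work = frames still to render x seconds per frame.
        "remaining_seconds": (total - done) * rate,
    }


line = "8%  | \t9/110  [03:42<38:18, 22.76s/it, file=0000000000008.png]"
info = parse_progress(line)
minutes, seconds = divmod(int(info["remaining_seconds"]), 60)
print(f"{info['done']}/{info['total']} frames, about {minutes}:{seconds:02d} left")
```

The recomputed figure (101 frames at 22.76 seconds each) comes to a little over 38 minutes, matching the estimate in the console line.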

Estimating Render Time in DAIN
When a render starts, DAIN-app will take some minutes to render the initial small batch of frames, and only then will the console begin to provide estimates of how long the whole process will take.

These estimates may vary wildly depending on the difficulty of the most recent frames rendered. In general, though, the estimates are quite accurate - and, in most cases, shockingly long!

Importing a Clip Into DAIN
To start a new project, hit the 'Input File(s)' button and select your clip (until you know the capabilities of your system with DAIN, it's best to keep the clips to a few seconds in length).

Then click the 'Output folder' button and select a location for the render files and directories.

When the render starts, DAIN will create a templated folder structure within the output directory (though the 'output_videos' folder will not appear until the end of the rendering process):

'Interpolated Frames' - here, rather slowly, the rendered images will be deposited as a PNG dump once the render has begun:

Let's choose a frame-handling mode:

The choices are:

Mode 1: Default (all frames treated the same)
Mode 2: Remove duplicate frames (may alter animation speed)
Mode 3: Adaptive Record Timestamps then remove duplicate frames (won't alter animation speed)
Mode 4: Static Record timestamps then remove duplicate frames (won't alter animation speed).

Some of these choices, and several other options in DAIN-app, are concerned with processing flat, animation-based subjects, such as anime and very short animated icons.

For slow-motion work, whether for real-life or animated footage, Mode 1 or Mode 2 is usually fine.

Audio in DAIN

In a slow-motion project, the original source clip audio would be out of sync and quite useless in rendered output. The option to remux audio is unchecked in DAIN by default, so just leave it that way.

In any case, DAIN has no facilities to slow the audio down to match slow-motion output, never mind retaining the pitch.

Interpolation Options In DAIN

In the second tab, leave 'Depth Awareness Mode', 'Alpha Transparency' and 'Interpolation Algorithm' at their default settings.

The speedier 'experimental' second choice in 'Interpolation Algorithm' requires a very large amount of available VRAM (8-11GB), and produces more artifacts.

In the default 'Interpolation Algorithm' setting, the interpolation options are 2x, 4x and 8x.

This means that unless your source clip is exactly 15fps or 30fps, there is no way to obtain a 60fps frame-boost directly, because the 'Output FPS' field multiplies the source frame rate by fixed factors, and is not directly editable.

The 'Input FPS' field is editable, and will initially reflect the native frame rate of your imported clip. Changing this figure is your main point of control for slow-motion output.

So, for standard frame rates, these are the output FPS multiplications that you can obtain with the 0.38 release of DAIN-app:

15 fps x 2 = 30 fps output
15 fps x 4 = 60 fps output
15 fps x 8 = 120 fps output

23.97 fps x 2 = 47.9 fps
23.97 fps x 4 = 95.9 fps
23.97 fps x 8 = 191.8 fps

25 fps x 2 = 50 fps
25 fps x 4 = 100 fps
25 fps x 8 = 200 fps

29.97 fps x 2 = 59.94 fps
29.97 fps x 4 = 119.88 fps
29.97 fps x 8 = 239.76 fps
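The table is just the input rate multiplied by each fixed factor, so it is easy to regenerate or extend for other source rates. A minimal sketch, using 23.976 for the rate abbreviated above as 23.97 (the function name is my own):

```python
FACTORS = (2, 4, 8)  # the interpolation multipliers offered by the default algorithm


def output_fps(input_fps, factor):
    """Output frame rate for a given input rate and interpolation factor."""
    return round(input_fps * factor, 2)


for input_fps in (15, 23.976, 25, 29.97):
    row = ", ".join(
        f"{factor}x = {output_fps(input_fps, factor):g} fps" for factor in FACTORS
    )
    print(f"{input_fps:g} fps: {row}")

# Which of these standard rates can land exactly on 60 fps?
exact_60 = [fps for fps in (15, 23.976, 25, 29.97, 30)
            if any(output_fps(fps, f) == 60 for f in FACTORS)]
print(exact_60)  # [15, 30]
```

Only 15 fps and 30 fps sources reach 60 fps exactly; every other standard rate overshoots or undershoots.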

Over in the third, 'Misc Options' tab, we can fix this, and get a clean 60fps output.

In the field 'If FPS exceeds this value, create another version with this fps', type '60', and then tick one or more of the options underneath:

1: '(If FPS exceeds [FPS] Create a [FPS] version of movie'

This will result in two rendered output clips: one at a 'strange' frame rate (i.e. 119.88), and one resampled exactly to the FPS value that you typed in, such as 60 fps; but it is not desirable to waste so much render time on such a large discrepancy between output fps and a lower target fps.

2: '(If FPS exceeds [FPS] Interpolate down to [FPS] [Conf.1:smooth]

Ticked by itself, this will produce one single downsampled output movie with the smoothest movement possible.

3: '(If FPS exceeds [FPS] Interpolate down to [FPS] (Conf 2: Sharp)

Ticked by itself, this will produce one single downsampled output movie with the crispest detail possible.

Of course, it is possible to manipulate a higher frame rate than 60 fps, if you don't need to distribute via a popular channel such as YouTube (which doesn't support user-uploaded videos with frame rates above 60 fps).

However, the main reason to allow a higher output frame rate than 60fps is to enable the optional downsampling to 60 fps in DAIN.

If you want more granular control over the frame rate, find a configuration that gives you the nearest available overshoot of 60 fps and just import the PNG files from the 'interpolated_frames' folder into a video editor, and then set the image sequence to an interpretation of exactly 60 fps.

Whichever way you do it, some of the interpolated frames will be lost in the process.
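To see why frames are lost, consider the simplest possible resampling: for each output frame, keep whichever source frame is nearest in time. The sketch below does this for one second of 119.88 fps output brought down to 60 fps. It is not a description of DAIN-app's 'smooth' or 'sharp' options, whose internals are not covered here; the function name and the nearest-frame rule are assumptions made for illustration.

```python
def resample_indices(source_fps, target_fps, source_frame_count):
    """For each output frame, pick the index of the source frame nearest in time."""
    duration = source_frame_count / source_fps
    target_count = int(duration * target_fps)
    picks = []
    for n in range(target_count):
        timestamp = n / target_fps
        nearest = round(timestamp * source_fps)
        picks.append(min(nearest, source_frame_count - 1))
    return picks


picks = resample_indices(119.88, 60, source_frame_count=120)
print(len(picks), "of 120 interpolated frames kept")
print(picks[:6])
```

Roughly every second frame is discarded, which is the render time that the article describes as wasted when the output rate overshoots the target by a wide margin.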

Increased Frame Rates At Standard FPS speeds With DAIN

First, let's do what DAIN was designed for, primarily - upscaling the FPS of standard footage.

Here we see both the original and 60 fps interpolated version of a one-second clip from Things To Come (1936). To see the full effect, click the YouTube link in the bottom right and watch it in full-screen mode at YouTube:

Slow Motion With DAIN

So now let's try some slow-motion with DAIN. We'll import the Things To Come clip, and note that DAIN reports the clip's FPS as 23.976023976023978.

Now we can slow the clip down by a factor of two by dividing the native frame rate by two (23.976023976023978 / 2 = 11.988011988) and entering the result in the 'Input FPS' field.

Hit the 'Perform all steps: Render' button, and then bide your time as best you can.

The clip we will obtain will run at the original frame rate (23.97 fps), but for twice the length:

For the next render, we'll divide the native frame rate by 4 (5.99)
and set the DAIN Interpolate drop-down to 8x. This will give us the maximum amount of frame interpolation possible for modest and mid-range NVIDIA GPUs.

We can either enjoy the extra smoothness that results from DAIN's inflexible FPS calculator setting the frame rate at 47.9 fps, or we can re-interpret the interpolated PNGs in a video editor as suits us and set a different frame rate for the output.

If we create a version to match the original clip's 23.97 fps, the slow-motion will be less fluid, but twice as long:

Otherwise the 47.9 fps clip that DAIN has made for us will be shorter but smoother:
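The arithmetic behind both renders can be sketched in a few lines. The helper names are hypothetical, the frame count is approximated as source frames times the interpolation factor, and the first render is assumed to use the 2x setting, which the article implies but does not state.

```python
NATIVE_FPS = 23.976023976023978  # as reported by DAIN for the Things To Come clip


def slow_motion_plan(native_fps, slowdown, interpolation_factor):
    """Return (value to type into 'Input FPS', resulting output FPS)."""
    input_fps = native_fps / slowdown
    return input_fps, input_fps * interpolation_factor


def playback_seconds(source_seconds, native_fps, interpolation_factor, playback_fps):
    """Approximate running time of the interpolated frames at a chosen playback rate."""
    frames = source_seconds * native_fps * interpolation_factor
    return frames / playback_fps


# First render: half speed (2x interpolation assumed).
input_a, output_a = slow_motion_plan(NATIVE_FPS, slowdown=2, interpolation_factor=2)
print(round(input_a, 3), round(output_a, 3))  # 11.988 23.976

# Second render: native rate divided by 4, interpolated 8x.
input_b, output_b = slow_motion_plan(NATIVE_FPS, slowdown=4, interpolation_factor=8)
print(round(input_b, 3), round(output_b, 3))  # 5.994 47.952

# A one-second source clip from the second render, played two ways.
smooth = playback_seconds(1, NATIVE_FPS, 8, output_b)     # at 47.952 fps
longer = playback_seconds(1, NATIVE_FPS, 8, NATIVE_FPS)   # at 23.976 fps
print(round(smooth, 2), round(longer, 2))  # 4.0 8.0
```

The same set of frames lasts four seconds at 47.952 fps or eight seconds at the original 23.976 fps, which is the 'shorter but smoother' versus 'less fluid, but twice as long' trade-off described above.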

Pausing and Resuming a DAIN Render Session

Transforming even a full source minute of video on DAIN can tie up your computer for quite a while, no matter how much VRAM you have available. Luckily, it is possible to break such epic tasks into sessions in DAIN.

Though there is no elegant way of pausing a render so that it can be resumed later, at least in the alpha 0.38 release, you can accomplish this by selecting some text in the console and waiting a few minutes while DAIN finishes its current frame. The rendering process will then halt.

You can then either hit Esc and resume the render, or close the app and resume it later.

To resume a render, you will need to check the 'Resume Render' radio button near the top of the interface.

You'll now have access to the 'Render Folder' button, which launches an 'Open' dialog.

Navigate to your clip's work folder (created by DAIN as soon as the rendering began) and select and open its config.json file, which contains the settings for the paused render.

Load this into DAIN and just trust that all the settings have been restored, because, at the moment, there's no real feedback of that fact in the GUI.

Sample content of a config.json file:

{"alphaMethod": 0, "audioVersion": 0, "checkSceneChanges": 1, "cleanCudaCache": 1, "cleanInterpol": 0, "cleanOriginal": 1, "debugKeepDuplicates": 0, "doIntepolation": 1, "doOriginal": 1, "doVideo": 1, "exportPng": 0, "fillMissingOriginal": 0, "fps": 23.975806231893273, "framerateConf": 2, "inputMethod": 1, "inputType": 1, "interpolationAlgorithm": 0, "interpolationMethod": 0, "loop": 0, "maxResc": 0, "model": "./model_weights/best.pth", "onlyRenderMissing": 0, "outFolder": "C:/Users/UserName/Desktop/DAIN/Project/___001 Slow Motion article/Godfather PT II/The Godfather Clip 000001 De Niro Walks Against Fireworks/", "outStr": "C:/Users/UserName/Desktop/DAIN/Project/___001 Slow Motion article/Godfather PT II/The Godfather Clip 000001 De Niro Walks Against Fireworks//The Godfather Clip 000001 De Niro Walks Against Fireworks.mp4", "palette": 0, "quiet": 0, "resc": 0, "sceneChangeSensibility": 10, "splitFrames": 1, "splitPad": 130, "splitSize": 370, "use60": 1, "use60C1": 0, "use60C2": 0, "use60RealFps": 60, "useAnimationMethod": 0, "useWatermark": 0, "video": "C:/Users/UserName/Desktop/DAIN/Project/___001 Slow Motion article/Godfather PT II/The Godfather Clip 000001 De Niro Walks Against Fireworks.mp4"}
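Since the GUI gives no feedback about what was restored, it can be reassuring to inspect config.json yourself before resuming. The sketch below parses a shortened copy of the sample above and prints a few settings. The interpretation of each key (for example, that 'splitSize' and 'splitPad' correspond to Section Size and Section Padding, or that 'use60' is the extra-version option) is my guess from the key names, not documented behavior.

```python
import json

# A shortened copy of the sample above; a real file would be read with open().
sample = (
    '{"fps": 23.975806231893273, "checkSceneChanges": 1, '
    '"sceneChangeSensibility": 10, "splitFrames": 1, "splitSize": 370, '
    '"splitPad": 130, "use60": 1, "use60RealFps": 60}'
)


def summarize(config):
    """Return a few human-readable lines about a saved render's settings."""
    lines = [f"Input FPS: {config['fps']:.3f}"]
    # Key meanings below are guessed from their names (assumption).
    if config.get("splitFrames"):
        lines.append(
            f"Frame splitter: size {config['splitSize']}, padding {config['splitPad']}"
        )
    if config.get("use60"):
        lines.append(f"Extra version at {config['use60RealFps']} fps")
    return lines


config = json.loads(sample)
summary = summarize(config)
print("\n".join(summary))
```

To check a real project, replace the `sample` string with the contents of the config.json file in your clip's work folder.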

Hit the 'Step 2: Feed Source Frames to DAIN' button (DAIN has already extracted the PNGs, so step 1 is redundant):

The render will continue.

When the interpolations are complete, you'll have to hit the 'Step 3: Convert DAIN frames to video' button in order to get your assembled clip/s.

Avoiding Out Of Memory (OOM) Errors in DAIN

Only the highest-spec graphics cards are going to waltz through a 1080p interpolation session on DAIN. The rest of us will OOM real soon after first installing the program.

Luckily, DAIN provides memory management workarounds for lower-spec VRAM allowances:

The two options in the 'Fix OutOfMemory Options' tab are 'Downscale video' and 'Split frames into sections'.

The first option is not that popular in the DAIN community, and also is better handled by resizing the clip in a dedicated program prior to a DAIN session.

The second option is the Frame Splitter, which cuts each frame into a division of your choice, and then reassembles the processed chunks into a single frame before moving on to the next, using overlap render areas to ensure smooth joins (see 'Section Padding' below).

Credit: The official DAIN-app Discord server

DAIN recommends that you set clean and even splits, but doesn't provide any automated split method based on the reported resolution of the imported clip.

Instead you must type in a single Section Size figure, for example 1280 / 4 = 320. So you would overwrite the default '480' with '320'.

In practice, trying to implement a totally even split is a bit of a pain in the neck, and sometimes pointless. For instance, I established by trial and error that the maximum split size that my computer can handle before OOM is 370, which is higher than the 320 that makes a perfect 4-way split out of 1280.

In either case in the above illustration, DAIN will not need to handle more than 4 rendering events per frame, and it may even handle the second one a little more quickly.

Hopefully DAIN will eventually get a dedicated optimal split calculation tool. In the meantime, the official template for splitting a 1920x1080 clip is:

1080 (1x res height)
960 (1/2 res width)
640 (1/3 res width)
540 (1/2 res height)
480 (1/4 res width)
384 (1/5 res width)
360 (1/3 res height)
320 (1/6 res width)
270 (1/4 res height)
240 (1/8 res width)
216 (1/5 res height)
192 (1/10 res width)
180 (1/6 res height)
160 (1/12 res width)
135 (1/8 res height)
128 (1/15 res width)
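Until such a tool exists, a list like this can be generated for any resolution by finding the section sizes that divide the width or height exactly. The following is a minimal sketch; the function names are invented, and the limits (`max_parts=15`, `min_size=128`) are assumptions chosen so that a 1920x1080 input reproduces the template above, apart from the full width of 1920, which the template omits.

```python
def exact_split_sizes(width, height, max_parts=15, min_size=128):
    """List section sizes that divide the frame width or height exactly."""
    sizes = {}
    for axis, length in (("width", width), ("height", height)):
        for parts in range(1, max_parts + 1):
            if length % parts == 0 and length // parts >= min_size:
                sizes.setdefault(length // parts, f"1/{parts} res {axis}")
    return sorted(sizes.items(), reverse=True)


def largest_exact_size(width, height, limit):
    """Largest exact split that stays at or below a known out-of-memory limit."""
    candidates = [size for size, _ in exact_split_sizes(width, height) if size <= limit]
    return max(candidates) if candidates else None


for size, label in exact_split_sizes(1920, 1080):
    print(size, f"({label})")

# For a 1280x720 clip on a system that runs out of memory above 370:
print(largest_exact_size(1280, 720, limit=370))  # 360
```

For a 1280x720 clip with a trial-and-error ceiling of 370, this suggests 360 (half the height), with 320 (a quarter of the width) as the next option down.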

Section Padding

The second max value you'll need to establish by trial and error is Section Padding. Section Padding creates an overlap border of render area for each slice, so that the joins are less visible after assembly.

The lower this value is, the more you're going to see video artifacts where the slices reveal themselves - particularly with fast-moving action in the video. So try the default 150 and work your way down below the OOM threshold, but still prioritizing the Section Size.

General Requirements for DAIN-app

At the time of writing, DAIN-app was on version 0.38, with some notable improvements in the pipeline.

Besides Windows 10, DAIN requires a 5.0+ CUDA GPU with a compute rating of at least 6.1. Minimum recommended VRAM is 8gb.

The original DAIN GitHub repository requires Ubuntu, while there is some hope that a more generic Windows 10 app could be developed via Docker.

The examples in this post were rendered on an Asus UX550GX ZenBook laptop with 16gb RAM, Intel Core i7-8750H processor, and a NVIDIA GeForce GTX 1050ti with a mere 4gb of GDDR5 SDRAM (VRAM). On this modest set-up, I was able to multiply 1.25 seconds of 24fps 720p video by a factor of 8 every hour.

For a 4-second clip, that's about five hours of processing time.

According to the official DAIN Discord community, these are the frame-splitting settings you can expect to need, according to your hardware (each pair appears to be section size / section padding):

8GB of VRAM

1080p : 540~640 / 200
720p : 360~640 / 200
640p and below : no need

6GB of VRAM

1080p : 384 / 150
720p : 360 / 150
480p: No need

4GB of VRAM

1080p : ??????
720p : 360~384 / 100
480p: 480 / 150
360p: No need

VRAM estimates (when not using the frame splitter)

360p Uses 2 to 4 GB VRAM
480p Uses 5 to 6 GB VRAM
720p Uses 10~11 GB VRAM
1080p Uses 18~19 GB VRAM
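These estimates can be turned into a quick check of whether a clip is likely to need the frame splitter on a given card. The sketch below uses the upper figure of each range above; the function name and the rule itself (compare the upper estimate with available VRAM) are my own simplification, not an official DAIN-app calculation.

```python
# Upper end of each community VRAM estimate above, in GB.
VRAM_UPPER_ESTIMATE_GB = {"360p": 4, "480p": 6, "720p": 11, "1080p": 19}


def needs_frame_splitter(resolution, available_vram_gb):
    """Guess whether a clip will exceed available VRAM without frame splitting."""
    return VRAM_UPPER_ESTIMATE_GB[resolution] > available_vram_gb


for vram in (4, 6, 8):
    needing = [res for res in VRAM_UPPER_ESTIMATE_GB if needs_frame_splitter(res, vram)]
    print(f"{vram} GB: splitter likely needed for {', '.join(needing)}")
```

This simple rule agrees with the community table: on a 4GB card only 360p runs unsplit, while 6GB and 8GB cards can also handle 480p without splitting.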

______________________________________________________________________________

This is an extended version of an article first published on LinkedIn on 15th May 2020.

Further Resources

Depth-Aware Video Frame Interpolation (PDF)

The founding paper from the initial developers of DAIN, detailing the methodology and concepts behind the system.

DAIN-APP Playground – YouTube playlist

The originators of DAIN have assembled a really terrific collection of 169 videos around image synthesis and interpolation.

A Comprehensive Guide To The State-Of-The-Art of Machine Learning For The Visual Effects Industry

Another VFX article by me. This one covers deepfakes, environment and object generation, rotoscoping, motion capture, pose estimation, color grading, in-betweening for animation and match-move, normal and depth map generation, and the road ahead for the automated VFX pipeline.

Research Guide for Video Frame Interpolation with Deep Learning

An in-depth look at frame interpolation from September 2019, covering Optical Flow Estimation, Deep Voxel Flow, a Bidirectional Predictive Network, PhaseNet, Super SloMo, Depth-Aware Video Frame Interpolation
and GAN approaches, among other aspects.

Video Frame Interpolation and Extrapolation (PDF)

A fascinating academic insight into an autoencoder-based interpolation model from Stanford University.

Super slow motion effect for free

Guide to installing and using the Super-SloMo deep learning framework.

MMSR (GitHub)

PyTorch-based open source framework with interpolation capabilities from folding in the ESRGAN project.

NVIDIA

Deep Learning For Slow Motion Video (PDF)

From the current industry leaders in the field, this slide-based PDF plays out the practical challenges of frame interpolation.

Research at NVIDIA: Transforming Standard Video Into Slow Motion with AI

The sensational 2018 NVIDIA show-reel for the company's research into AI-assisted interpolation.

_______________________________________________