The phenomenon of AI-powered face-swapping seemed set to corrupt, deceive and threaten society in terrible ways and across multiple sectors.
The last time I reviewed the state of the art in AI VFX, deepfake technology was barely a year into existence. It was predicted, even at the state level, that deepfake output would eventually become indistinguishable from reality.
At a few years' sober distance, it could now be argued that deepfake videos – at least as far as the term refers to the porn-centric DeepFaceLab and the slightly more sober FaceSwap project – are set for as truncated an entry in the history of image synthesis as the fax machine holds in the history of communications: they're showing the way, but they're not fit to finish the journey.
Hyper-realistic face swapping may well be coming - but it's probably coming from elsewhere.
That's not to say that the consumer-facing architectures of DFL and FaceSwap couldn't be deconstructed, adapted and re-tooled for professional VFX purposes, or their core technologies re-imagined in more capable workflows. Rather, these distributions (both of which are forks of the original, now-abandoned 2017 GitHub code) are off-the-shelf solutions aimed at hobbyists with access to gaming-level, rather than production-level, GPU and pipeline resources; and for these and other reasons (as we'll see), the core code cannot meet the general public's expectation that deepfake quality will improve exponentially over time.
DeepFaceLab (left) currently seems to be the popular choice in production environments, though it features a wizard-driven approach via BAT files. By contrast, FaceSwap (right) has an integrated GUI.
Later, we'll hear from a leading VFX industry researcher about some of the technical bottlenecks and architectural constraints that prevent current deepfake approaches from achieving greater realism. We'll also take a look at some initiatives that seek to extend current approaches to face simulation, or re-imagine them entirely.
First, to understand the limitations of deepfake technology, we need to have a basic understanding of the process - so let's take a look at that.
A deepfake from March 2021 by TikTok user NextFace, which inserts modern singer/songwriter Billie Eilish into the 1944 Judy Garland musical 'Meet Me In St. Louis'. (https://www.tiktok.com/@nextface/video/6936881472313806085)
In a deepfakes workflow, a face-set is extracted from video clips for both the source subject (the person you wish to replace) and the target subject (the person you wish to eventually appear substituted into the video footage). The resultant face-set typically contains 5,000-10,000 images.
The deepfake software, in this case FaceSwap, traverses video clips looking for faces – any faces – and saves those it finds into a face-set.
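The extraction pass is conceptually simple, even if the real detectors are not. As a rough sketch of the principle (using OpenCV's bundled Haar cascade purely for illustration; FaceSwap and DFL ship far stronger neural detectors such as S3FD and MTCNN), a 'grab every face in the clip' pass might look like this:

```python
# Illustrative only: the FOSS packages use stronger detectors, but the
# principle is the same - walk the video, detect faces, save the crops.
import os
import cv2

def extract_faces(video_path, out_dir, every_nth=1):
    os.makedirs(out_dir, exist_ok=True)
    detector = cv2.CascadeClassifier(
        cv2.data.haarcascades + "haarcascade_frontalface_default.xml")
    cap = cv2.VideoCapture(video_path)
    frame_idx, saved = 0, 0
    while True:
        ok, frame = cap.read()
        if not ok:
            break
        if frame_idx % every_nth == 0:
            gray = cv2.cvtColor(frame, cv2.COLOR_BGR2GRAY)
            # detectMultiScale returns (x, y, w, h) boxes for *any* face-like region,
            # which is why the raw face-set fills up with strangers and false positives
            for (x, y, w, h) in detector.detectMultiScale(gray, 1.1, 5):
                cv2.imwrite(os.path.join(out_dir, f"face_{saved:06d}.png"),
                            frame[y:y + h, x:x + w])
                saved += 1
        frame_idx += 1
    cap.release()
    return saved

# e.g. extract_faces("garland_clip.mp4", "faceset_raw/")
```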
Though the software can be guided to look for a particular face based on a reference image, this functionality does not work well enough to be useful. Instead the user must weed out not only the non-target faces found by the facial recognition algorithm, but also the inevitable false positives that the program dumps into the dataset.
The clip from the 1944 Judy Garland movie contains many people, and here the software has dumped them all into the face-set indiscriminately.
The facial identification algorithm can be set to varying levels of sensitivity, but in nearly all cases it gets fooled by non-human patterns, textures and shapes, leading to many total mismatches during the extraction process.
The software can sort the results by face, testing the similarity of the facial landmarks until the face candidates are grouped quite effectively, enabling the quick deletion of non-relevant matches.
Python sorts the jumble of faces by identity similarity.
With the faces sorted by identity, it’s easier to remove non-relevant matches.
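What 'sorting by identity' amounts to, computationally, is ordering the crops so that similar faces sit next to each other. The sketch below is an illustrative stand-in rather than FaceSwap's own sort tool, and assumes the third-party face_recognition library for its embeddings:

```python
import os
import numpy as np
import face_recognition  # third-party library, used here purely for illustration

def sort_by_identity(face_dir):
    """Greedy nearest-neighbour ordering of face crops by embedding distance,
    so that same-identity faces end up adjacent and easy to prune in bulk."""
    paths, encodings = [], []
    for name in sorted(os.listdir(face_dir)):
        image = face_recognition.load_image_file(os.path.join(face_dir, name))
        found = face_recognition.face_encodings(image)
        if found:  # crops with no recoverable face are left out (likely false positives)
            paths.append(name)
            encodings.append(found[0])
    if not paths:
        return []
    order, remaining = [0], set(range(1, len(paths)))
    while remaining:
        last = encodings[order[-1]]
        nearest = min(remaining, key=lambda i: np.linalg.norm(encodings[i] - last))
        order.append(nearest)
        remaining.remove(nearest)
    return [paths[i] for i in order]
```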
The program uses the FAN facial landmark detection system developed by research scientist Adrian Bulat.
The alignments will frequently need adjusting for profile shots, or for acute angles.
FAN align usually performs poorly on profile shots and acute angles. Therefore it's necessary to intervene on all affected frames manually, which can be labor-intensive for longer clips.
Nearby faces are also an issue:
When FAN align identifies two faces close to each other, it frequently 'hedges its bets' about which face a mouth belongs to, leading to grotesque alignments.
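Bulat's detector is available as the open-source face_alignment Python package, which can be tried directly; note that, depending on the installed version, the landmark enum is spelled LandmarksType.TWO_D or LandmarksType._2D. A minimal usage sketch:

```python
# Minimal sketch of running FAN landmark detection via the 'face_alignment' package.
# Depending on the library version, use LandmarksType.TWO_D or LandmarksType._2D.
import face_alignment
from skimage import io

fa = face_alignment.FaceAlignment(face_alignment.LandmarksType.TWO_D, device="cpu")

image = io.imread("faceset_raw/face_000001.png")
landmark_sets = fa.get_landmarks(image)  # one (68, 2) array per detected face, or None

if landmark_sets is None:
    print("No face found - this frame would need manual alignment")
else:
    for points in landmark_sets:
        print("68-point landmark set; first jawline point at", points[0])
```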
The masking feature follows the path of the alignment landmarks, but does not account for obstructing objects. Therefore masking/occlusion algorithms have been manually trained over time by the FaceSwap community and integrated into the package (whereas DFL features trainable masks, which are effective, but much less user-friendly).
Where the occlusion masking fails, manual clean-up of each frame is necessary.
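At its simplest, a landmark-following mask is just the filled convex hull of the 68 detected points, which is also why it cannot know about hands, sunglasses or microphones passing in front of the face. A rough OpenCV sketch (not the FaceSwap implementation):

```python
import cv2
import numpy as np

def hull_mask(image_shape, landmarks, blur=15):
    """Fill the convex hull of a (68, 2) landmark array. Everything inside the
    hull is treated as 'face', so unmodelled obstructions get pasted over -
    hence the need for trained occlusion masks or manual clean-up."""
    mask = np.zeros(image_shape[:2], dtype=np.uint8)
    hull = cv2.convexHull(landmarks.astype(np.int32))
    cv2.fillConvexPoly(mask, hull, 255)
    if blur:
        mask = cv2.GaussianBlur(mask, (blur, blur), 0)  # soften the seam
    return mask
```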
After this, the curated source and target face-sets are trained on an encoder/decoder architecture, and not (as is popularly reported) a GAN model. However, DFL now features an optional GAN element for enhancement, though many users complain of the grain that it can generate.
During the training, each face-set's pixels are abstracted into mathematical parameters and facets in the 'latent space' of a shared autoencoder, which proceeds, at length — sometimes for weeks — to take apart the facial identities and learn how to reconstruct them, constantly grading its own success on the latest attempt, and learning from the scores (i.e. the estimated loss rates) that it assigns to the results of these attempts.
Unlike the shared encoder, each decoder unit (on the right in the image below) is dedicated to just one of the two identities; but because both decoders are trained against the same shared latent representation, either one can be 'rewired' at swap time to take the encoding of identity A and generate images of identity B.
Note the way that the identity information can 'cross over', to the right of the latent space.
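The 'cross-over' is easier to see in code than in prose. The toy PyTorch sketch below is not the actual FaceSwap/DFL model (those are convolutional and far larger), but it shows the essential layout: one shared encoder, one decoder per identity, and a swap that is nothing more than routing identity A's encoding through identity B's decoder:

```python
import torch
import torch.nn as nn

IMG = 64 * 64 * 3   # toy resolution; real models are convolutional and much bigger
LATENT = 512

class ToyAutoencoder(nn.Module):
    def __init__(self):
        super().__init__()
        # One encoder shared by both identities...
        self.encoder = nn.Sequential(nn.Flatten(), nn.Linear(IMG, LATENT), nn.ReLU())
        # ...and one decoder per identity.
        self.decoder_a = nn.Sequential(nn.Linear(LATENT, IMG), nn.Sigmoid())
        self.decoder_b = nn.Sequential(nn.Linear(LATENT, IMG), nn.Sigmoid())

    def forward(self, x, identity):
        z = self.encoder(x)                                   # shared latent space
        decoder = self.decoder_a if identity == "a" else self.decoder_b
        return decoder(z).view(-1, 3, 64, 64)

model = ToyAutoencoder()

# Training: each identity is reconstructed through its *own* decoder.
faces_a = torch.rand(8, 3, 64, 64)
loss_a = nn.functional.l1_loss(model(faces_a, "a"), faces_a)

# Swapping: the same faces, decoded as identity B.
swapped = model(faces_a, "b")
```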
The initial ten minutes of training, as the model first learns the basic lineaments of the two subjects and begins the long task of refining the computed mappings between the features of each subject.
All of the available model variations for FOSS deepfake software are derived (however distantly) from the original 2017 code, though some model types produce higher-resolution output at the cost of requiring more GPU VRAM.
It is possible to use more than one GPU simultaneously for training, but with diminishing returns, because, according to the developers, the only benefit of parallel multi-GPU training is increased batch sizes (the number of images processed at any one time in the latent space of the model) – and at very high batch sizes, the developers contend, the model becomes worse at learning details, sabotaging the intent.
Some of the more sophisticated models have dozens of potential parameters. Minor changes to any of these settings can add quality or reduce training time - or result in days of lost effort, since experimentation is time-consuming, and good results from one setting may not transfer to the demands of different face-sets in subsequent projects.
As the 'predicted accuracy' loss rates descend during training, the swaps become more authentic. However, loss rates don’t directly signify success, and the user must evaluate for themselves, from the quality of the previews and through periodic test-swaps, whether the model is sufficiently trained.
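In practice, the 'loss rate' the user watches is just the reconstruction error of each decoder on its own identity, which is why a low number is necessary but not sufficient evidence of a good swap. A toy training step, continuing the sketch model from earlier:

```python
import torch
import torch.nn as nn

# Continues the toy two-decoder 'model' sketched earlier in this article.
optimizer = torch.optim.Adam(model.parameters(), lr=5e-5)

def train_step(batch_a, batch_b):
    """One optimisation step: each identity is reconstructed by its own decoder,
    and the reported 'loss rates' are simply these two reconstruction errors."""
    optimizer.zero_grad()
    loss_a = nn.functional.l1_loss(model(batch_a, "a"), batch_a)
    loss_b = nn.functional.l1_loss(model(batch_b, "b"), batch_b)
    (loss_a + loss_b).backward()
    optimizer.step()
    # Falling losses mean better *reconstruction*; only a preview swap,
    # e.g. model(batch_a, "b"), shows whether the transfer itself convinces.
    return loss_a.item(), loss_b.item()
```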
At that point, the trained algorithms and the alignments and masks that were produced at the start can be used, at last, to superimpose a realistic 'target' face onto the source video clip (see earlier image of the Garland/Eilish swap).
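Conceptually, this conversion pass re-uses everything gathered earlier: each aligned source face is encoded, decoded as the target identity, and blended back into the frame under the stored mask. A rough sketch, assuming the toy model and the hull_mask helper from the earlier sketches (real converters add color matching, warping and seam correction):

```python
import cv2
import numpy as np
import torch

def convert_frame(frame, aligned_face, landmarks, box):
    """Swap one frame: decode the aligned source face with the target decoder,
    then composite it back into the frame under the landmark-derived mask.
    'model' and 'hull_mask' are the toy sketches from earlier in this article."""
    x, y, w, h = box
    face = cv2.resize(aligned_face, (64, 64)).astype(np.float32) / 255.0
    tensor = torch.from_numpy(face).permute(2, 0, 1).unsqueeze(0)
    with torch.no_grad():
        out = model(tensor, "b")[0].permute(1, 2, 0).numpy()
    out = cv2.resize((out * 255).astype(np.uint8), (w, h))
    mask = hull_mask((h, w), landmarks - np.array([x, y]))[..., None] / 255.0
    region = frame[y:y + h, x:x + w].astype(np.float32)
    frame[y:y + h, x:x + w] = (mask * out + (1 - mask) * region).astype(np.uint8)
    return frame
```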
The most common haunts for non-porn deepfakes are ads, art projects, social satire and indie curiosities – as well as the Sassy Justice channel from South Park creators Trey Parker and Matt Stone (and actor Peter Serafinowicz), which mocked Donald Trump in a series of well-rendered deepfake skits immediately prior to the 2020 US election.
As regards mainstream releases for FOSS deepfake output, Framestore's chief creative officer Tim Webber told the FT in 2019 that deepfake software provided 'a few seconds' of a de-aged Bill Nighy in that year's Pokémon Detective Pikachu, though this (apparently) official debut of FOSS deepfake software in movie production went largely unnoticed – arguably a compliment to the technique.
Or not. "It required some fixes," Framestore Executive Creative Director William Bartlett said to VFXVoice, "but deep fakes suited this need because of the nature of what would ultimately be presented on screen."
Additionally, Terminator: Dark Fate director Tim Miller was reported to feel that some of the CG actor renderings for the movie (released four months after Pokémon Detective Pikachu) needed 'something extra', and directed the VFX department to use deepfake software on some existing CGI head replacements.
In 2020 a more extensive use of open source deepfake software emerged, when documentarian David France sought volunteers to 'donate' their faces as proxies for the at-risk subjects interviewed for Welcome to Chechnya.
‘Welcome To Chechnya’ (2020) used a home-spun deepfake algorithm to protect the identities of the at-risk subjects featured. Seven ‘face donors’ volunteered their likenesses for the project.
Ryan Laney, the visual effects producer for the documentary, described the custom framework developed for the project to me as 'an encryption key for the subjects' identities', though he would not release extensive details of it.
From the moment that deepfake technology came to the attention of the world via a (now shuttered) Reddit group late in 2017, any headlines related to it that were not about porn or politics centered on the comparative shortcomings of Hollywood's own attempts at face simulation, and lionized the latest and greatest effort from a new tranche of celebrity deepfakers, such as Ctrl Shift Face and Belgian VFX artist and deepfaker Christopher Ume.
At the time of writing, one of the most recent deepfake sensations was a collaboration between Ume and Tom Cruise impersonator Miles Fisher, which exceeded 11 million views on TikTok, among other platforms.
During the video's publicity run, NBC cited an image forensics professor at Berkeley as saying that the fake Cruise video is 'a big step forward in the development of this technology'.
However, notwithstanding what is clearly a phenomenal data-gathering, curation and compositional effort on the part of Ume, the DFL/FaceSwap workflow still shows 'the usual suspects' at this level of scrutiny, due to limitations that are beyond any user's control:
In the first image, there's a kind of demilitarized zone between the hairline and forehead, where the skin is unnaturally smoothed out. The wrinkles themselves are over-simplified and sketchy, and, in the second picture, the masking algorithm has failed to perform occlusion on the sunglasses as they come off 'Cruise's' face.
In the third image, the lineaments of the face's alpha channel fail at one of deepfakes' toughest hurdles – the jawline – and continuity also breaks around the cheekbone area.
Where the problem is textural realism (rather than compositing issues), what can be done to improve output? Besides getting lucky in the current VRAM famine, and throwing more Tensor cores at the problem, actually not much.
"To make deepfakes work for production," says Dan Ring, "training times need to be significantly faster to meet delivery deadlines. But it's difficult to improve the performance of deepfakes architecture by throwing resources at it."
Based in Dublin, Ring is Head of Research for UK visual effects software company Foundry, with a deep interest in the potential of machine learning, and some experience of trialing FOSS deepfake software to establish its limits and possibilities.
"Hooking up a bank of Tesla V100s for a deepfakes set-up may get you faster training," he says, "but as the resolution of the input imagery and complexity of the model increases, you start to hit new bottlenecks.
"With that kind of parallel processing across GPUs in multiple machines, it's like sending ten smart people to solve a very hard problem and only letting them meet once a week. Although there's been some impressive work in this [regard], currently there's just no practical orchestration mechanism to make that work for production."
Ring says that ideally one would need a single latent space operating on a GPU that is faster and has higher levels of VRAM than anything that's on the market now, in order to push the boundaries of what's currently possible in the standard open source deepfake packages.
"This means that to reach the quality level needed for VFX today, you have to be clever, making choices about what kind of angles and expressions to prioritize."
Ring also notes that there isn't a 1:1 relationship between the resolution of the input images and the plausibility of the output image.
"At some point," he says, "you become limited by the model size, complexity or architecture. This means additional post-processing is needed to raise the fidelity, either by an artist's hand, or more likely by a secondary network trained to remove errors in the performance transfer and enhance facial detail."
Machine learning is reported to play a significant role in Digital Domain's mysterious Charlatan system, which was employed to bring legendary NFL coach Vince Lombardi back to life for a spot at Super Bowl LV, and to add decades to David Beckham for an appeal to solve the world's malaria crisis.
In research released in the summer of 2020, Disney Research Labs seems to have overcome the quality limits on the FOSS packages by abstracting the core principles of deepfake architecture into an original new system, and adding progressive training.
(https://www.youtube.com/watch?v=yji0t6KS7Qo)
The system can train more than two facial identities simultaneously, and is capable of delivering megapixel output, with radically improved temporal stability.
The new architecture, called High Resolution Neural Face Swapping (HRNFS), centers on a domain-transfer approach, wherein images from any number of identities in a single database are embedded into the shared latent space via a common encoder, and mapped back into pixel space using the decoder that's particular to whichever identity is desired for the swap.
The visualized design of the network inspired Disney's researchers to call this a 'comb network', wherein each identity-locked decoder is one of the 'teeth'.
The fact that the system can train multiple identities without pairing them to any of the others makes for a drastic reduction in training time, according to the researchers.
Swaps are much better from a Disney model trained on eight identities (right) rather than two (middle).
The reported training experiments used the Adam optimizer to train a 1024×1024 network for two identities, which took three days on a single GeForce 1080 Ti GPU. The 1080 Ti is considered a 'mid-level' consumer card in the deepfake community, which is more accustomed to measuring training time in weeks, and whose software can train only two identities at a time, rather than the multiple identities that Disney's network can handle.
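In code terms, the comb layout simply generalizes the two-decoder design described earlier to N identities hanging off one shared encoder. The sketch below is a toy approximation rather than Disney's published architecture (which adds progressive training and much deeper convolutional stacks):

```python
import torch
import torch.nn as nn

class CombNetwork(nn.Module):
    """Toy approximation of a 'comb' face-swap network: one shared encoder,
    and one decoder 'tooth' per identity in the training set."""
    def __init__(self, num_identities, img_dim=64 * 64 * 3, latent=512):
        super().__init__()
        self.encoder = nn.Sequential(nn.Flatten(), nn.Linear(img_dim, latent), nn.ReLU())
        self.decoders = nn.ModuleList(
            [nn.Sequential(nn.Linear(latent, img_dim), nn.Sigmoid())
             for _ in range(num_identities)])

    def forward(self, x, identity_index):
        z = self.encoder(x)                      # all identities share one latent space
        return self.decoders[identity_index](z)  # decode as the requested identity

comb = CombNetwork(num_identities=8)
batch = torch.rand(4, 3, 64, 64)
as_identity_3 = comb(batch, 3)   # render whoever is in 'batch' as identity 3
```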
It could be argued that Disney's innovation is equivalent to creating an 'iron horse' rather than inventing the combustion engine, since it improves upon existing conceptual pipelines instead of clearing the map for a novel solution to the problem. The comparison tests published in the research paper show some of the same problems detailed earlier, such as unconvincing wrinkles in older face subjects.
Comparison of Disney's new architecture to the original Deepfakes code, DFL, and the method outlined in 2017 in a study from the Institute For Robotics And Intelligent Systems at USC.
We'll have to wait for production deployment (i.e., use in a movie) to see whether or not HRNFS represents a quantum leap in AI-based identity synthesis, or just a more efficient way to race towards the stubborn limits of deepfakes' data architecture.
However, the use of multiple trained identities to improve swap-quality for a single identity is a major advance, within those constraints.
'When amateur deepfakes on YouTube rival movie VFX face-replacements, the directors and studios start to ask about AI tech!'
So says the Academy Award-winning visual effects supervisor Scott Stokdyk, also the VFX lead on Tom Hanks' AI outing BIOS (2021).
'For now,' he told me, 'it seems that the deep fake work cannot get to 100% quality consistently on every shot, and that there is no current way with that methodology to get from 90% to 100%. It can be part of the toolset, but shouldn't be relied on for an entire project that has many different scenarios.'
But Stokdyk also envisages a rather different application of deepfake technology than the cinematic proliferation we've been waiting for in vain for the last three years.
'One area that interests me,' he said, 'is that the same technology that is starting to be used to detect deep fakes should also be repurposed to detect more general synthetic qualities in VFX images.
'These machine learning algorithms can look at image histograms or different representations of images (i.e. in the frequency or HSV domains) and should be able to nudge the images toward [being] more photoreal. It is still difficult in many situations to have all-CG VFX look real, and I think machine learning will help with this problem.'
Not every amazing new technology develops consistently, and some don't really develop at all: the combustion engine; the incandescent light bulb; the refrigerator; and the toilet, to name but a few of those innovations that made only a single evolutionary leap.
The current architecture of popular deepfake software (and in most professional use cases, that's DeepFaceLab, even if VFX producers play the fact down, perhaps to avoid being associated with deepfake porn) has changed very little in the three years that it's been fueling catastrophic headlines. What has changed is that deepfake practitioners have had years to thoroughly scope the limits of the process, and to develop tortuous workarounds for the endless blind alleys of the software. Likewise the VFX industry.
So is something better brewing in the labs of the world's leading visual effects houses? All the VFX producers I talked to in the course of writing this article (not all of whom are named here) who were prepared to admit which software they were using told me that it was DeepFaceLab - the same free package used to make the Tom Cruise viral fakes and deepfake porn - though admittedly their dataset acquisition and curation resources are far in advance of those of most amateur users.
It's possible that truly powerful identity substitution will eventually come from machine learning methods that are currently far more rudimentary, such as AI-generated image synthesis. When that comes to pass, with the power to recreate entire people instead of hunting round for 'body donors' and 'similar' faces - yes, our faith in the moving image as a token of truth will be truly threatened; and we'll remember the FOSS deepfakes era with the same amusement as the doomed VR boom of the early 1990s - as an early preview of a technology whose time had not yet come.
Martin Anderson is a freelance writer on machine learning and AI.
This article was originally published at https://rossdawson.com/futurist/implications-of-ai/the-limited-future-of-deepfakes/