This new Facebook AI model can translate or edit the text in an image into the language of your choice. Not only that, the model ensures that the translated text follows the same font and style as the original image.
Learn more about it in the video.
►Read the full article: https://www.louisbouchard.ai/textstylebrush/
►Praveen Krishnan, Rama Kovvuri, Guan Pang, Boris Vassilev, and Tal
Hassner, Facebook AI, (2021), "TextStyleBrush: Transfer of text
aesthetics from a single example",
https://scontent.fymq3-1.fna.fbcdn.net/v/t39.8562-6/10000000_944085403038430_3779849959048683283_n.pdf?_nc_cat=108&ccb=1-3&_nc_sid=ae5e01&_nc_ohc=Jcq0m5jBvK8AX9p0hND&_nc_ht=scontent.fymq3-1.fna&oh=ab1cc3f244468ca196c76b81a299ffa1&oe=60EF2B81
►Dataset Facebook AI made:
https://github.com/facebookresearch/IMGUR5K-Handwriting-Dataset?fbclid=IwAR0pRAxhf8Vg-5H3fA0BEaRrMeD21HfoCJ-so8V0qmWK7Ub21dvy_jqgiVo
00:00
Imagine you are on vacation in another country where you do not speak the language.
00:05
You want to try out a local restaurant, but their menu is in a language you don't speak.
00:09
I think this won't be too hard to imagine, as most of us have already faced this situation.
00:14
Whether it's menu items or directions, you can't understand what's written.
00:19
Well, in 2020, you would take out your phone and use Google Translate on what you see.
00:24
In 2021, you don't even need to open Google Translate anymore and type in what you
00:29
see word by word to translate it.
00:31
Instead, you can simply use this new model by Facebook AI to translate all the text in
00:36
the image into your own language!
00:38
Of course, as you can see here, this is not the first application of this technology,
00:43
but even this is cool.
00:45
What is even cooler is that their translation tool actually uses technology similar to
00:50
deepfakes to change the words in an image, following the same style as the original words!
00:55
It can copy the style of a text from any picture using a single word as an example!
01:01
Just like this...
01:02
This is amazing for photo-realistic language translation in augmented reality.
01:06
This is only the first model trained on a new dataset they released for this task, and
01:12
it is already quite impressive!
01:14
This could be amazing for video games or movies as you will be able to translate the text
01:19
appearing on buildings, posters, signs, etc. super easily,
01:23
making the immersion even more personalized and convincing for everyone based on the chosen
01:29
language without having to manually photoshop each frame or completely remake scenes.
01:34
As you can see, it also works with handwriting using a single word as well.
01:39
Its ability to generalize from a single word example and copy its style is what makes this
01:43
new artificial intelligence model so impressive.
01:46
Indeed, it understands not only the typography and calligraphy of the text, but also the
01:50
scene in which it appears.
01:53
Whether it's on a curved poster or against different backgrounds.
01:56
Typical text-transfer models are trained in a supervised manner with one specific style
02:01
and use images with text segmentation.
02:03
Meaning that you need to know what every pixel in the picture is, whether it is text
02:08
or not, which is very costly and complicated to obtain.
02:11
Instead, they use a self-supervised training process where the style and the segmentation
02:15
of the text aren't given to the model during training.
02:18
Only the actual word content is given.
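To make this difference concrete, here is a minimal sketch in Python of what a training example might look like in each setting. The field names are illustrative assumptions, not the actual dataset schema or the paper's code.

```python
from dataclasses import dataclass
import numpy as np

@dataclass
class SupervisedSample:
    """What a typical supervised text-transfer setup needs (hypothetical fields)."""
    image: np.ndarray       # H x W x 3 photo containing text
    pixel_mask: np.ndarray  # H x W map marking which pixels are text -- costly to label
    style_id: int           # index of one of the known, fixed styles

@dataclass
class SelfSupervisedSample:
    """TextStyleBrush-style setup: only the word content is annotated."""
    image: np.ndarray       # cropped word image from the scene
    word: str               # e.g. "MENU" -- the only label the model sees
```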
02:21
I said that they released a dataset for this model and that it was able to do that with
02:26
only one word.
02:27
This is because the model first learns a generalized way of accomplishing the task on this new
02:32
dataset with many examples during training.
02:35
This dataset contains approximately 9,000 images of text on different surfaces with
02:40
only the word annotations.
02:42
Then, it uses the new word from the input image to learn its style in what we call a
02:47
"one-shot-transfer" manner.
02:49
This means that from only one image example containing the word to be changed, it will
02:54
automatically adjust the model to fit this exact style for any other words.
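As a rough illustration of this one-shot transfer workflow, here is a short Python sketch: encode the style once from a single word crop, then render any new strings in that style. All function and method names here are assumptions made for the example, not the released API.

```python
import numpy as np

def translate_in_style(model, word_crop: np.ndarray, new_words):
    """Copy the style of one example word image and render new words with it."""
    style = model.encode_style(word_crop)               # one image example is enough
    results = []
    for word in new_words:
        content = model.encode_content(word)            # the string we want to write
        results.append(model.generate(style, content))  # new patch, same look and feel
    return results

# e.g. translating a menu item while keeping its appearance (hypothetical usage):
# patches = translate_in_style(model, crop_of_word, ["cheese", "bread"])
```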
02:59
As you know, the goal here is to disentangle the content of the text appearing in an image
03:04
from its style, and then to apply this style to new text and put it back into the image.
03:09
This process of disentangling the text from the actual image is learned in a self-supervised
03:14
manner, as we will see in a minute.
03:16
In short, we take an image as input and create a new image with only the text translated.
03:22
Doesn't it feel similar to the task of taking a picture of your face and only changing specific
03:26
features of it to match another style, like the video I just covered did on hairstyles?
03:31
If you remember, I said that it is very similar to how deepfakes work.
03:36
So what could be better for the job than StyleGAN2, one of the best models for generating
03:42
images from another image?
03:44
Now, let's get into how it achieves this, in other words, the training process.
03:49
They train this model by measuring its performance on these unlabeled images using a pre-trained
03:55
typeface classification network and a pre-trained text recognition network.
04:00
This is why we say it learns in a self-supervised manner: it doesn't have direct access to labels
04:05
or ground truth about the input images.
04:08
This, coupled with a realism measure that compares the generated image containing the new text
04:14
to the input image, allows the model to be trained without the kind of supervision
04:18
where we tell it exactly what is in the image, while still aiming for photo-realistic and accurate
04:21
text results.
04:22
Both these networks will tell us how close the generated text is to what it is supposed
04:27
to be by first detecting the text in the image, which will be our ground truth, and then comparing
04:33
the new text with what we wanted to write and its font with the original image's text
04:39
font.
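Putting the pieces together, here is a rough PyTorch-style sketch of how such a self-supervised training signal could be assembled from a frozen text recognizer, a frozen typeface classifier, and a realism term. The loss weights, interfaces, and names are assumptions for illustration, not the paper's exact formulation.

```python
import torch
import torch.nn.functional as F

def training_loss(generated, source_image, target_word_ids,
                  text_recognizer, typeface_classifier, discriminator,
                  w_rec=1.0, w_font=1.0, w_real=1.0):
    # 1) Content: the frozen, pre-trained recognizer should read the target word
    #    in the generated image.
    char_logits = text_recognizer(generated)                 # (T, num_chars) logits
    loss_rec = F.cross_entropy(char_logits, target_word_ids)

    # 2) Style: the frozen, pre-trained typeface classifier should extract the same
    #    font features from the generated image and the original word image.
    loss_font = F.l1_loss(typeface_classifier(generated),
                          typeface_classifier(source_image))

    # 3) Realism: an adversarial term pushes the output to look like a real photo.
    loss_real = F.softplus(-discriminator(generated)).mean()

    return w_rec * loss_rec + w_font * loss_font + w_real * loss_real
```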
04:40
Using these two already-trained networks allows the StyleGAN-based image generator to be trained
04:45
on images without any prior labels.
04:48
Then, the model can be used at inference time, or in other words, in the real world, on any
04:53
image without the two other networks we discussed, only sending the image through the trained
04:59
StyleGAN-based network which generates the new image with the modified text.
05:05
It will achieve its translation by understanding the style of the text and the content separately.
05:10
The style comes from the actual image, while the content is the identified string and
05:15
the string to be generated.
05:16
Here, the "understanding" process I just referred to is an encoder for each, here shown in green,
05:23
compressing the information into general information that should accurately represent what we really
05:28
want from this input.
05:30
Then, both of these encoded representations are sent into the StyleGAN-based image generator,
05:35
shown in blue, at different steps depending on the level of detail needed.
05:40
Meaning that the content and style are both sent in at the first step, because the text needs to be translated.
05:44
Then, we will force the style in the generated image by iteratively feeding it into the network
05:50
at multiple steps with optimal proportions learned during training.
05:55
This allows the generator to control low to high-resolution details of the text appearance,
06:01
instead of being limited to low-resolution details, as it would be if we only sent this style
06:06
information once at the input, as is typically done.
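To illustrate that last point, here is a toy PyTorch sketch of a StyleGAN-like generator in which the content and style enter together at the first step and the style is then re-injected at every resolution block. Layer sizes and module names are made up for illustration; this is not the actual TextStyleBrush architecture.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class StyledBlock(nn.Module):
    """One upsampling block that re-injects the style at its resolution."""
    def __init__(self, in_ch, out_ch, style_dim):
        super().__init__()
        self.conv = nn.Conv2d(in_ch, out_ch, 3, padding=1)
        self.to_scale = nn.Linear(style_dim, out_ch)   # style modulates the features

    def forward(self, x, style):
        x = F.interpolate(x, scale_factor=2)           # move up one resolution
        x = self.conv(x)
        scale = self.to_scale(style)[:, :, None, None]
        return x * (1 + scale)                         # style controls details at this scale

class ToyTextGenerator(nn.Module):
    def __init__(self, content_dim=128, style_dim=128):
        super().__init__()
        # Content and style enter together at the very first, coarsest step.
        self.start = nn.Linear(content_dim + style_dim, 256 * 4 * 4)
        self.blocks = nn.ModuleList([StyledBlock(256, 128, style_dim),
                                     StyledBlock(128, 64, style_dim),
                                     StyledBlock(64, 32, style_dim)])
        self.to_rgb = nn.Conv2d(32, 3, 1)

    def forward(self, content, style):
        x = self.start(torch.cat([content, style], dim=1)).view(-1, 256, 4, 4)
        for block in self.blocks:      # the style is fed again at every resolution,
            x = block(x, style)        # shaping coarse-to-fine text appearance
        return self.to_rgb(x)          # e.g. a 32 x 32 patch with the new word
```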
06:08
Of course, there are more technical details in order to adapt everything and make it work,
06:13
but I will let you read their great paper linked in the description below if you would
06:17
like to learn more about how they achieved this on the more technical side of things.
06:22
I also wanted to mention that they openly shared some issues with complex scenes where
06:27
illumination or color changes caused problems, hurting realism, just like in other GAN-based
06:33
applications I previously covered, such as transferring your face into a cartoon or changing the
06:38
background of an image.
06:39
It's crucial and super interesting to see the limitations as they will help to accelerate
06:44
research.
06:45
To end on a more positive note, this is only the first paper attacking this complex task
06:50
with this level of generalization, and it is already extremely impressive.
06:54
I cannot wait to see the next versions!
06:57
As always, thank you for watching, and many thanks to Rebekah Hoogenboom for the support
07:02
on Patreon!