Learn how this algorithm can understand images and automatically remove an undesired object or person, saving your future Instagram post!
You've most certainly experienced this situation once: you take a great picture with your friend, and someone is photobombing behind you, ruining your future Instagram post. Well, that's no longer an issue. Whether it's a person or a trash can you forgot to remove before taking your selfie that's ruining your picture, this AI will automatically remove the undesired object or person in the image and save your post. It's just like a professional Photoshop designer in your pocket, and with a simple click!
This task of removing part of an image and replacing it with what should appear behind it has been tackled by many AI researchers for a long time. It is called image inpainting, and it's extremely challenging. Learn more in the video!
► Complete article: https://www.louisbouchard.ai/lama/
► Suvorov, R., Logacheva, E., Mashikhin, A., Remizova, A., Ashukha, A., Silvestrov, A., Kong, N., Goka, H., Park, K. and Lempitsky, V., 2022. Resolution-robust Large Mask Inpainting with Fourier Convolutions. In Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision (pp. 2149-2159).
► Code: https://github.com/saic-mdal/lama
► Colab Demo: https://colab.research.google.com/github/saic-mdal/lama/blob/master/colab/LaMa_inpainting.ipynb
► Product using LaMa: https://cleanup.pictures/
► Fourier Domain explained by the great @3Blue1Brown:
► Great in-depth explanation of LaMa with the authors by @Yannic Kilcher:
► My Newsletter (A new AI application explained weekly to your emails!): https://www.louisbouchard.ai/newsletter/
You've most certainly experienced this situation once: you take a great picture with your friend, and someone is photobombing behind you, ruining your future Instagram post. Well, that's no longer an issue. Whether it's a person or a trash can you forgot to remove before taking your selfie that's ruining your picture, this AI will automatically remove the undesired object or person in the image and save your post. It's just like a professional Photoshop designer in your pocket, with a simple click.

This task of removing part of an image and replacing it with what should appear behind it has been tackled by many AI researchers for a long time. It's called image inpainting, and it's extremely challenging. As you will see, the paper I want to show you achieves it with incredible results and can do it easily in high definition, unlike previous approaches you may have heard of before. You definitely want to stay until the end of the video: you won't believe how great and realistic it looks for something produced in a split second by an algorithm.

As I said, this task of image inpainting is basically removing unwanted objects from your images. You should be doing the same in your work life and remove any friction. Your next step as an AI professional or student to do that should be to do like me and try the sponsor of today's episode, Weights & Biases. If you run a lot of experiments, you should be using Weights & Biases. It will remove all painful steps, from hyperparameter tuning to results analysis, with a handful of lines of code added, and it's entirely free for personal usage. It takes not even five minutes to set up, and you don't have anything else to do, forever. Talking about removing friction points, I don't think you can do better than that. Weights & Biases has everything you need for your code to be reproducible without you even trying. For your well-being, do like me and give Weights & Biases a try for free with the first link below.
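If you're curious what those few lines look like, here is a minimal sketch of typical Weights & Biases instrumentation; the project name, config values, and the stand-in loss below are made up for illustration:

```python
import wandb

# Start a tracked run; the project name and config here are made up.
wandb.init(project="inpainting-experiments", config={"lr": 1e-3, "batch_size": 16})

for epoch in range(10):
    train_loss = 1.0 / (epoch + 1)  # stand-in for a real training step
    wandb.log({"epoch": epoch, "train_loss": train_loss})  # sent to the dashboard
```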
To remove an object from an image, the machine needs to understand what should appear behind the subject. Doing this would require a three-dimensional understanding of the world, as humans have, but it doesn't have that: it just has access to a few pixels in an image, which is why it's so complicated, whereas it looks quite simple to us, who can simply imagine the depth and guess that there should be the rest of the wall here, the window there, etc. We basically need to teach the machine how the world typically looks.

We will do that using a lot of examples of real-world images so that it can have an idea of what our world looks like in the two-dimensional picture world, which is not a perfect approach but does the job. Then another problem comes with the computational cost of using real-world images with way too many pixels. To fix that, most current approaches work with low-quality images, a downsized version of the image that is manageable for our computers, and upscale the inpainted part at the end to replace it in the original image, making the final results look worse than they could be, or at least not great enough to be shared on Instagram and get all the likes you deserve. You can't really feed it high-quality images directly, as it would take way too much time to process and train. Or can you?
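To see why that round trip hurts, here is a quick sketch of the downscale-then-upscale pipeline most prior approaches rely on, using PyTorch's interpolation (the sizes are arbitrary):

```python
import torch
import torch.nn.functional as F

# A stand-in for a high-resolution photo: batch of 1, 3 channels, 2048x2048.
hires = torch.rand(1, 3, 2048, 2048)

# What most prior approaches do: inpaint at a manageable size...
lowres = F.interpolate(hires, size=(512, 512), mode="bilinear", align_corners=False)
# ... then upscale the result back to the original resolution.
restored = F.interpolate(lowres, size=(2048, 2048), mode="bilinear", align_corners=False)

# The round trip discards high-frequency detail, which is why the
# upscaled inpainted region tends to look blurrier than its surroundings.
print((hires - restored).abs().mean())  # non-zero reconstruction error
```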
Well, these are the main problems the researchers attacked in this paper, and here's how. Roman Suvorov et al. from Samsung Research introduced a new network called LaMa that is quite particular, as you can see. In image inpainting, you will typically send the initial image as well as what you'd like to remove from it. This is called a mask and will cover the image, as you can see here, and the network won't have access to this information anymore, as it needs to fill in those pixels. It then has to understand the image and try to fill in the same pixels with what it thinks should fit best. So in this case, they start like any other network and downscale the image, but don't worry: their technique will allow them to keep the same quality as a high-resolution image. This is because, in the processing of the image, they use something a bit different than usual.
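Concretely, the network receives the masked-out image stacked with the mask itself, so it knows where to inpaint but truly cannot see the removed pixels. A minimal sketch of that input preparation (the sizes and mask region are arbitrary):

```python
import torch

image = torch.rand(1, 3, 256, 256)   # RGB image in [0, 1]
mask = torch.zeros(1, 1, 256, 256)   # binary mask: 1 = region to remove
mask[:, :, 100:180, 90:200] = 1.0    # e.g. a box around the photobomber

# Zero out the masked pixels so the network truly has no access to them,
# then concatenate the mask so the network knows *where* to inpaint.
masked_image = image * (1.0 - mask)
net_input = torch.cat([masked_image, mask], dim=1)  # shape: (1, 4, 256, 256)
```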
Typically, we can see different networks here in the middle, mostly convolutional neural networks. Such networks are often used on images due to how convolutions work, which I explained in other videos, like the one appearing on the top right of your screen if you are interested in how it works. In short, the network works in two steps. First, it will compress the image and try to only save relevant information. The network will end up conserving mostly the general information about the image, like its color, overall style, or the general objects appearing, but not precise details. Then, it will try to reconstruct the image using the same principles but backward. We use some tricks like skip connections, which save information from the first few layers of the network and pass it along to the second step so that it can orient it towards the right objects. In short, the network easily knows that there's a tower with a blue sky and trees, called global information, but it needs the skip connections to know that it's the Eiffel Tower in the middle of the screen, that there are clouds here and there, that the trees have these colors, etc.: all the fine-grained details, which we call local information.
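To make the two-step idea concrete, here is a toy encoder-decoder with a single skip connection. This is not LaMa's actual architecture (LaMa replaces this design, as explained next), just a minimal sketch of compression, reconstruction, and a skip re-injecting local detail:

```python
import torch
import torch.nn as nn

class TinyUNet(nn.Module):
    """Minimal encoder-decoder with one skip connection (illustration only)."""
    def __init__(self):
        super().__init__()
        self.enc = nn.Conv2d(4, 32, 3, stride=2, padding=1)   # compress: keep global info
        self.mid = nn.Conv2d(32, 32, 3, padding=1)
        self.dec = nn.ConvTranspose2d(32, 16, 4, stride=2, padding=1)  # reconstruct
        self.out = nn.Conv2d(16 + 4, 3, 3, padding=1)         # skip re-injects local detail

    def forward(self, x):
        e = torch.relu(self.enc(x))
        m = torch.relu(self.mid(e))
        d = torch.relu(self.dec(m))
        # Skip connection: concatenate the raw input so fine details
        # (exact positions, textures) survive the compression step.
        return self.out(torch.cat([d, x], dim=1))

net = TinyUNet()
out = net(torch.rand(1, 4, 256, 256))  # 4 channels: masked RGB + mask
print(out.shape)                       # torch.Size([1, 3, 256, 256])
```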
Following a long training with many examples, we expect our network to reconstruct the image, or at least a very similar image that contains the same kind of objects and is very similar if not identical to the initial image. But remember: in this case, we are working with low-quality images that we need to upscale, which will hurt the quality of the results. The particularity here is that instead of using convolutions as in regular convolutional networks, and skip connections to keep local knowledge, it uses what we call the fast Fourier convolution, or FFC. This means that the network works in both the spatial and frequency domains and doesn't need to get back to the early layers to understand the context of the image. Each layer works with convolutions in the spatial domain to process local features and uses Fourier convolutions in the frequency domain to analyze global features.
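Here is a much-simplified sketch of the FFC's global branch as the paper describes it: go to the frequency domain with a real FFT, apply a pointwise convolution on the stacked real and imaginary parts, and come back. The real FFC adds normalization and a parallel local spatial branch; this strips it down to the core idea:

```python
import torch
import torch.nn as nn

class GlobalFourierBranch(nn.Module):
    """Sketch of the FFC's global branch: convolve in the frequency domain
    so every output pixel sees the whole image (simplified from the paper)."""
    def __init__(self, channels):
        super().__init__()
        # 1x1 conv applied to the stacked real/imaginary parts of the spectrum.
        self.freq_conv = nn.Conv2d(channels * 2, channels * 2, kernel_size=1)

    def forward(self, x):
        b, c, h, w = x.shape
        spec = torch.fft.rfft2(x, norm="ortho")          # to frequency domain
        spec = torch.cat([spec.real, spec.imag], dim=1)  # (b, 2c, h, w//2+1)
        spec = torch.relu(self.freq_conv(spec))          # "convolution" on frequencies
        real, imag = spec.chunk(2, dim=1)
        return torch.fft.irfft2(torch.complex(real, imag), s=(h, w), norm="ortho")

x = torch.rand(1, 32, 64, 64)
y = GlobalFourierBranch(32)(x)
print(y.shape)  # torch.Size([1, 32, 64, 64])
```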
The frequency domain is a bit special, and I linked a great video covering it in the description below if you are curious. It basically transforms your image into all possible frequencies, just like sound waves, and tells you how much of each frequency the image contains. So each new pixel of this newly created image represents a frequency covering the whole spatial image and how much of it is present, instead of colors. The frequencies here are just repeated patterns at different scales. For example, one of these frequency pixels could be highly activated by vertical lines at a specific distance from each other. In this case, it could be the same distance as the length of a brick, so it will be highly activated if there is a brick wall in the image. From this, you'd understand that there's probably a brick wall, with a size proportional to how much it is activated. You can repeat this for all pixels activated by similar patterns, giving you good hints about the overall aspect of the image, but nothing about the objects themselves or the colors: the spatial domain will take charge of this.
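You can verify this intuition in a few lines: a striped "brick wall" pattern lights up exactly one frequency coefficient, no matter where in the image the stripes sit. A small NumPy sketch:

```python
import numpy as np

# A synthetic "brick wall": vertical stripes repeating every 16 pixels.
x = np.arange(128)
image = np.tile((np.sin(2 * np.pi * x / 16) > 0).astype(float), (128, 1))

# 2D FFT: each coefficient measures how much of one repeated pattern
# (one frequency) is present across the *whole* image.
spectrum = np.abs(np.fft.fft2(image))
spectrum[0, 0] = 0  # ignore the constant (average brightness) term

# The strongest coefficient sits at horizontal frequency 128/16 = 8,
# i.e. "a pattern repeating 8 times across the image width".
print(np.unravel_index(spectrum.argmax(), spectrum.shape))  # -> (0, 8)
```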
So doing convolutions on this new Fourier image allows you to work with the whole image at each step of the convolution process. The network has access to a much better global understanding of the image even at early layers, without much computational cost, which is impossible to achieve with regular convolutions in the spatial domain. Then both global and local results are saved and sent to the next layer, which repeats these steps. You end up with the final image, which you can upscale back. The use of the Fourier domain is what makes it scalable to bigger images, as the image resolution doesn't affect the Fourier domain, since it uses frequencies over the whole image instead of colors, and the repeated patterns it's looking for will be the same whatever the size of the image. This means that even when training this network with small images, you will be able to feed it much larger images afterward and get amazing results.
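This is easy to see with the Fourier-branch sketch from earlier: because it is fully convolutional and the FFT simply covers whatever image it is given, nothing pins down the input size, so the same module and weights run unchanged at a much larger resolution (reusing the hypothetical GlobalFourierBranch class defined above):

```python
import torch

# Reusing the GlobalFourierBranch sketch from earlier: the same weights
# handle both resolutions, since no layer depends on the input size.
branch = GlobalFourierBranch(32)
small = branch(torch.rand(1, 32, 64, 64))    # training-time resolution
large = branch(torch.rand(1, 32, 512, 512))  # much larger image at inference
print(small.shape, large.shape)              # both come back at full size
```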
As you can see, the results are not perfect, but they are quite impressive, and I'm excited to see what they will do next to improve them. Of course, this was just a simple overview of this new model, and you can find more detail about the implementation in the paper linked in the description below. You can also implement it yourself with the code linked below as well. I hope you enjoyed the video, and if so, please take a second to share it with a friend who could find this interesting. It will mean a lot and help this channel grow. Thank you for watching!