Make-A-Scene is not “just another DALL·E”. The goal of this new model isn’t to allow users to generate random images following a text prompt as DALL·E does, which is really cool but restricts the user’s control over the generations.
Instead, Meta wanted to push creative expression forward, merging this text-to-image trend with previous sketch-to-image models, leading to “Make-A-Scene”: a fantastic blend between text and sketch-conditioned image generation. Learn more in the video...
►Read the full article: https://www.louisbouchard.ai/make-a-scene/
►Meta's blog post: https://ai.facebook.com/blog/greater-creative-control-for-ai-image-generation
►Paper: Gafni, O., Polyak, A., Ashual, O., Sheynin, S., Parikh, D. and Taigman, Y., 2022. Make-A-Scene: Scene-based text-to-image generation with human priors.
►My Newsletter (A new AI application explained weekly to your emails!): https://www.louisbouchard.ai/newsletter/
0:00
[Music]
0:06
This is Make-A-Scene. It's not just another DALL·E. The goal of this new model isn't to allow users to generate random images following a text prompt, as DALL·E does, which is really cool but restricts the user's control over the generations. Instead, Meta wanted to push creative expression forward, merging this text-to-image trend with previous sketch-to-image models, leading to Make-A-Scene: a fantastic blend between text- and sketch-conditioned image generation. This simply means that using this new approach, you can quickly sketch out a cat and write what kind of image you would like, and the image generation process will follow both the sketch and the guidance of your text. It gets us even closer to being able to generate the perfect illustration we want in a few seconds.
0:52
You can see this multimodal generative AI method as a DALL·E model with a bit more control over the generations, since it can also take in a quick sketch as input. This is why we call it multimodal: it can take multiple modalities as inputs, like text and an image (a sketch in this case), compared to DALL·E, which only takes text to generate an image. Multimodal models are something super promising, especially if we match the quality of the results we see online, since we have more control over the results, getting closer to a very interesting end goal: generating the perfect image we have in mind without any design skills.
1:30
Of course, this is still in the research stage and is an exploratory AI research concept. It doesn't mean what we see isn't achievable; it just means it will take a bit more time to get to the public. The progress is extremely fast in the field, and I wouldn't be surprised to see it live very shortly, or a similar model from other people to play with. I believe such sketch- and text-based models are even more interesting, especially for the industry, which is why I wanted to cover it on my channel, even though the results are a bit behind those of DALL·E 2 we see online.
2:03
And it's not only interesting for the industry, but for artists too. Some use the sketch feature to generate even more unexpected results than what DALL·E could do. We can ask it to generate something and draw a form that doesn't represent that specific thing, like drawing a jellyfish in a flower shape, which may not be impossible to get with DALL·E but is much more complicated without sketch guidance, as the model will only reproduce what it learned from, which comes from real-world images and illustrations.
2:32
So the main question is: how can they guide the generations with both a text input, like DALL·E, and a sketch simultaneously, and have the model follow both guidelines? Well, it's very, very similar to how DALL·E works, so I won't go too much into the details of generative models, as I covered at least five different approaches in the past two months, which you should definitely watch if you haven't yet, as these models like DALL·E 2 or Imagen are quite fantastic.
3:00
Typically, these models take millions of training examples to learn how to generate images from text, with data in the form of images and their captions scraped from the internet. Here, during training, instead of only relying on the caption to generate a first version of the image, comparing it to the actual image, and repeating this process numerous times with all our images, we will also feed it a sketch.
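To make that concrete, here is a minimal, hypothetical sketch of what one training example becomes once the sketch signal is added. The class and field names are mine, not the paper's code; the point is simply that the usual image-caption pair is extended with a segmentation map that plays the role of the sketch during training.

```python
# Hypothetical illustration of a Make-A-Scene-style training example:
# the usual (image, caption) pair from web-scraped data is extended with a
# segmentation map that stands in for the user's sketch during training.
from dataclasses import dataclass
import torch

@dataclass
class TrainingExample:
    caption: str            # text scraped alongside the image
    image: torch.Tensor     # target RGB image, shape (3, H, W)
    seg_map: torch.Tensor   # per-pixel class ids, shape (H, W): the "sketch" signal

def build_example(caption: str, image: torch.Tensor, segmenter) -> TrainingExample:
    """`segmenter` is any pre-trained network returning a per-pixel label map;
    the generated image is later compared against `image` during training."""
    seg_map = segmenter(image)
    return TrainingExample(caption=caption, image=image, seg_map=seg_map)
```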
3:26
What's cool is that the sketches are quite easy to produce for training: simply take a pre-trained network you can download online and perform instance segmentation. For those who want the details, they use a freely available VGG model pre-trained on ImageNet, so a quite small network compared to those of today, super accurate and fast, producing results like this, called a segmentation map. They simply process all their images once and get these maps for training.
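As a rough illustration of that preprocessing step (not the paper's actual pipeline): the network choice below is an assumption, using torchvision's pre-trained DeepLabV3 as a readily available stand-in for the segmentation model described above, and the file path is hypothetical.

```python
# Minimal sketch: turn a photo into a per-pixel segmentation map for training.
# Uses torchvision's pre-trained DeepLabV3 as a stand-in segmentation network
# (requires a recent torchvision, >= 0.13, for the `weights=` argument).
import torch
from torchvision import transforms
from torchvision.models.segmentation import deeplabv3_resnet50
from PIL import Image

model = deeplabv3_resnet50(weights="DEFAULT").eval()  # downloadable, pre-trained

preprocess = transforms.Compose([
    transforms.ToTensor(),
    transforms.Normalize(mean=[0.485, 0.456, 0.406],
                         std=[0.229, 0.224, 0.225]),
])

def segmentation_map(path: str) -> torch.Tensor:
    """Return an (H, W) tensor of class ids: the 'map' used to guide generation."""
    img = Image.open(path).convert("RGB")
    batch = preprocess(img).unsqueeze(0)          # (1, 3, H, W)
    with torch.no_grad():
        logits = model(batch)["out"]              # (1, num_classes, H, W)
    return logits.argmax(dim=1).squeeze(0)        # per-pixel class id

# seg = segmentation_map("photo_of_a_cat.jpg")    # hypothetical file path
```

Running this once over every training image gives the maps the narration mentions; at inference time, the user's hand-drawn sketch takes their place.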
3:55
The model then uses this map, as well as the caption, to orient the generation of the initial image. At inference time, or when one of us uses it, our sketch will replace those maps. As I said, they used a model called VGG to create these fake sketches for training.
4:11
They use a transformer architecture for the image generation process, which is different from DALL·E 2, and I invite you to watch the video I made introducing transformers for vision applications if you'd like more details on how they can process and generate images. This sketch-guided transformer is the main difference in Make-A-Scene, along with not using an image-text ranker like CLIP to score text-image pairs, which you can also learn about in my DALL·E video.
4:39
Instead, all the encoded text and segmentation maps are sent to the transformer model. The model then generates the relevant image tokens, which are encoded and decoded by the corresponding networks, mainly to produce the image. The encoder is used during training to calculate the difference between the produced and initial image, but only the decoder is needed to take this transformer output and turn it into an image.
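Here is a schematic, toy sketch of that token pipeline, assuming the general recipe described above: encoded text and segmentation-map tokens condition a transformer, which samples image tokens one by one, and a separate decoder turns those tokens into pixels. All sizes, module choices, and the randomly initialised layers are placeholders for illustration; this is not Make-A-Scene's actual architecture or code.

```python
# Toy, schematic version of the generation flow: (text tokens + map tokens)
# -> transformer -> image tokens -> decoder -> pixels.
import torch
import torch.nn as nn

VOCAB_TEXT, VOCAB_SEG, VOCAB_IMG = 1000, 160, 8192   # toy vocabulary sizes
D_MODEL, IMG_TOKENS = 256, 64                         # toy width, 8x8 token grid

text_emb = nn.Embedding(VOCAB_TEXT, D_MODEL)
seg_emb = nn.Embedding(VOCAB_SEG, D_MODEL)
img_emb = nn.Embedding(VOCAB_IMG, D_MODEL)
backbone = nn.TransformerEncoder(
    nn.TransformerEncoderLayer(D_MODEL, nhead=8, batch_first=True), num_layers=2)
to_logits = nn.Linear(D_MODEL, VOCAB_IMG)
# Stand-in for the image decoder (e.g. a VQ-style decoder) mapping tokens to pixels.
decoder = nn.Sequential(nn.Linear(IMG_TOKENS * D_MODEL, 3 * 32 * 32), nn.Sigmoid())

@torch.no_grad()
def generate(text_tokens: torch.Tensor, seg_tokens: torch.Tensor) -> torch.Tensor:
    """Sample image tokens conditioned on the caption and the segmentation map."""
    cond = torch.cat([text_emb(text_tokens), seg_emb(seg_tokens)], dim=1)  # (1, Lc, D)
    img_tokens = torch.zeros(1, 0, dtype=torch.long)
    for _ in range(IMG_TOKENS):                      # one image token at a time
        seq = torch.cat([cond, img_emb(img_tokens)], dim=1)
        logits = to_logits(backbone(seq)[:, -1])     # predict the next image token
        next_tok = torch.multinomial(logits.softmax(-1), 1)
        img_tokens = torch.cat([img_tokens, next_tok], dim=1)
    pixels = decoder(img_emb(img_tokens).flatten(1))  # tokens -> toy 32x32 RGB image
    return pixels.view(1, 3, 32, 32)

# image = generate(torch.randint(0, VOCAB_TEXT, (1, 12)),   # "caption" tokens
#                  torch.randint(0, VOCAB_SEG, (1, 64)))    # "sketch"/map tokens
```

With random weights this of course only produces noise; the sketch is just meant to show where the text, the map, the transformer, and the decoder each fit in the flow the narration describes.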
5:05
And voilà! This is how Meta's new model is able to take sketch and text inputs and generate a high-definition image out of them, allowing more control over the results with great quality.
5:18
And as they say, it's just the beginning of this new kind of AI model. The approaches will just keep improving, both in terms of quality and availability for the public, which is super exciting. Many artists are already using the model for their own work, as described in Meta's blog post, and I'm excited about when we will be able to use it too. Their approach doesn't require any coding knowledge, only a good sketching hand and some prompt engineering, which means trial and error with the text inputs, tweaking the formulations and words used to produce different and better results.
5:53
Of course, this was just an overview of the new Make-A-Scene approach, and I invite you to read the full paper linked below for a complete overview of how it works. I hope you've enjoyed this video, and I will see you next week with another amazing paper.
6:09
[Music]