NVIDIA and Tel Aviv University's Approach to Conditioning Text-to-Image Models by @whatsai


Too Long; Didn't Read

Text-to-Image models like DALLE or Stable Diffusion are really cool and allow us to generate fantastic pictures with a simple text input. But would it be even cooler to give them a picture of you and ask it to turn it into a painting? Imagine being able to send any picture of an object, a person, or even your cat, and ask the model to transform it into another style, like turning yourself into a cyborg, or into your preferred artistic style, or adding it to a new scene.

Louis Bouchard



Basically, how cool would it be to have a version of DALLE we could use to photoshop our pictures instead of getting random generations? A personalized DALLE would also make it much simpler to control the generation, since "an image is worth a thousand words." It would be like having a DALLE model that is just as personalized and addictive as the TikTok algorithm.

Well, this is what researchers from Tel Aviv University and NVIDIA worked on. They developed an approach for conditioning text-to-image models, like Stable Diffusion, which I covered last week, with a few images that represent any object or concept through the words you send along with your images, transforming the object of your input images into whatever you want! Learn more in the video...
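In the paper, this conditioning comes down to optimizing a single new word embedding against the frozen generator's usual training loss. Roughly, with notation adapted from the paper (this is a sketch, not the exact formulation):

$$
v_* = \arg\min_{v}\; \mathbb{E}_{z \sim \mathcal{E}(x),\, y,\, \epsilon \sim \mathcal{N}(0, I),\, t}\Big[\big\lVert \epsilon - \epsilon_\theta\big(z_t, t, c_\theta(y)\big)\big\rVert_2^2\Big]
$$

where $v$ is the embedding of the new pseudo-word, $y$ is a prompt containing it, and $\epsilon_\theta$ is the frozen latent diffusion denoiser. Only $v$ receives gradient updates; everything else stays fixed.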

References

►Read the full article: https://www.louisbouchard.ai/imageworthoneword/
►Paper: Gal, R., Alaluf, Y., Atzmon, Y., Patashnik, O., Bermano, A.H.,
Chechik, G. and Cohen-Or, D., 2022. An Image is Worth One Word:
Personalizing Text-to-Image Generation using Textual Inversion. https://arxiv.org/pdf/2208.01618v1.pdf
►Code: https://textual-inversion.github.io/
►My Newsletter (A new AI application explained weekly to your emails!): https://www.louisbouchard.ai/newsletter/

Video Transcript

Text-to-image models like DALLE or Stable Diffusion are really cool and allow us to generate fantastic pictures with a simple text input. But would it be even cooler to give them a picture of you and ask it to turn it into a painting? Imagine being able to send any picture of an object, a person, or even your cat, and ask the model to transform it into another style, like turning yourself into a cyborg, into your preferred artistic style, or even adding it into a new scene.

Basically, how cool would it be to have a version of DALLE we could use to photoshop our pictures instead of getting random generations? Having a personalized DALLE would also make it much simpler to control generations, as "an image is worth a thousand words." It would be like having a DALLE model that is just as personalized and addictive as the TikTok algorithm.

Well, this is what researchers from Tel Aviv University and NVIDIA worked on. They developed an approach for conditioning text-to-image models, like Stable Diffusion, which I covered last week, with a few images that represent any object or concept through the words you send along with your images, transforming the object of your input images into whatever you want. Of course, the results still need work, but this is just the first paper tackling such an amazing task, one that could revolutionize the design industry. As a fantastic YouTuber colleague would say: just imagine two more papers down the line!

So how can we take a handful of pictures of an object and generate a new image following a text condition input to add the style or transformation details? To answer this complex question, let's have a look at what Rinon Gal and his team came up with. The input images are encoded into what they call a pseudo-word that you can then use within your text prompts, thus the paper's name: "An Image is Worth One Word." But how do they get this sort of word, and what is it?
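Concretely, the pseudo-word is just a new row added to the text encoder's token-embedding table, one that no real word maps to. Here is a minimal sketch of that idea with a toy vocabulary and NumPy (all names and sizes are illustrative, not from the paper's code):

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy vocabulary and its (frozen) embedding table: one row per token.
vocab = {"a": 0, "photo": 1, "of": 2, "cat": 3}
embed_dim = 4
embeddings = rng.normal(size=(len(vocab), embed_dim))

# Add the pseudo-word "S*" as a brand-new token with its own embedding row.
# Only this row would be optimized during training; all others stay frozen.
vocab["S*"] = len(vocab)
pseudo_embedding = rng.normal(size=(1, embed_dim))
embeddings = np.vstack([embeddings, pseudo_embedding])

def encode(prompt):
    """Map a whitespace-separated prompt to its sequence of embeddings."""
    return np.stack([embeddings[vocab[tok]] for tok in prompt.split()])

# The pseudo-word drops into prompts like any other token.
seq = encode("a photo of S*")
print(seq.shape)  # (4, 4): four tokens, each a 4-dim embedding
```

Once the new row is learned, any prompt containing "S*" carries the concept captured from your images, which is why it composes freely with other words.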

They start with three to five images of a specific object. They also use a pre-trained text-to-image model; in this case, latent diffusion, the model I covered not even a week ago, which takes any kind of input, like images or text, and generates new images out of them. You can see it as a cooler and open-source DALLE. If you haven't watched my video yet, you should pause this one, learn about that model, and come back here. You'll love the video and learn about the hottest architecture of the moment.

So you have your input images and the base model for generating images conditioned on inputs such as text or other images. But what do you do with your three to five images of an object, and how do you control the model's results so precisely that your object appears in the generations? This is all done during the training process of your second model, the text encoder, using your pre-trained and fixed image generator model, latent diffusion in this case, which is already able to take a picture and reconstruct it. You want to teach your text encoder model to match the pseudo-word to your encoded images, or in other words, to the representations taken from your five images. So you will feed your images to your image generator network and train your text encoder in reverse to find out what fake word, or pseudo-word, would best represent all your encoded images. Basically, you find out how to correctly represent your concept in the same space where the image generation process I described in my previous video happens, then extract a fake word out of it to guide future generations. This way, you can inject your concept into any future generation and add a few more words to condition it even further, using the same pre-trained text-to-image model.

So you will simply be training a small model to understand where your images lie in the latent space and to convert them into a fake word to use in the regular image generation model. You don't even have to touch the image generation model, and that's quite a big deal considering how expensive these models are to train. And voilà! This is how you can teach such a model to generate image variations of your preferred object or perform powerful style transfers.
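The training trick described above, optimizing one embedding while the generator stays frozen, can be sketched with a deliberately tiny stand-in: a fixed linear map plays the role of the frozen generator, and plain gradient descent updates the embedding alone. This is an illustrative NumPy toy, not the paper's actual diffusion objective:

```python
import numpy as np

rng = np.random.default_rng(42)

embed_dim, image_dim = 8, 16

# Frozen, pre-trained "generator": here just a fixed linear map from
# embedding space to image space, standing in for the (much bigger)
# latent diffusion model. It is never updated.
G = rng.normal(size=(image_dim, embed_dim))

# A handful of example "images" of the same concept (3-5 in the paper),
# flattened to vectors for this toy setup.
targets = rng.normal(size=(4, image_dim))

# The only trainable parameter: the pseudo-word's embedding vector.
v = np.zeros(embed_dim)

def loss(v):
    """Mean squared error between the generated image and every example."""
    return np.mean((G @ v - targets) ** 2)

initial = loss(v)
mean_target = targets.mean(axis=0)
lr = 0.01
for _ in range(500):
    # Gradient of the loss with respect to v alone; G stays frozen.
    grad = (2 / image_dim) * G.T @ (G @ v - mean_target)
    v -= lr * grad

print(f"loss: {initial:.3f} -> {loss(v):.3f}")
```

Only `v` moves; the "generator" `G` is never touched. That mirrors why the method is cheap: the expensive pre-trained model stays fixed while a single vector is learned.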

Of course, this is just an overview of this new method tackling a very, very interesting task, and I invite you to read their paper, linked below, for a deeper understanding of the approach and its challenges. It's a very complicated task, and there are still a lot of limitations, like the time it takes to capture such a concept in a fake word, which is roughly two hours. It's also not yet capable of completely understanding the concept, but it is pretty damn close. There are also a lot of risks in having such a product accessible that we need to consider. Imagine being able to embed the concept of a specific person and generate anything involving that person in a few seconds. This is quite scary, and this kind of technology is just around the corner.

I'd love to hear your thoughts in the comment section, or discuss this on the Discord server. Thank you for watching the video, and I will see you next week with another amazing paper!

[Music]


