We’ve seen AI generate images from other images using GANs. Then, there were models able to generate questionable images using text. In early 2021, DALL-E was published, beating all previous attempts to generate images from text input, using CLIP, a model that links images with text, as a guide. A very similar task called image captioning may sound really simple but is, in fact, just as complex. It is the ability of a machine to generate a natural description of an image. It's easy to simply tag the objects you see in the image, but it is quite another challenge to understand what's happening in a single 2-dimensional picture, and this new model does it extremely well!

Watch the video

References
►Read the full article: https://www.louisbouchard.ai/clipcap/
►Paper: Mokady, R., Hertz, A. and Bermano, A.H., 2021. ClipCap: CLIP Prefix for Image Captioning. https://arxiv.org/abs/2111.09734
►Code: https://github.com/rmokady/CLIP_prefix_caption
►Colab Demo: https://colab.research.google.com/drive/1tuoAC5F4sC7qid56Z0ap-stR3rwdk0ZV?usp=sharing
►My Newsletter (A new AI application explained weekly to your emails!): https://www.louisbouchard.ai/newsletter/

Video Transcript

00:00 We've seen AI generate images from other 00:02 images using GANs. Then, there were models 00:05 able to generate questionable images 00:07 using text. In early 2021, DALL-E was 00:10 published, beating all previous attempts 00:12 to generate images from text input using 00:14 CLIP, a model that links images with text, 00:16 as a guide. A very similar task called 00:19 image captioning may sound really simple 00:21 but is, in fact, just as complex. It's the 00:23 ability of a machine to generate a 00:25 natural description of an image. Indeed, 00:28 it's almost as difficult, as the machine 00:30 needs to understand the image and the 00:32 text it generates, just like in text-to-image 00:34 synthesis. It's easy to simply tag 00:36 the objects you see in the image; this 00:38 can be done using a regular 00:39 classification model. But it's quite 00:41 another challenge to understand what's 00:43 happening in a single two-dimensional 00:45 picture. Humans can do it quite easily 00:47 since we can interpolate from our past 00:49 experience, and we can even put ourselves 00:51 in the place of the person in the 00:52 picture and quickly get what's going on. 00:55 This is a whole other challenge for a 00:56 machine that only sees pixels. Yet these 00:59 researchers published an amazing new 01:01 model that does this extremely well.

In 01:03 order to publish such a great paper 01:05 about image captioning, the researchers 01:07 needed to run many, many experiments. Plus, 01:10 their code is fully available on GitHub, 01:12 which means it is reproducible. These are 01:14 two of the strong points of this episode's 01:16 sponsor, Weights & Biases. If you want 01:18 to publish papers in big conferences or 01:20 journals and do not want to be part of 01:22 the 75% of researchers that do not 01:25 share their code, I'd strongly suggest 01:27 using Weights & Biases. It changed my 01:29 life as a researcher and my work in my 01:31 company. Weights & Biases will 01:32 automatically track each run: the hyperparameters, 01:34 the GitHub version, hardware 01:37 and OS used, the Python version, packages 01:39 installed, and training script, everything 01:41 you need for your code to be 01:42 reproducible without you even trying. It 01:45 just needs a line of code to tell it what 01:47 to track once, and that's it. Please don't 01:50 be like most researchers that keep their 01:52 code a secret, I assume mostly because it 01:54 is hardly reproducible, and try out 01:56 Weights & Biases with the first link 01:58 below.
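(For reference, here is a minimal sketch of what that single line of tracking looks like with Weights & Biases. The project name, hyperparameters, and loss values below are placeholders for illustration, not values from the paper.)

```python
# Minimal Weights & Biases tracking sketch; names and values are placeholders.
import wandb

config = {"learning_rate": 2e-5, "batch_size": 40, "prefix_length": 10}
run = wandb.init(project="clip-prefix-captioning", config=config)  # hypothetical project name

for step in range(100):
    loss = 1.0 / (step + 1)  # stand-in for a real training loss
    wandb.log({"train/loss": loss})

run.finish()  # mark the run as finished
```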
As the researchers explicitly said, 02:00 image captioning is a fundamental task 02:02 in vision-language understanding, and I 02:05 entirely agree. The results are fantastic, 02:07 but what's even cooler is how it works. 02:09 So let's dive into the model and its 02:12 inner workings a little. Before doing so, 02:14 let's quickly review what image 02:16 captioning is. Image captioning is where 02:18 an algorithm will predict a textual 02:20 description of a scene inside an image. 02:22 Here it will be done by a machine, and in 02:25 this case it will be a machine learning 02:27 algorithm. This algorithm will only have 02:29 access to the image as input and will 02:31 need to output such a textual 02:33 description of what is happening in the 02:36 image. In this case, the researchers used 02:38 CLIP to achieve this task. If you are not 02:40 familiar with how CLIP works or why it's 02:42 so amazing, I'd strongly invite you to 02:45 watch one of the many videos I made 02:47 covering it. In short, CLIP links images 02:49 to text by encoding both types of data 02:52 into one similar representation where 02:54 they can be compared. This is just like 02:56 comparing movies with books using a 02:58 short summary of the piece: given only 03:00 such a summary, you can tell what it's 03:02 about and compare both, but you have no 03:04 idea whether it's a movie or a book. In 03:07 this case, the movies are images and the 03:09 books are text descriptions. Then CLIP 03:11 creates its own summary to allow simple 03:14 comparisons between both pieces using 03:16 distance calculations on their differences. 03:19 You can already see how CLIP seems 03:21 perfect for this task, but it requires a 03:23 bit more work to fit our needs. Here CLIP 03:26 will simply be used as a tool to compare 03:28 text inputs with image inputs, so we 03:30 still need to generate such a text that 03:32 could potentially describe the image. 03:34 Instead of comparing the text to images 03:36 using CLIP's encoding, they will simply 03:39 encode the image using CLIP's network and 03:41 use the generated encoded information as 03:44 a way to guide a future text generation 03:46 process using another model. Such a task 03:49 can be performed by any language model 03:51 like GPT-3, which could improve their 03:53 results, but the researchers opted for 03:54 its predecessor, GPT-2, a smaller and more 03:57 intuitive version of the powerful OpenAI 04:00 model. They are basically conditioning 04:02 the text generation from GPT-2 using 04:04 CLIP's encoding. So CLIP's model is already 04:07 trained, and they also used a pre-trained 04:10 version of GPT-2 that they will further 04:12 train using the CLIP encoding as a 04:14 guide to orient the text generation. It's 04:17 not that simple, since they still need to 04:19 adapt the CLIP encoding to a 04:21 representation that GPT-2 can understand, 04:23 but it isn't that complicated either. It 04:25 will simply learn to transfer the CLIP 04:27 encoding into multiple vectors with the 04:30 same dimensions as a typical word 04:32 embedding. This step of learning how to 04:34 match CLIP's outputs to GPT-2's inputs is 04:37 the step that will be taught during 04:39 training, as both GPT-2 and CLIP are 04:41 already trained and they are powerful 04:43 models for their respective tasks. So 04:45 you can see this as a third model, called 04:47 a mapping network, with the sole 04:49 responsibility of translating one's 04:51 language into the other, which is still a 04:53 challenging task.
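(To make the summary analogy a bit more concrete, here is a minimal sketch of comparing one image with a few candidate captions in CLIP's shared space. It assumes the openai/CLIP package from https://github.com/openai/CLIP and a local image file named photo.jpg, both just for illustration.)

```python
# Minimal sketch of CLIP's shared "summary" space, assuming the openai/CLIP
# package is installed and a local image file "photo.jpg" exists.
import torch
import clip
from PIL import Image

device = "cuda" if torch.cuda.is_available() else "cpu"
model, preprocess = clip.load("ViT-B/32", device=device)

image = preprocess(Image.open("photo.jpg")).unsqueeze(0).to(device)
texts = clip.tokenize(["a dog playing in the snow",
                       "a plate of pasta on a table"]).to(device)

with torch.no_grad():
    image_features = model.encode_image(image)   # one vector per image
    text_features = model.encode_text(texts)     # one vector per caption

# Normalize and compare with cosine similarity: the caption whose "summary"
# lands closest to the image's "summary" is the best match.
image_features = image_features / image_features.norm(dim=-1, keepdim=True)
text_features = text_features / text_features.norm(dim=-1, keepdim=True)
similarity = image_features @ text_features.T
print(similarity)
```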
If you are curious 04:55 about the actual architecture of such a 04:57 mapping network, they tried both a 04:59 simple multi-layer perceptron, or MLP, and 05:02 a transformer architecture, confirming 05:04 that the latter is more powerful at 05:06 learning a meticulous set of embeddings 05:08 that will be more appropriate for the 05:09 task when using powerful pre-trained 05:11 language models. If you are not familiar 05:14 with transformers, you should take five 05:15 minutes to watch the video I made 05:17 covering them, as you will only stumble 05:19 upon this type of network more often 05:21 in the near future. The model is very 05:23 simple and extremely powerful. Just 05:26 imagine having CLIP merged with GPT-3 in 05:28 such a way. We could use such a model to 05:30 describe movies automatically or create 05:32 better applications for blind and 05:34 visually impaired people. That's 05:36 extremely exciting for real-world 05:38 applications. Of course, this was just an 05:40 overview of this new model, and you can 05:42 find more details about the 05:44 implementation in the paper linked in 05:46 the description below. I hope you enjoyed 05:48 the video, and if so, please take a second 05:51 to share it with a friend who could 05:52 find this interesting. It will mean a lot 05:55 and help this channel grow. Thank you for 05:57 watching, and stay tuned for my next 05:59 video, the last one of the year and quite 06:01 an exciting one!
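To give a rough idea of the prefix mechanism described in the video, here is a short PyTorch sketch, not the authors' code (which is linked in the references above): an MLP mapping network turns a CLIP image embedding into a few vectors in GPT-2's word-embedding space, and GPT-2 reads them as a prefix. The dimensions (512 for CLIP ViT-B/32, 768 for GPT-2, a prefix of 10 vectors) and the random stand-in embedding are assumptions for illustration.

```python
# Rough sketch of the CLIP-prefix idea: an MLP maps a CLIP image embedding to a
# few "prefix" vectors in GPT-2's embedding space, and GPT-2 continues from them.
import torch
import torch.nn as nn
from transformers import GPT2LMHeadModel, GPT2Tokenizer

class MappingNetwork(nn.Module):
    """Translate one CLIP embedding into `prefix_length` GPT-2-sized embeddings."""
    def __init__(self, clip_dim=512, gpt_dim=768, prefix_length=10):
        super().__init__()
        self.prefix_length = prefix_length
        self.gpt_dim = gpt_dim
        self.mlp = nn.Sequential(
            nn.Linear(clip_dim, gpt_dim * prefix_length // 2),
            nn.Tanh(),
            nn.Linear(gpt_dim * prefix_length // 2, gpt_dim * prefix_length),
        )

    def forward(self, clip_embedding):                  # (batch, clip_dim)
        prefix = self.mlp(clip_embedding)               # (batch, prefix_length * gpt_dim)
        return prefix.view(-1, self.prefix_length, self.gpt_dim)

tokenizer = GPT2Tokenizer.from_pretrained("gpt2")
gpt2 = GPT2LMHeadModel.from_pretrained("gpt2")          # pre-trained language model
mapper = MappingNetwork()                               # the part that gets trained

clip_embedding = torch.randn(1, 512)                    # stand-in for a real CLIP image encoding
prefix_embeds = mapper(clip_embedding)                  # (1, 10, 768)

# During training, the prefix is concatenated with the embedded caption tokens and
# GPT-2 is asked to predict the caption conditioned on that prefix.
caption_ids = tokenizer("A dog playing in the snow.", return_tensors="pt").input_ids
caption_embeds = gpt2.transformer.wte(caption_ids)      # (1, seq_len, 768)
inputs_embeds = torch.cat([prefix_embeds, caption_embeds], dim=1)
outputs = gpt2(inputs_embeds=inputs_embeds)
print(outputs.logits.shape)                             # (1, 10 + seq_len, vocab_size)
```

At inference time, only the image is given: its CLIP encoding is mapped to the prefix and GPT-2 decodes the caption token by token from there.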