CLIP, short for “Contrastive Language-Image Pre-training”, is one of the well-known algorithms introduced in the paper “Learning Transferable Visual Models From Natural Language Supervision”, published by researchers at OpenAI, an artificial intelligence research laboratory. CLIP is most widely used in computer-vision use cases built around “DALL-E 2”, another algorithm developed by the OpenAI team. More precisely, CLIP acts as a helper model for the “DALL-E 2” algorithm. But don’t assume that CLIP is not powerful just because it is used as a helper model :)
Despite being a helper model, CLIP is considered an important step in deep learning research, and we can use it to solve problems on its own, even without “Dall-E 2”. In this article, we will discuss the objective, the working procedure and some of the pros and cons of CLIP through real-life examples, and see how it can simplify our deep learning projects.
The primary objective of the CLIP algorithm is to find, from a list of texts, the one that is most similar to a given image.
For example,
Let us consider the following image as input-
And let's say we have some of the texts in a given list-
The primary task of a CLIP model is to match the most appropriate text from the given list to the input image as shown below-
Fundamentally, this is an artificial neural network that treats each text in the list as a class and assigns a probability value to each text with respect to the image. The text that receives the highest probability value is taken as the output.
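To make this concrete, here is a minimal sketch of that zero-shot matching step, using the publicly released CLIP weights through the Hugging Face transformers library. The image file name and the list of candidate texts are placeholders for illustration.

```python
# A minimal sketch of zero-shot image-text matching with CLIP.
# "input.jpg" and the candidate texts are placeholders.
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

image = Image.open("input.jpg")                      # the input image
texts = ["a photo of a dog", "a photo of a cat",
         "a photo of a kennel", "a photo of a jail"] # the list of candidate texts

inputs = processor(text=texts, images=image, return_tensors="pt", padding=True)
with torch.no_grad():
    outputs = model(**inputs)

# One similarity score per (image, text) pair, turned into probabilities.
probs = outputs.logits_per_image.softmax(dim=-1)
print(probs)
print("Best match:", texts[probs.argmax().item()])   # the text CLIP picks as the "class"
```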
One big advantage of CLIP is that it has been exposed to essentially the full breadth of the English language during training. Some of the characteristics that make the CLIP model special compared with other similar algorithms are:
The CLIP model is not restricted to a single word in the text. Instead, it tries to extract every piece of information from all the words in the input sentence and all the pixels of the image. It takes into account every aspect of an input image, such as the objects in the background, colours, shapes, etc.
For example,
Let us consider the following input image-
All of the texts in the given list except the last one look like a plausible match for the input. Most other models would struggle to assign a high probability to any single class. CLIP, however, analyses the patterns of every element in this image, such as the kennel, the cell, the dog and so on.
The sunlight appears to be coming in from outside, so this should be an indoor structure. Also, there is an animal present rather than a human, so it is unlikely to be a jail and more likely to be a kennel.
Such advanced analysis, considering every aspect of the image and the text, might not be possible for other models in the same league.
The CLIP algorithm has been trained on 400 million images with paired text descriptions, which gives it broad knowledge of the visual world and makes it capable of handling complex images and texts with confidence.
The ImageNet dataset consists of only 1.2 million images; 400 million is more than 300 times that. Most of the 400 million images were scraped directly from the internet, which makes the collection large and highly diverse and increases the model’s pattern-detection capability.
To build the CLIP architecture, we need to encode both the images and the corresponding texts into mathematical vectors. This is because a machine learning algorithm cannot work with information in a raw visual or textual format; we first need to convert it into numerical values.
The image input is converted into a mathematical vector using a Vision Transformer or a ResNet encoder.
The textual input is converted into a mathematical vector using a Transformer encoder-
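Below is a minimal sketch of these two encoding steps, again using the released CLIP weights via Hugging Face transformers. The file name and the sentence are placeholders; each encoder returns a fixed-length embedding vector that lives in the same shared space.

```python
# A minimal sketch of CLIP's two encoders ("dog.jpg" and the sentence are placeholders).
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

with torch.no_grad():
    # Image -> vector via the image encoder.
    image_inputs = processor(images=Image.open("dog.jpg"), return_tensors="pt")
    image_vector = model.get_image_features(**image_inputs)   # shape: (1, 512)

    # Text -> vector via the Transformer text encoder.
    text_inputs = processor(text=["a dog resting in a kennel"],
                            return_tensors="pt", padding=True)
    text_vector = model.get_text_features(**text_inputs)      # shape: (1, 512)

# Cosine similarity between the two vectors measures how well they match.
similarity = torch.nn.functional.cosine_similarity(image_vector, text_vector)
print(similarity)
```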
Since we have a list of image-text pairs, we need a notation to refer to them.
Each image is denoted as I1, I2, I3, …, IN.
Each text is denoted as T1, T2, T3, …, TN.
After that, we need to build a similarity matrix with each of the images as rows and each of the texts as columns.
As shown in the image above, the diagonal image-text pairs will have a higher similarity because they refer to the same context. The non-diagonal elements are random pairs that do not belong to the same context, so their similarity values will be low.
The goal of the optimisation function is to increase the similarity values of the diagonal pairs as much as possible and to decrease the similarity between the non-diagonal image-text pairs.
At some point during training, the model learns the hidden patterns that match images and texts belonging to the same context and distinguish images and texts belonging to different contexts.
This procedure is technically called “Contrastive pre-training”.
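The sketch below shows a simplified version of this contrastive objective in PyTorch. Here image_embeds and text_embeds stand for a batch of N already-encoded image and text vectors; the actual model also learns the temperature as a parameter, so treat this as an illustration rather than the exact training code.

```python
# A simplified sketch of contrastive pre-training over the N x N similarity matrix.
import torch
import torch.nn.functional as F

def contrastive_loss(image_embeds, text_embeds, temperature=0.07):
    # Normalise so the dot product becomes cosine similarity.
    image_embeds = F.normalize(image_embeds, dim=-1)
    text_embeds = F.normalize(text_embeds, dim=-1)

    # N x N similarity matrix: rows are images I1..IN, columns are texts T1..TN.
    logits = image_embeds @ text_embeds.t() / temperature

    # The matching pairs sit on the diagonal, so the "correct class"
    # for row i (and for column i) is simply index i.
    targets = torch.arange(logits.size(0))

    # Symmetric cross-entropy: push diagonal similarities up and
    # off-diagonal similarities down, in both directions.
    loss_images = F.cross_entropy(logits, targets)
    loss_texts = F.cross_entropy(logits.t(), targets)
    return (loss_images + loss_texts) / 2

# Example with random embeddings, just to show the shapes involved.
loss = contrastive_loss(torch.randn(8, 512), torch.randn(8, 512))
print(loss)
```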
CLIP is considered a “computationally efficient” algorithm because it uses Transformer-based encoders for images and texts, which process the data in parallel. Algorithms such as LSTMs or RNNs process the data for encoding sequentially, which can consume far more time and memory.
Since CLIP can match an image with a full sentence, the researchers usually create a text prompt such as “A photo of a _____”. Then, while iterating through the list of class labels, the program automatically inserts each label into this text prompt, like-
Each resulting prompt is then encoded and compared with the encoded vector of the input image to calculate a similarity value.
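A small sketch of this prompt-template trick is shown below; the labels are placeholders, and the resulting sentences are what actually get fed to the text encoder from the earlier snippets.

```python
# Slot each candidate label into the "A photo of a ____" template (labels are placeholders).
labels = ["dog", "cat", "kennel", "jail"]
prompts = [f"A photo of a {label}" for label in labels]
print(prompts)
# ['A photo of a dog', 'A photo of a cat', 'A photo of a kennel', 'A photo of a jail']
# These prompt sentences are then encoded with the text encoder and compared
# against the encoded image, exactly as in the earlier snippets.
```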
On datasets with training splits, the performance of zero-shot CLIP is on average competitive with the simple supervised baseline of a linear classifier on top of ResNet-50 features. On most of these datasets, the performance of this baseline is now well below the overall state of the art. Significant work is still needed to improve the task-learning and transfer capabilities of CLIP. While scaling has so far steadily improved performance and suggests a route for continued improvement, researchers estimate around a 1000x increase in computing is required for zero-shot CLIP to reach overall state-of-the-art performance. This is infeasible to train with current hardware. Further research into improving the computational and data efficiency of CLIP will be necessary.
It is found that CLIP’s zero-shot performance is still quite weak on several kinds of tasks. When compared to task-specific models, the performance of CLIP is poor on several types of fine-grained classification such as differentiating models of cars, species of flowers, and variants of aircraft. CLIP also struggles with more abstract and systematic tasks such as counting the number of objects in an image. Finally, for novel tasks which are unlikely to be included in CLIP’s pre-training dataset, such as classifying the distance to the nearest car in a photo, CLIP’s performance can be near random.
While zero-shot CLIP generalizes well to the many natural image distributions that have been investigated, the researchers observed that it still generalizes poorly to data that is truly out-of-distribution for it.
For example, CLIP learns a high-quality semantic OCR representation that performs well on digitally rendered text, which is common in its pre-training dataset, as evidenced by performance on Rendered SST2.
However, CLIP only achieves 88% accuracy on the handwritten digits of MNIST. An embarrassingly simple baseline of logistic regression on raw pixels outperforms zero-shot CLIP. Both semantic and near-duplicate nearest-neighbour retrieval verify that there are almost no images resembling MNIST digits in CLIP’s pre-training dataset.
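For reference, a sketch of that “embarrassingly simple” baseline is shown below: plain logistic regression on raw MNIST pixels using scikit-learn. The hyper-parameters here are illustrative, not those used in the paper.

```python
# Logistic regression on raw MNIST pixels -- a simple baseline that beats zero-shot CLIP.
from sklearn.datasets import fetch_openml
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Load the 70,000 MNIST digits as flat 784-pixel vectors.
X, y = fetch_openml("mnist_784", version=1, return_X_y=True, as_frame=False)
X = X / 255.0  # scale raw pixel values to [0, 1]

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=10000)
clf = LogisticRegression(max_iter=200).fit(X_train, y_train)
print(clf.score(X_test, y_test))  # typically around 0.92, above zero-shot CLIP's 88%
```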
This suggests CLIP does little to address the underlying problem of brittle generalization of deep learning models. Instead, CLIP tries to circumvent the problem and hopes that by training on such a large and varied dataset, all data will be effectively in-distribution. This is a naive assumption that, as MNIST demonstrates, is easy to violate.
Although CLIP can flexibly generate zero-shot classifiers for a wide variety of tasks and datasets, CLIP is still limited to choosing from only those concepts in a given zero-shot classifier. This is a significant restriction compared to a truly flexible approach like image captioning which could generate novel outputs.
CLIP also does not address the poor data efficiency of deep learning. Instead, CLIP compensates by using a source of supervision that can be scaled to hundreds of millions of training examples. If every image seen during training of a CLIP model was presented at a rate of one per second, it would take 405 years to iterate through the 12.8 billion images seen over 32 training epochs. Combining CLIP with self-supervision and self-training methods is a promising direction given their demonstrated ability to improve data efficiency over standard supervised learning.
Some of the areas where CLIP has been used to solve real-world use cases are:
There is a website called “paint.wtf” where we can play Pictionary, and the players are judged by CLIP.
CLIP can be used to implement content filters such as NSFW (Not Safe For Work) detection.
“DALL-E”, an algorithm by OpenAI, uses CLIP as a helper model, as we discussed earlier.
CLIP is used to index photos on websites like Unsplash.
CLIP can be used to find appropriate images for complex language such as poetry, riddles, rhymes, novels, etc.
CLIP can also be used to pick out images that are corrupt or distorted. A research paper titled “Inverse Problems Leveraging Pre-trained Contrastive Representations” demonstrates how a supervised inversion method can be used to obtain effective representations of corrupted images.
Released in 2021, a generative model called VQGAN+CLIP, which pairs CLIP with a Vector Quantized Generative Adversarial Network (VQGAN), is used within the text-to-image paradigm to generate images of variable size from a set of text prompts. Unlike VQGAN, however, CLIP is not a generative model; it is simply trained to represent both images and text effectively.
It is an undeniable fact in the deep learning industry that CLIP has paved the way for the development of more advanced algorithms for solving complex use cases related to image processing and NLP.
CLIP can be considered an innovative bridge between computer vision and NLP. Also, since it does not require task-specific training data, it is possible to feed it huge amounts of text data, and it will slowly get better and better at more unrelated tasks.
We can eagerly await the breakthrough advancements that CLIP will provide in the future. I hope this has given you a lucid, basic introduction to the concept behind the CLIP algorithm.
I have added links to the research papers in the reference section, which you can use if you need to refer to the in-depth implementations.
CLIP documentation from OpenAI
“Learning Transferable Visual Models From Natural Language Supervision" - Alec Radford, Jong Wook Kim, Chris Hallacy, Aditya Ramesh, Gabriel Goh, Sandhini Agarwal, Girish Sastry, Amanda Askell, Pamela Mishkin, Jack Clark, Gretchen Krueger, Ilya Sutskever
“Inverse Problems Leveraging Pre-trained Contrastive Representations"- Sriram Ravula*, Georgios Smyrnis*, Matt Jordan, Alexandros G. Dimakis, The University of Texas at Austin, NeurIPS 2021
"VQGAN-CLIP: Open Domain Image Generation and Editing with Natural Language Guidance"- Katherine Crowson, Stella Biderman,, Daniel Kornis, Dashiell Stander, Eric Hallahan, Louis Castricato, and Edward Raff
Radford, Alec; Narasimhan, Karthik; Salimans, Tim; Sutskever, Ilya (2018). "Improving Language Understanding by Generative Pre-Training". OpenAI.
Johnson, Khari (5 January 2021). "OpenAI debuts DALL-E for generating images from text". VentureBeat.
Ramesh, Aditya; Dhariwal, Prafulla; Nichol, Alex; Chu, Casey; Chen, Mark (12 April 2022). "Hierarchical Text-Conditional Image Generation with CLIP Latents". arXiv:2204.06125.