Recent AI models such as ChatGPT and Midjourney have showcased impressive capabilities in generating text and images.
However, there are also models that specialize in understanding such inputs, like the Vision Transformer (ViT) for images and the Pathways Language Model (PaLM) for text. These models can interpret and extract the meaning of images and sentences.
Combining a text model with an image model would give us an AI that can make sense of several forms of data at once, understanding most of what we show or tell it.
Still, such a model might seem limited at first glance: it can only understand, not act. But what if it were connected to a robotic system that can move in the physical world? This is where PaLM-E comes in.
Google's latest publication, PaLM-E, is an embodied multimodal language model.
This means it can interpret several types of data at once, processing images with a ViT-style encoder and text with PaLM, and turn that combined understanding into actions carried out by a robotic arm.
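To make this more concrete, here is a minimal sketch of how image features from a vision encoder could be projected into a language model's token-embedding space and interleaved with text tokens, which is the core trick behind embodied multimodal models like PaLM-E. This is not PaLM-E's actual code: the dimensions, class names, and the simple linear projection are illustrative assumptions.

```python
import torch
import torch.nn as nn

# Illustrative sizes only; the real models are much larger.
VIT_DIM = 768   # dimension of the vision encoder's patch features (assumed)
LM_DIM = 4096   # dimension of the language model's token embeddings (assumed)

class MultimodalPrefix(nn.Module):
    """Projects ViT image features into the language model's embedding space
    and interleaves them with text token embeddings, so the language model
    can attend over images and words as one sequence."""

    def __init__(self):
        super().__init__()
        # A simple linear projection from vision space to language space.
        self.project = nn.Linear(VIT_DIM, LM_DIM)

    def forward(self, image_feats, text_embeds):
        # image_feats: (num_patches, VIT_DIM) from a ViT-style encoder
        # text_embeds: (num_tokens, LM_DIM) from the LM's embedding table
        image_tokens = self.project(image_feats)  # (num_patches, LM_DIM)
        # Prepend the image "tokens" to the text tokens, forming one
        # multimodal sequence the language model can read.
        return torch.cat([image_tokens, text_embeds], dim=0)

# Usage with random placeholder tensors standing in for real encoder outputs.
prefix = MultimodalPrefix()
fake_image_feats = torch.randn(196, VIT_DIM)  # e.g. 14x14 ViT patches
fake_text_embeds = torch.randn(12, LM_DIM)    # e.g. a 12-token instruction
sequence = prefix(fake_image_feats, fake_text_embeds)
print(sequence.shape)  # torch.Size([208, 4096])
```

Over such a mixed sequence, the language model decodes text, for example a step-by-step plan like "pick up the green block", and that text is interpreted as high-level decisions that a lower-level robot controller executes.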
Learn more in the video…