Gato from DeepMind was just published! It is a single transformer that can play Atari games, caption images, chat with people, control a real robotic arm, and more! Indeed, it is trained once and uses the same weights to achieve all those tasks. And as per DeepMind, this is not only a transformer but also an agent. This is what happens when you mix Transformers with progress on multi-task reinforcement learning agents.

As we said, Gato is a multi-modal agent, meaning that it can create captions for images or answer questions as a chatbot. You'd say that GPT-3 can already do that, but Gato can do more… The multi-modality comes from the fact that Gato can also play Atari games at the human level or even do real-world tasks like controlling robotic arms to move objects precisely. It understands words, images, and even physics... Learn more in the video!

References
►Read the full article: https://www.louisbouchard.ai/deepmind-gato/
►DeepMind's blog post: https://www.deepmind.com/publications/a-generalist-agent
►Paper: Reed, S. et al., 2022, DeepMind: Gato, https://storage.googleapis.com/deepmind-media/A%20Generalist%20Agent/Generalist%20Agent.pdf
►My Newsletter (A new AI application explained weekly to your emails!): https://www.louisbouchard.ai/newsletter/

Video transcript
0:00 Gato from DeepMind was just published! It's a single transformer that can play Atari games, caption images, chat with people, control a real robotic arm, and more. Indeed, it's trained once and uses the same weights to achieve all those tasks. And as per DeepMind, this is not only a transformer but also an agent. This is what happens when you mix transformers with progress on multi-task reinforcement learning agents.

0:23 As we said, Gato is a multi-modal agent, meaning that it can create captions for images or answer questions as a chatbot. You'd say that GPT-3 can already do that, but Gato can do more. The multi-modality comes from the fact that Gato can also play Atari games at the human level or even do real-world tasks like controlling robotic arms to move objects precisely. It understands words, images, and even physics.

0:51 Gato is the first generalist model that performs so well on so many different tasks, and it's extremely promising for the field. It was trained on 604 distinct tasks with varying modalities, observations, and action specifications, making it the perfect generalist. And as I said, it does all that with the same network and weights. And before you ask, it only needs 1.2 billion parameters, compared to GPT-3, which requires 175 billion of them. You don't have to retrain or fine-tune it for each task: you can send both an image and text and it will work. You can even add in a few movements from a robot arm. The model can decide which type of output to provide based on its context, ranging from text to discrete actions in an environment.

1:38 If you enjoyed the video, please consider subscribing, and let me know if you like this kind of news video; I'll definitely do more.
1:46 This is possible because of their tokenization process. Tokenization is when you prepare your inputs for the model, as models do not understand text or images by themselves. For text, language models like Gato take a total number of sub-words, for example 32,000, and each sub-word has a number assigned to it. For images, they follow the ViT patch embedding, using a widely used ResNet block, as we covered in a previous video. They also tokenize the button presses for Atari games, and other discrete values, as integer numbers. Finally, for continuous values like the proprioceptive inputs we talked about with the robotic arms, they encode the different values into numbers and add them after the text tokens. Using all those different inputs, the agent adapts to the current task to generate appropriate outputs. During training, they use prompt conditioning, as in GPT-3, with previously sampled actions and observations.

2:42 The progress in generalist RL agents in the last few years has been incredible, and it came mainly from DeepMind. One could say that they are moving the needle closer to general AI, or human-level intelligence, if we can finally define it. I love how many details they gave in their paper, and I'm excited to see what they will do, or what other people will do, using this model's architecture. The link to the paper, for more information about the model, is in the description.

3:07 I hope you enjoyed this short video. I just saw this news when I woke up, and I couldn't do anything other than make this video before starting my day; it's just too exciting! I will see you next week with another amazing paper!
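To make the tokenization process described in the transcript more concrete, here is a minimal Python sketch of how text, discrete actions, continuous values, and images could be folded into a single token sequence. Everything here is illustrative: the 32,000 sub-word vocabulary comes from the video, but the bin count, ID offsets, and function names are my own assumptions, not DeepMind's exact scheme, and I use a simple uniform binning for continuous values where the paper applies mu-law scaling before discretizing.

```python
# Minimal sketch of a Gato-style multi-modal tokenizer (illustrative only).
import numpy as np

TEXT_VOCAB_SIZE = 32000        # sub-word IDs occupy [0, 32000), per the video
NUM_ATARI_BUTTONS = 18         # assumed discrete-action count
NUM_CONTINUOUS_BINS = 1024     # assumed resolution for continuous values

def tokenize_text(subword_ids):
    """Text arrives as sub-word IDs; each sub-word already has a number."""
    return list(subword_ids)

def tokenize_discrete(button_id):
    """Button presses become integers in their own range after the text IDs."""
    return TEXT_VOCAB_SIZE + button_id

def tokenize_continuous(values, low=-1.0, high=1.0):
    """Map proprioceptive inputs (e.g. joint angles) to integer bins placed
    after the text and discrete-action ranges."""
    values = np.clip(np.asarray(values, dtype=np.float32), low, high)
    bins = ((values - low) / (high - low) * (NUM_CONTINUOUS_BINS - 1)).astype(int)
    return [TEXT_VOCAB_SIZE + NUM_ATARI_BUTTONS + int(b) for b in bins]

def patchify_image(image, patch_size=16):
    """Split an image into non-overlapping ViT-style patches. In the real model
    each patch is embedded with a ResNet block rather than given a token ID."""
    h, w, c = image.shape
    grid = image.reshape(h // patch_size, patch_size, w // patch_size, patch_size, c)
    return grid.transpose(0, 2, 1, 3, 4).reshape(-1, patch_size * patch_size * c)

# One flat sequence mixing modalities, ready to feed a single transformer:
sequence = (tokenize_text([17, 942, 305])       # a few caption sub-words
            + tokenize_continuous([0.1, -0.4])  # robot-arm joint values
            + [tokenize_discrete(3)])           # an Atari button press
patches = patchify_image(np.zeros((64, 64, 3), dtype=np.float32))
print(sequence, patches.shape)                  # token IDs and (16, 768) patches
```

The design point this sketch tries to show is that every modality ends up as integers in one shared vocabulary (or, for image patches, as embeddings inserted alongside those tokens), which is what lets a single network with a single set of weights handle all 604 tasks.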