Gato from DeepMind was just published! It is a single transformer that can play Atari games, caption images, chat with people, control a real robotic arm, and more! Indeed, it is trained once and uses the same weights to achieve all those tasks. And as per DeepMind, this is not only a transformer but also an agent. This is what happens when you mix transformers with progress on multi-task reinforcement learning agents.

Gato is a multi-modal agent, meaning that it can create captions for images or answer questions as a chatbot. You'd say that GPT-3 can already do that, but Gato can do more. The multi-modality comes from the fact that Gato can also play Atari games at a human level or even do real-world tasks like controlling robotic arms to move objects precisely. It understands words, images, and even physics...

Learn more in the video

References
►Read the full article: https://www.louisbouchard.ai/deepmind-gato/
►DeepMind's blog post: https://www.deepmind.com/publications/a-generalist-agent
►Paper: Reed S. et al., 2022, DeepMind: Gato, https://storage.googleapis.com/deepmind-media/A%20Generalist%20Agent/Generalist%20Agent.pdf
►My Newsletter (A new AI application explained weekly to your emails!): https://www.louisbouchard.ai/newsletter/

Video transcript

Gato from DeepMind was just published. It's a single transformer that can play Atari games, caption images, chat with people, control a real robotic arm, and more. Indeed, it is trained once and uses the same weights to achieve all those tasks. And as per DeepMind, this is not only a transformer but also an agent. This is what happens when you mix transformers with progress on multi-task reinforcement learning agents.

As we said, Gato is a multi-modal agent, meaning that it can create captions for images or answer questions as a chatbot. You'd say that GPT-3 can already do that, but Gato can do more. The multimodality comes from the
fact that Gato can also play Atari games at a human level or even do real-world tasks like controlling robotic arms to move objects precisely. It understands words, images, and even physics.

Gato is the first generalist model that performs so well on so many different tasks, and it's extremely promising for the field. It was trained on 604 distinct tasks with varying modalities, observations, and action specifications, making it the perfect generalist. And as I said, it does all that with the same network and weights. And before you ask, it only needs 1.2 billion parameters, compared to GPT-3, which requires 175 billion of them. It's not a trap where you have to retrain or fine-tune it for each task. You can send both an image and text, and it will work. You can even add in a few movements from a robot arm. The model can decide which type of output to provide based on its context, ranging from text to discrete actions in an environment.

If you enjoyed the video, please consider subscribing, and let me know if you like this kind of news video. I will definitely do more.

This is possible because of their tokenization process. Tokenization is when you prepare your inputs for the model, as models do not understand text or images by themselves. Language models and Gato take a fixed vocabulary of subwords, for example 32,000, and each subword has a number assigned to it. For images, they follow the ViT patch embedding using a widely used ResNet block, as we covered in a previous video. They also tokenize the button presses as integer numbers for Atari games, or discrete values. Finally, for continuous values like the proprioceptive inputs we talked about with the robotic arms, they encode the different tracked metrics into float numbers and add them after the text tokens. Using all those
different inputs, the agent adapts to the current task to generate appropriate outputs. During training, they use prompt conditioning, as in GPT-3, with previously sampled actions and observations.

The progress in generalist RL agents in the last years has been incredible and came mainly from DeepMind. One could see that they are moving the needle closer to general AI, or human-level intelligence, if we can finally define it. I love how many details they gave in their paper, and I'm excited to see what they will do, or what other people will do, using this model's architecture. The link to the paper for more information about the model is in the description.

I hope you enjoyed this short video. I just saw this news when I woke up, and I couldn't do anything else than make this video before starting my day. It's just too exciting. I will see you next week with another amazing paper!
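
To make the tokenization step from the transcript more concrete, here is a minimal sketch of how text, continuous, and discrete inputs could end up in a single sequence of integers. It is not DeepMind's code. The 32,000 subword vocabulary comes from the video; the 1,024 bins, the mu-law constants, and the idea of discretizing the float values into the ID range right after the text vocabulary are assumptions based on my reading of the paper, and every function name is hypothetical. Text is assumed to be already converted to subword IDs, and image patches are left out.

```python
import math

TEXT_VOCAB_SIZE = 32_000  # subword vocabulary size mentioned in the video
NUM_BINS = 1_024          # assumption: number of bins for continuous values
MU, M = 100.0, 256.0      # assumption: mu-law companding constants


def mu_law(x: float) -> float:
    """Compress a real value with mu-law companding, then clip to [-1, 1]."""
    y = math.copysign(math.log(abs(x) * MU + 1.0) / math.log(M * MU + 1.0), x)
    return max(-1.0, min(1.0, y))


def tokenize_continuous(values):
    """Map floats (e.g. joint angles) to integer IDs placed after the text vocabulary."""
    tokens = []
    for v in values:
        y = mu_law(v)
        # Uniform binning of [-1, 1]; the top edge falls into the last bin.
        bin_index = min(int((y + 1.0) / 2.0 * NUM_BINS), NUM_BINS - 1)
        tokens.append(TEXT_VOCAB_SIZE + bin_index)
    return tokens


def tokenize_discrete(actions):
    """Discrete values such as Atari button presses are used as integers directly."""
    return [int(a) for a in actions]


def build_sequence(text_token_ids, proprio_values, button_presses):
    """Concatenate all modalities into one flat list of integer tokens."""
    if any(not 0 <= t < TEXT_VOCAB_SIZE for t in text_token_ids):
        raise ValueError("text token IDs must lie in [0, TEXT_VOCAB_SIZE)")
    return (
        list(text_token_ids)
        + tokenize_continuous(proprio_values)
        + tokenize_discrete(button_presses)
    )


if __name__ == "__main__":
    # Hypothetical inputs: two subword IDs, three joint readings, one button press.
    print(build_sequence([5, 17], [0.0, -0.4, 2.5], [3]))
```

The point of the sketch is only that every modality ends up as integers in one flat sequence, which is what lets a single transformer with a single set of weights consume all of them.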

Watch more on YouTube: https://www.youtube.com/c/WhatsAI


Deepmind May Have Just Created the World's First General AI
