Gato from DeepMind was just published! It is a single transformer that can play Atari games, caption images, chat with people, control a real robotic arm, and more! Indeed, it is trained once and uses the same weights to achieve all those tasks. And as per DeepMind, this is not only a transformer but also an agent. This is what happens when you mix Transformers with progress on multi-task reinforcement learning agents.
As we said, Gato is a multi-modal agent, meaning that it can create captions for images or answer questions as a chatbot. You’d say that GPT-3 can already do that, but Gato can do more… The multi-modality comes from the fact that Gato can also play Atari games at the human level or even do real-world tasks like controlling robotic arms to move objects precisely. It understands words, images, and even physics...
►Read the full article: https://www.louisbouchard.ai/deepmind-gato/
►DeepMind's blog post: https://www.deepmind.com/publications/a-generalist-agent
►Paper: Reed, S. et al., 2022, DeepMind, "A Generalist Agent": https://storage.googleapis.com/deepmind-media/A%20Generalist%20Agent/Generalist%20Agent.pdf
►My Newsletter (A new AI application explained weekly to your emails!): https://www.louisbouchard.ai/newsletter/
Gato from DeepMind was just published! It's a single transformer that can play Atari games, caption images, chat with people, control a real robotic arm, and more. Indeed, it's trained once and uses the same weights to achieve all those tasks. And as per DeepMind, this is not only a transformer but also an agent. This is what happens when you mix Transformers with progress on multi-task reinforcement learning agents.

As we said, Gato is a multi-modal agent, meaning that it can create captions for images or answer questions as a chatbot. You'd say that GPT-3 can already do that, but Gato can do more. The multi-modality comes from the fact that Gato can also play Atari games at the human level or even do real-world tasks like controlling robotic arms to move objects precisely. It understands words, images, and even physics.

Gato is the first generalist model that performs so well on so many different tasks, and it's extremely promising for the field. It was trained on 604 distinct tasks with varying modalities, observations, and action specifications, making it the perfect generalist. And, as I said, it does all that with the same network and weights. And before you ask: it only needs 1.2 billion parameters, compared to GPT-3, which requires 175 billion of them. It's not a trap where you have to retrain or fine-tune it for each task. You can send both an image and text and it will work. You can even add in a few movements from a robot arm. The model can decide which type of output to provide based on its context, ranging from text to discrete actions in an environment.
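To picture that, here is a minimal sketch of what such a single-sequence loop could look like. Everything in it (`model.sample`, the tokenizer callbacks, the Gym-style `env`) is a hypothetical stand-in to illustrate the idea, not DeepMind's actual API:

```python
def run_episode(model, env, tokenize_obs, detokenize_action,
                prompt_tokens, max_steps=100):
    """Hypothetical deployment loop: one token sequence holds everything,
    and the sampled tokens are read back as whatever the task needs
    (text for a chatbot, discrete actions for an Atari game or a robot)."""
    context = list(prompt_tokens)               # e.g. tokens from a demonstration
    obs = env.reset()
    for _ in range(max_steps):
        context += tokenize_obs(obs)            # observation -> tokens
        action_tokens = model.sample(context)   # autoregressive next tokens
        context += action_tokens                # actions join the same sequence
        obs, reward, done, info = env.step(detokenize_action(action_tokens))
        if done:
            break
    return context
```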
If you enjoyed the video, please consider subscribing, and let me know if you like this kind of news video; I'll definitely do more.
This is possible because of their tokenization process. Tokenization is when you prepare your inputs for the model, as it does not understand text or images by itself. Language models like Gato take a total number of sub-words, for example 32,000, and each sub-word has a number assigned to it. For images, they follow the ViT patch embedding, using a widely used ResNet block, as we covered in a previous video. They also tokenized the button presses for Atari games, or other discrete values, as integer numbers. Finally, for continuous values, like the proprioceptive inputs we talked about with the robotic arms, they encoded the values into float numbers and added them after the text tokens.
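To make that concrete, here is a minimal Python sketch of this per-modality tokenization. It assumes the details reported in the paper (SentencePiece-style sub-words, 16x16 ViT patches, 1,024 bins and the mu-law constants mu=100, M=256); `sp_model` is a hypothetical tokenizer object, and this is an illustration, not DeepMind's code:

```python
import numpy as np

VOCAB_TEXT = 32_000   # sub-word ids occupy [0, 32000), as mentioned above
NUM_BINS = 1_024      # bin count for continuous values (from the paper)

def tokenize_text(text, sp_model):
    # Sub-word tokenization: each piece maps to an integer id in [0, VOCAB_TEXT).
    # `sp_model` stands in for any SentencePiece-style tokenizer.
    return sp_model.encode(text)

def tokenize_image(image):
    # ViT-style patch split: cut the image into 16x16 patches. Each patch is
    # later embedded with a ResNet block rather than mapped to a vocabulary
    # id, so we return raw patches here.
    h, w, c = image.shape
    patches = image.reshape(h // 16, 16, w // 16, 16, c)
    return patches.transpose(0, 2, 1, 3, 4).reshape(-1, 16 * 16 * c)

def tokenize_discrete(actions):
    # Discrete inputs such as Atari button presses are already small
    # integers, so they are used as token ids directly.
    return list(actions)

def tokenize_continuous(values):
    # Continuous inputs (e.g. proprioception): mu-law encode to [-1, 1],
    # discretize into NUM_BINS uniform bins, then shift the ids past the
    # text vocabulary so they never collide with sub-words.
    values = np.asarray(values, dtype=np.float32)
    mu, m = 100.0, 256.0
    squashed = np.sign(values) * np.log(np.abs(values) * mu + 1.0) / np.log(m * mu + 1.0)
    squashed = np.clip(squashed, -1.0, 1.0)
    bins = ((squashed + 1.0) / 2.0 * (NUM_BINS - 1)).astype(int)
    return (bins + VOCAB_TEXT).tolist()
```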
Using all those different inputs, the agent adapts to the current task to generate appropriate outputs. During training, they use prompt conditioning, as in GPT-3, with previously sampled actions and observations.
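Concretely, prompt conditioning here means prepending tokens from a previously recorded episode of the same task, so the model can infer what it is being asked to do. A minimal sketch, assuming a per-task store of already-tokenized episodes; the probability and window length are illustrative:

```python
import random

def training_sequence(task_episodes, max_len=1024, prompt_prob=0.25):
    # `task_episodes` is a hypothetical list of token lists for one task.
    # With some probability, prepend tokens from another episode of the same
    # task (previously sampled observations and actions), so the model can
    # tell which task the current tokens belong to.
    current = random.choice(task_episodes)
    sequence = list(current)
    if random.random() < prompt_prob:
        prompt = random.choice(task_episodes)
        sequence = prompt[-max_len // 2:] + sequence
    return sequence[:max_len]
```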
The progress in generalist RL agents in the last few years has been incredible, and it came mainly from DeepMind. One can see that they are moving the needle closer to general AI, or human-level intelligence, if we can finally define it. I love how many details they gave in their paper, and I'm excited to see what they will do, or what other people will do, using this model's architecture. The link to the paper, for more information about the model, is in the description. I hope you enjoyed this short video. I just saw this news when I woke up, and I couldn't do anything other than make this video before starting my day; it's just too exciting. I will see you next week with another amazing paper!