DeepMind's Gato Shows How a Single AI Can Learn Everything at Once

Authors: Scott Reed, Konrad Żołna, Emilio Parisotto, Sergio Gómez Colmenarejo, Alexander Novikov, Gabriel Barth-Maron, Mai Giménez, Yury Sulsky, Jackie Kay, Jost Tobias Springenberg, Tom Eccles, Jake Bruce, Ali Razavi, Ashley Edwards, Nicolas Heess, Yutian Chen, Raia Hadsell, Oriol Vinyals, Mahyar Bordbar, Nando de Freitas

Abstract

Inspired by progress in large-scale language modeling, we apply a similar approach towards building a single generalist agent beyond the realm of text outputs. The agent, which we refer to as Gato, works as a multi-modal, multi-task, multi-embodiment generalist policy. The same network with the same weights can play Atari, caption images, chat, stack blocks with a real robot arm and much more, deciding based on its context whether to output text, joint torques, button presses, or other tokens. In this report we describe the model and the data, and document the current capabilities of Gato.

1 Introduction

There are significant benefits to using a single neural sequence model across all tasks. It reduces the need for hand-crafting policy models with appropriate inductive biases for each domain. It increases the amount and diversity of training data, since the sequence model can ingest any data that can be serialized into a flat sequence. Furthermore, its performance continues to improve even at the frontier of data, compute and model scale (Kaplan et al., 2020; Hoffmann et al., 2022). Historically, generic models that are better at leveraging computation have also tended to overtake more specialized domain-specific approaches eventually (Sutton, 2019).

In this paper, we describe the current iteration of a general-purpose agent which we call Gato, instantiated as a single, large, transformer sequence model.
With a single set of weights, Gato can engage in dialogue, caption images, stack blocks with a real robot arm, outperform humans at playing Atari games, navigate in simulated 3D environments, follow instructions, and more.

While no agent can be expected to excel in all imaginable control tasks, especially those far outside of its training distribution, we here test the hypothesis that training an agent which is generally capable on a large number of tasks is possible; and that this general agent can be adapted with little extra data to succeed at an even larger number of tasks. We hypothesize that such an agent can be obtained through scaling data, compute and model parameters, continually broadening the training distribution while maintaining performance, towards covering any task, behavior and embodiment of interest. In this setting, natural language can act as a common grounding across otherwise incompatible embodiments, unlocking combinatorial generalization to new behaviors.

We focus our training at the operating point of model scale that allows real-time control of real-world robots, currently around 1.2B parameters in the case of Gato. As hardware and model architectures improve, this operating point will naturally increase the feasible model size, pushing generalist models higher up the scaling law curve. For simplicity Gato was trained offline in a purely supervised manner; however, in principle, there is no reason it could not also be trained with either offline or online reinforcement learning (RL).

2 Model

The guiding design principle of Gato is to train on the widest variety of relevant data possible, including diverse modalities such as images, text, proprioception, joint torques, button presses, and other discrete and continuous observations and actions. To enable processing this multi-modal data, we serialize all data into a flat sequence of tokens. In this representation, Gato can be trained and sampled from in the same way as a standard large-scale language model.
During deployment, sampled tokens are assembled into dialogue responses, captions, button presses, or other actions based on the context. In the following subsections, we describe Gato's tokenization, embedding, training, and deployment.

2.1 Tokenization

There are infinitely many possible ways to transform data into tokens, including directly using the raw underlying byte stream. Below we report the tokenization scheme we found to produce the best results for Gato at the current scale using contemporary hardware and model architectures.

• Text is encoded via SentencePiece (Kudo & Richardson, 2018) with 32000 subwords into the integer range $[0, 32000)$.
• Images are first transformed into sequences of non-overlapping $16 \times 16$ patches in raster order, as done in ViT (Dosovitskiy et al., 2020). Each pixel in the image patches is then normalized to $[-1, 1]$ and divided by the square root of the patch size (i.e. $\sqrt{16} = 4$).
• Discrete values, e.g. Atari button presses, are flattened into sequences of integers in row-major order. The tokenized result is a sequence of integers within the range $[0, 1024)$.
• Continuous values, e.g. proprioceptive inputs or joint torques, are first flattened into sequences of floating point values in row-major order. The values are mu-law encoded to the range $[-1, 1]$ if not already there (see Figure 14 for details), then discretized into 1024 uniform bins. The discrete integers are then shifted to the range $[32000, 33024)$.

After converting data into tokens, we use the following canonical sequence ordering.

• Text tokens in the same order as the raw input text.
• Image patch tokens in raster order.
• Tensors in row-major order.
• Nested structures in lexicographical order by key.
• Agent timesteps as observation tokens followed by a separator, then action tokens.
• Agent episodes as timesteps in time order.
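The continuous-value scheme above can be sketched in a few lines. This is a minimal illustration, not the authors' implementation: the mu-law parameters `MU` and `M` are assumed values (this section only points to Figure 14), and the function names are hypothetical.

```python
import math

TEXT_VOCAB_SIZE = 32000  # text tokens occupy [0, 32000)
NUM_BINS = 1024          # continuous values occupy [32000, 33024)
MU = 100.0               # assumed mu-law parameter, not stated in this section
M = 256.0                # assumed mu-law scale, not stated in this section


def mu_law_encode(x, mu=MU, m=M):
    """Compress x with a mu-law curve, then clip the result to [-1, 1]."""
    y = math.copysign(math.log(abs(x) * mu + 1.0) / math.log(m * mu + 1.0), x)
    return max(-1.0, min(1.0, y))


def tokenize_continuous(values):
    """Map floats to integer tokens in [32000, 33024)."""
    tokens = []
    for x in values:
        y = mu_law_encode(x)
        # Uniform bins over [-1, 1]; the upper edge falls into the last bin.
        bin_index = min(NUM_BINS - 1, int((y + 1.0) / 2.0 * NUM_BINS))
        tokens.append(TEXT_VOCAB_SIZE + bin_index)
    return tokens


def detokenize_continuous(tokens, mu=MU, m=M):
    """Approximate inverse: take the bin centre, then invert the mu-law curve."""
    values = []
    for t in tokens:
        y = (t - TEXT_VOCAB_SIZE + 0.5) / NUM_BINS * 2.0 - 1.0
        x = math.copysign((math.exp(abs(y) * math.log(m * mu + 1.0)) - 1.0) / mu, y)
        values.append(x)
    return values
```

Shifting by 32000 keeps continuous tokens from colliding with the text vocabulary, and the inverse mapping is what a deployment loop would need to turn sampled action tokens back into real-valued commands.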
Further details on tokenizing agent data are presented in the supplementary material (Section B).

2.2 Embedding input tokens and setting output targets

After tokenization and sequencing, we apply a parameterized embedding function $f(\cdot; \theta_e)$ to each token (i.e. it is applied to both observations and actions) to produce the final model input. To enable efficient learning from our multi-modal input sequence $s_{1:L}$, the embedding function performs different operations depending on the modality the token stems from:

• Tokens belonging to text, discrete- or continuous-valued observations or actions for any time-step are embedded via a lookup table into a learned vector embedding space. Learnable position encodings are added for all tokens based on their local token position within their corresponding time-step.
• Tokens belonging to image patches for any time-step are embedded using a single ResNet (He et al., 2016) block to obtain a vector per patch. For image patch token embeddings, we also add a learnable within-image position encoding vector.

We refer to appendix Section C.3 for full details on the embedding function.

As we model the data autoregressively, each token is potentially also a target label given the previous tokens. Text tokens, discrete and continuous values, and actions can be directly set as targets after tokenization. Image tokens and agent non-textual observations are not currently predicted in Gato, although that may be an interesting direction for future work. Targets for these non-predicted tokens are set to an unused value and their contribution to the loss is masked out.
2.3 Training

Given a sequence of tokens $s_{1:L}$ and parameters $\theta$, we model the data using the chain rule of probability:

$$\log p_\theta(s_1, \ldots, s_L) = \sum_{l=1}^{L} \log p_\theta(s_l \mid s_1, \ldots, s_{l-1})$$

Let $b$ index a training batch of sequences $\mathcal{B}$. We define a masking function $m$ such that $m(b, l) = 1$ if the token at index $l$ is either from text or from the logged action of an agent, and $0$ otherwise. The training loss for a batch $\mathcal{B}$ can then be written as

$$\mathcal{L}(\theta, \mathcal{B}) = -\sum_{b=1}^{|\mathcal{B}|} \sum_{l=1}^{L} m(b, l) \log p_\theta\left(s_l^{(b)} \mid s_1^{(b)}, \ldots, s_{l-1}^{(b)}\right)$$

As described above, Gato's network architecture has two main components: the parameterized embedding function, which transforms tokens to token embeddings, and the sequence model, which outputs a distribution over the next discrete token. While any general sequence model can work for next-token prediction, we chose a transformer (Vaswani et al., 2017) for simplicity and scalability. Gato uses a 1.2B parameter decoder-only transformer with 24 layers, an embedding size of 2048, and a post-attention feedforward hidden size of 8196 (more details in Section C.1).

Because distinct tasks within a domain can share identical embodiments, observation formats and action specifications, the model sometimes needs further context to disambiguate tasks. Rather than providing, e.g., one-hot task identifiers, we instead take inspiration from (Sanh et al., 2022; Wei et al., 2021; Brown et al., 2020) and use prompt conditioning. During training, for 25% of the sequences in each batch, a prompt sequence is prepended, coming from an episode generated by the same source agent on the same task. Half of the prompt sequences are from the end of the episode, acting as a form of goal conditioning for many domains, and the other half are uniformly sampled from the episode. During evaluation, the agent can be prompted using a successful demonstration of the desired task.
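The masked objective of Section 2.3 can be illustrated with a small sketch. It assumes the per-token log-probabilities have already been computed by some model, and the token-kind labels (`"text"`, `"action"`, `"image"`, `"observation"`) are hypothetical names introduced only for this example.

```python
import math

# Hypothetical labels for where a token came from; only these kinds are targets.
PREDICTED_KINDS = {"text", "action"}


def build_mask(token_kinds):
    """Return m(l) = 1 for text or logged-action tokens, 0 otherwise."""
    return [1 if kind in PREDICTED_KINDS else 0 for kind in token_kinds]


def masked_nll(log_probs, mask):
    """Sum of -log p over all positions where the mask is 1.

    log_probs[b][l] is the log-probability the model assigns to token l of
    sequence b given the preceding tokens; mask[b][l] is m(b, l).
    """
    total = 0.0
    for seq_log_probs, seq_mask in zip(log_probs, mask):
        for log_p, m in zip(seq_log_probs, seq_mask):
            total -= m * log_p
    return total


# One sequence: an image-patch token followed by two action tokens.
kinds = ["image", "action", "action"]
log_probs = [[math.log(0.5), math.log(0.25), math.log(0.5)]]
loss = masked_nll(log_probs, [build_mask(kinds)])
```

Only the two action tokens contribute here, so the image token still conditions later predictions but is never itself a prediction target.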
Training of the model is performed on a 16x16 TPU v3 slice for 1M steps with batch size 512 and token sequence length $L = 1024$, which takes about 4 days. Architecture details can be found in Section C. Since agent episodes and documents can easily contain many more tokens than fit into context, we randomly sample subsequences of $L$ tokens from the available episodes. Each batch mixes subsequences approximately uniformly over domains (e.g. Atari, MassiveWeb, etc.), with some manual upweighting of larger and higher quality datasets (see Table 1 in Section 3 for details).

2.4 Deployment

Deploying Gato as a policy is illustrated in Figure 3. First a prompt, such as a demonstration, is tokenized, forming the initial sequence. By default, we take the first 1024 tokens of the demonstration. Next, the environment yields the first observation, which is tokenized and appended to the sequence. Gato samples the action vector autoregressively, one token at a time. Once all tokens comprising the action vector have been sampled (determined by the action specification of the environment), the action is decoded by inverting the tokenization procedure described in Section 2.1. This action is sent to the environment, which steps and yields a new observation. The procedure repeats. The model always sees all previous observations and actions in its context window of 1024 tokens. We found it beneficial to use transformer XL memory during deployment, although it was not used during training (Dai et al., 2019).

3 Datasets

Gato is trained on a large number of datasets comprising agent experience in both simulated and real world environments, as well as a variety of natural language and image datasets. The datasets we use and their attributes are listed in Table 1. The approximate number of tokens per control dataset is computed assuming the tokenization mechanism described in Section 2.1.

3.1 Simulated control tasks

Our control tasks consist of datasets generated by specialist SoTA or near-SoTA reinforcement learning agents trained on a variety of different environments.
For each environment we record a subset of the experience the agent generates (states, actions, and rewards) while it is training.

The simulated environments include Meta-World (Yu et al., 2020), introduced to benchmark meta-reinforcement learning and multi-task learning; Sokoban (Racanière et al., 2017), proposed as a planning problem; BabyAI (Chevalier-Boisvert et al., 2018), for language instruction following in grid-worlds; the DM Control Suite (Tunyasuvunakool et al., 2020), for continuous control; and DM Lab (Beattie et al., 2016), designed to teach agents navigation and 3D vision from raw pixels with an egocentric viewpoint. We also use the Arcade Learning Environment (Bellemare et al., 2013) with classic Atari games (we use two sets of games that we call ALE Atari and ALE Atari Extended; see Section F.1 for details).

We also include the Procgen Benchmark (Cobbe et al., 2020) and Modular RL (Huang et al., 2020). We also include four tasks using a simulated Kinova Jaco arm from DM Manipulation Playground, as introduced in Zolna et al. (2020). Section F includes a more in-depth description of these control tasks, along with what RL agent was used to generate the data.
Training uses episodes filtered by return relative to the expert return for the task. The expert return is defined as the maximum over all windowed average returns calculated over the collected episodes for a task:

$$\max_{j \in \{0, 1, \ldots, N - W\}} \frac{1}{W} \sum_{i=j}^{j+W-1} R_i$$

where $N$ is the total number of collected episodes for the task, $W$ is the window size, and $R_i$ is the total return for episode $i$. To obtain accurate estimates, in practice we set $W$ to 10% of the total data amount or a minimum of 1000 episodes (i.e. $W = \min(1000, 0.1 \times N)$).

3.2 Vision and language

Gato is trained on MassiveText (Rae et al., 2021), a collection of large English-language text datasets from multiple sources: web pages, books, news articles, and code.

We also included several vision-language datasets in Gato's training. ALIGN (Jia et al., 2021) consists of 1.8B images and their alternative text (alt-text) annotations. LTIP (Long Text & Image Pairs) consists of 312 million images with captions (Alayrac et al., 2022). Conceptual Captions (Sharma et al., 2018) and COCO Captions (Chen et al., 2015) are captioning datasets with 3.3M and 120k image-text pairs respectively. The MultiModal MassiveWeb (M3W) dataset (Alayrac et al., 2022) includes 43M webpages where both text and images were extracted. We also included visual question-answering datasets, in particular OKVQA (Marino et al., 2019) and VQAv2 (Antol et al., 2015), with 9K and 443K triplets of images, questions, and answers. To form a training episode from these, we sample five (image, text) pairs, tokenize them, concatenate, and then pad or randomly crop to the required training sequence length.
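The episode construction for vision-language data described above (sample five pairs, concatenate, then pad or randomly crop) might look roughly like the following. This is a sketch under stated assumptions: each pair is taken to be an already-tokenized list of integers, the padding id `PAD_TOKEN` is hypothetical, and sampling without replacement is a choice made here, not something the text specifies.

```python
import random

SEQ_LEN = 1024  # training sequence length used in the text
PAD_TOKEN = 0   # hypothetical padding id; the text does not name one


def sample_pairs(dataset, k=5, rng=random):
    """Pick k tokenized (image, text) pairs (assumed: without replacement)."""
    return rng.sample(dataset, k)


def make_training_sequence(pairs, seq_len=SEQ_LEN, rng=random):
    """Concatenate tokenized pairs, then pad or randomly crop to seq_len."""
    flat = [token for pair in pairs for token in pair]
    if len(flat) < seq_len:
        # Too short: pad on the right up to the training length.
        return flat + [PAD_TOKEN] * (seq_len - len(flat))
    # Long enough: keep a random contiguous window of seq_len tokens.
    start = rng.randint(0, len(flat) - seq_len)  # randint bounds are inclusive
    return flat[start:start + seq_len]
```

Passing a seeded `random.Random` instance as `rng` makes the crop reproducible, which is convenient when debugging a data pipeline.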
3.3 Robotics - RGB Stacking Benchmark (real and sim)

As a testbed for taking physical actions in the real world, we chose the robotic block stacking environment introduced by Lee et al. (2021). The environment has 3-DoF cartesian velocity control, an additional DoF for velocity, and a discrete gripper action. The robot's workspace contains three plastic blocks colored red, green and blue with varying shapes. The available observations include 128 × 128 camera images, robot arm and gripper joint angles, as well as the robot's end-effector pose.

We used several sources of training data for these tasks. In Skill Generalization, for both simulation and real, we use data collected by the best generalist sim2real agent from Lee et al. (2021). We collected data only when interacting with the designated RGB-stacking training objects (this amounts to a total of 387k successful trajectories in simulation and 15k trajectories in real). For Skill Mastery we used data from the best per-group experts from Lee et al. (2021) in simulation and from the best sim2real policy on the real robot (amounting to a total of 219k trajectories). Note that this data is only included for the specific Skill Mastery experiments in Section 5.4.

4 Capabilities of the generalist agent

In this section, we summarize the performance of Gato when trained on the data described above. That is, all results across all tasks are derived from a single pretrained model with a single set of weights. Results with fine-tuning will be presented in Section 5.
4.1 Simulated control tasks

Figure 5 shows the number of distinct control tasks for which Gato performs above a given score threshold, relative to expert performance demonstrated in Gato's training data. We report performance as a percentage, where 100% corresponds to the per-task expert and 0% to a random policy. For each simulated control task we trained our model on, we roll out the Gato policy on the corresponding environment 50 times and average the defined scores. As shown in Figure 5, Gato performs over 450 out of 604 tasks at over a 50% expert score threshold.

In ALE Atari (Bellemare et al., 2013), Gato achieves the average human (or better) scores for 23 Atari games, achieving over twice the human score for 11 games. While the single-task online RL agents which generated the data still outperform Gato, this may be overcome by adding capacity or using offline RL training rather than purely supervised training (see Section 5.5, where we present a specialist single-domain ALE Atari agent achieving better than human scores for 44 games).

On BabyAI (Chevalier-Boisvert et al., 2018), Gato achieves over 80% of expert score for nearly all levels. For the most difficult task, called BossLevel, Gato scores 75%. The two other published baselines we could find, BabyAI 1.0 and BabyAI 1.1 (Hui et al., 2020), scored 77% and 90%, respectively, having trained on this single task alone using a million demonstrations.

On Meta-World (Yu et al., 2020), Gato achieves more than 50% for all 44 out of 45 tasks that we trained on, over 80% for 35 tasks, and over 90% for 3 tasks. On canonical DM Control Suite (Tassa et al., 2018), Gato achieves better than 50% of the expert score on 21 out of 30 tasks from state, and more than 80% for 18 tasks.

4.2 Robotics

First person teleoperation enables the collection of expert demonstrations. However, such demonstrations are slow and costly to collect.
Data-efficient behavior cloning methods are therefore desirable for training a generalist robot manipulator, and offline pretraining is thus a well-motivated area of research. To that end, we evaluated Gato on the established RGB Stacking benchmark for robotics.

Skill Generalization

The Skill Generalization challenge from the RGB Stacking robotics benchmark tests the agent's ability to stack objects of previously unseen shapes. The agent is trained on a dataset consisting of episodes of the robot stacking objects with a variety of different shapes. Five triplets of object shapes are, however, not included in the training data and serve as test triplets. We evaluated the trained generalist for 200 episodes per test triplet on the real robot. Table 2 shows that our generalist agent's success rate on each test triplet is comparable to the single-task BC-IMP (filtered BC) baseline in Lee et al. (2021).

4.3 Text samples

The model demonstrates rudimentary dialogue and image captioning capabilities. Figure 6 contains representative samples of Gato's image captioning performance. Figure 7 shows some hand-picked examples of plain text dialogue exchange.

5 Analysis

5.1 Scaling Laws Analysis

In Figure 8, we analyze the aggregate in-distribution performance of the pretrained model as a function of the number of parameters, in order to get insight into how performance could improve with increased model capacity. We evaluated 3 different model sizes (measured in parameter count): a 79M model, a 364M model, and a 1.18B model (Gato). We refer to Section C for details on the three model architectures.

Here, for all three model sizes we plot the normalized return as training progresses. To get this single value, for each task we calculate the performance of the model as a percentage of expert score (the same as done in Section 4.1). Then for each domain listed in Table 1 we average the percentage scores across all tasks for that domain. Finally, we mean-aggregate the percentage scores across all domains.
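The aggregation just described can be sketched as follows. This is an illustrative reading, not the authors' evaluation code: the function names are hypothetical, and the per-task normalization follows the convention stated in the simulated-control results (100% for the per-task expert, 0% for a random policy).

```python
def percent_of_expert(agent_return, random_return, expert_return):
    """Scale a return so that random maps to 0% and the expert to 100%."""
    return 100.0 * (agent_return - random_return) / (expert_return - random_return)


def aggregate_score(scores_by_domain):
    """Average task scores within each domain, then average across domains."""
    domain_means = [sum(scores) / len(scores) for scores in scores_by_domain.values()]
    return sum(domain_means) / len(domain_means)


# Two domains with different numbers of tasks (made-up numbers for illustration).
example = {
    "domain_a": [50.0, 100.0],
    "domain_b": [0.0],
}
overall = aggregate_score(example)
```

Averaging per domain first means each domain carries equal weight in the final number, so a domain with hundreds of tasks does not dominate one with only a handful.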
We can see that for an equivalent token count, there is a significant performance improvement with increased scale.

5.2 Out of distribution tasks

In this section we want to answer the following question: Can our agent be used to solve a completely new task efficiently? For this reason, we held out all data for four tasks from our pre-training set: cartpole.swingup (DM Control Suite domain), assembly-v2 (Meta-World domain), order_of_apples_forage_simple (DM Lab domain), and boxing (ALE Atari domain). These four tasks will serve as testbeds for evaluating the out-of-distribution capabilities of Gato.

Ideally, the agent could learn to adapt to a new task via conditioning on a prompt including demonstrations of desired behaviour. However, due to accelerator memory constraints and the extremely long sequence lengths of tokenized demonstrations, the maximum context length possible does not allow the agent to attend over an informative-enough context. Therefore, to adapt the agent to new tasks or behaviours, we choose to fine-tune the agent's parameters on a limited number of demonstrations of a single task, and then evaluate the fine-tuned model's performance in the environment. Fine-tuning is very similar to pretraining with minor changes, such as a different learning rate schedule; see Section E for details.

We want to measure how the choice of data used during pretraining influences post-fine-tuning performance. To this end, we compare Gato (trained on all data) to variants trained on ablated datasets:

1. A model pretrained only on data from the same domain as the task to be fine-tuned on (same domain only data).
2. A model pretrained only on non-control data (no control data).
3. A model fine-tuned from scratch, i.e. no pretraining at all (scratch).
Considering that all these experiments require training a new model from scratch and then also fine-tuning, we present results using the less compute-intensive 364M parameter architecture described in Section 5.1. Results are shown in Figure 9.

Fine-tuning performance on the cartpole.swingup and assembly-v2 tasks, neither of which requires image processing, shows similar trends. Pretraining on all the datasets yields the best results, followed by pretraining on the same domain only. This difference is smaller for assembly-v2 but consistent across all few-shot datasets. For these non-image-based environments, we see either no benefit (cartpole.swingup) or even negative transfer (assembly-v2) when pretraining on the no control datasets, which only contain image and text data.

Results for DM Lab order_of_apples_forage_simple are slightly different. Pretraining on DM Lab data only is already enough to approach the maximum reward of 19, and hence there is no observable benefit of adding data from different environments. What is different when compared to the previously analysed no-vision environments is that pretraining on no control data helps, which can possibly be explained by the fact that agents in the DM Lab environment are fed images which, despite being simulated, are natural looking. Therefore, transfer from image captioning or visual grounded question answering tasks is possible.

We were not able to observe any benefit from pretraining on boxing. The randomly initialized model seems to work better than any of the pretrained variants considered. We hypothesise that this is caused by the game's input images being visually very distinct from the other data, suggesting transfer is difficult. We discuss this Atari challenge further in our related work section.
5.3 Fine-tuning on Robotic Stacking Tasks

Section 4.2 demonstrates that the base Gato, capable of a diverse array of tasks, can perform competitively on the RGB Stacking Skill Generalization benchmark. In this section, we would like to answer the following question: How does our agent improve on robotics tasks when allowed to fine-tune, similarly to how we fine-tune on new tasks in Section 5.2? We consider different model sizes and analyse the impact of pretraining datasets on the Skill Generalization benchmark, as well as a novel out-of-distribution task. Further analysis of fine-tuning with dataset ablations is in Appendix I.

Skill Generalization

First, we would like to show that fine-tuning on object-specific data, similarly to what was done by Lee et al. (2022), is beneficial. Therefore, we fine-tuned Gato separately on five subsets of demonstrations from the test dataset. Each subset was obtained by random partitioning of a test dataset consisting of demonstrations gathered by a generalist sim-to-real agent stacking real test objects. We consider this setting, which is comparable to the fine-tuning baselines on RGB stacking tasks from (Lee et al., 2022), and use the 5k dataset that their behavior cloning 5k results are obtained with. To best match their experiments, we change our return filtering scheme during training: instead of using only successful stacks, we condition on the normalized return of the episode.

Figure 10 compares the success rate of Gato across different fine-tuning data regimes to the sim-to-real expert and a Critic-Regularized Regression (CRR) (Wang et al., 2020) agent trained on 35k episodes of all test triplets. Gato, in both reality and simulation (red curves on the left and right figure, respectively), recovers the expert's performance with only 10 episodes, and peaks at 100 or 1000 episodes of fine-tuning data, where it exceeds the expert. After this point (at 5000), performance degrades slightly but does not drop far below the expert's performance.
Fine-tuning and Model Size

To better understand the benefit of large models for few-shot adaptation in robotics domains, we conducted an ablation on model parameter size. This section focuses on in-simulation evaluation. Figure 10 compares the full 1.18B parameter Gato with the smaller 364M and 79M parameter variants for varying amounts of fine-tuning data. Although the 364M model overfits on one episode, causing performance to drop, there is a clear trend towards better adaptation with fewer episodes as the number of parameters is scaled up. The 79M model performs clearly worse than its bigger counterparts. The results suggest that the model's greater capacity allows it to use representations learned from the diverse training data at test time.

Adaptation to Perceptual Variations

While the Skill Generalization task is an effective benchmark for motor Skill Generalization to shape variations, it does not test the agent's ability to adapt to perceptual variations and permutations in the objective specification. To further evaluate Gato's generalization capabilities, we devised a new task in the RGB stacking benchmark where the goal is to stack the blue object on the green object, for test triplet 1 (see Figure 11). First, we used a 3D mouse to collect 500 demonstrations of this task on the real robot, for a total of 2 hours and 45 minutes of demonstration data, and fine-tuned Gato on these episodes. Notably, all of the simulated and real robotics data in the pretraining set shows the robot successfully stacking the red object on the blue object, and the data does not include the object shapes in the test set. We found that additionally adding simulated demonstrations of the stack blue on green task to the fine-tuning dataset improved performance, and 10% was an ideal sampling ratio for this data.
We achieved a final 60% success rate after evaluating fine-tuned Gato on the real robot, while a BC baseline trained from scratch on the blue-on-green data achieved only 0.5% success (1/200 episodes). Qualitatively, the BC baseline would consistently move towards the blue object and occasionally pick it up and place it on top of the green object, but a full, stable stack was almost never achieved.

5.4 Robotics: Skill Mastery

Similarly to the Skill Generalization challenge discussed in Section 4.2, the Skill Mastery challenge consists of training a robotic arm to stack blocks of different shapes. However, Skill Mastery allows the agent to train on data involving the object shapes used for evaluation, i.e. the test set in Skill Generalization becomes a part of the Skill Mastery training set. Thus, this challenge serves to measure Gato's performance on in-distribution tasks (possibly with initial conditions not seen in the training demonstrations). Our Skill Mastery results use an earlier version of the Gato architecture described in Appendix H, with no fine-tuning.

Table 3 compares the group-wise success percentage and the average success across object groups for Gato and the established BC-IMP baseline. Gato exceeds or closely matches BC-IMP's performance on all but one training triplet.

5.5 Specialist single-domain multi-task agents

In this section we show results obtained with two specialist (rather than generalist) agents. Both of them were trained on data from a single domain only and rolled out 500 times for each training task without any per-task fine-tuning.

Meta-World

The first agent uses the smallest architecture introduced in Section 5.1, i.e. 79M parameters, and is trained on all 50 Meta-World tasks.
While Gato has access to the state of the MuJoCo physics engine and unlimited task seeds, the agent presented here has no access to any extra features or tasks and uses the canonical API as in (Yu et al., 2020). This experiment is to show that the architecture proposed in our paper can be used to obtain state-of-the-art agents also at small scale. The training procedure was to train single-task MPO (Abdolmaleki et al., 2018) experts on each of the MT-50 tasks individually, recording the trajectories produced while training. This experience is then combined, or distilled, into a single agent, which achieves a 96.6% success rate averaged over all 50 tasks. To the best of our knowledge, this agent is the first to accomplish a nearly 100% average success rate simultaneously (multi-task) for this benchmark. See Table 7 in the supplementary material (Section K) for the full list of tasks and corresponding success rates of our agent.

ALE Atari

We also trained a specialist agent on all 51 ALE Atari tasks. As the Atari domain is much more challenging than Meta-World, we used the Gato architecture with 1.18B parameters.

The resulting agent performs better than the average human for 44 games (see Section 4.1 for details on our evaluation and scoring). We want to note that the performance of the online experts used to generate training data for the other 7 games was also below the average human. Hence, the specialist Atari agent achieved better than human performance for all games where data contained super-human episodes.

The specialist Atari agent outperforms our generalist agent Gato, which achieved super-human performance on 23 games. This suggests that scaling Gato may result in even better performance. We, however, purposely restricted Gato's size such that it can be run in real-time on the real robot.
5.6 Attention Analysis

We rendered the transformer attention weights over the image observations for various tasks, to gain a qualitative sense of how Gato attends to different regions of the image across tasks (see Figure 12). Further details and visualizations for more tasks can be found in Appendix J. These visualizations clearly show that attention tracks the task-relevant objects and regions.

5.7 Embedding Visualization

To understand how Gato encodes information differently per task, we visualized per-task embeddings.

We analysed 11 tasks. For each task, we randomly sample 100 episodes and tokenize each of them. Then, from each episode we take a subsequence of 128 tokens, compute their embeddings (at layer 12, which is half the total depth of the transformer layers) and average them over the sequence. The averaged embeddings for all tasks are used as input to PCA, which reduces their dimensionality to 50. Then, T-SNE is used to get the final 2D embeddings.

Figure 13 shows the final T-SNE embeddings plotted in 2D, colorized by task. Embeddings from the same tasks are clearly clustered together, and task clusters from the same domain and modality are also located close to each other. Even the held-out task (cartpole.swingup) is clustered correctly and lies next to another task from DM Control Suite Pixels.

6 Related Work

The most closely related architectures to that of Gato are Decision Transformers (Chen et al., 2021b; Reid et al., 2022; Zheng et al., 2022; Furuta et al., 2021) and Trajectory Transformer (Janner et al., 2021), which showed the usefulness of highly generic LM-like architectures for a variety of control problems. Gato also uses an LM-like architecture for control, but with design differences chosen to support multi-modality, multi-embodiment, large scale and general purpose deployment. Pix2Seq (Chen et al., 2022) also uses an LM-based architecture for object detection. Perceiver IO (Jaegle et al., 2021) uses a transformer-derived architecture specialized for very long sequences, to model any modality as a sequence of bytes.
This and similar architectures could be used to expand the range of modalities supported by future generalist models. (Chen et al., 2021b; Reid et al., د 2022 کال Zheng et al., د 2022 کال Furuta et al. 2021) (Janner et al., 2021), (Chen et al., 2022) (د جینګ او ال) 2021) Gato د GPT-3 په څیر کارونو له خوا اغیزمن شوی and Gopher pushing the limits of generalist language models; and more recently the Flamingo generalist visual language model. developed the 540B parameter Pathways Language Model (PalM) explicitly as a generalist few-shot learner for hundreds of text tasks. (Brown et al., 2020) (د رایو او نورو، 2021), (Alayrac et al., 2022) Chowdhery et al. (2022) په راتلونکي کار کې باید په بحث کې وي چې څنګه د دې متن وړتیاوې په یو بشپړ generalist ایجنټ کې یوځای شي چې په واقعي نړۍ کې، په مختلفو چاپیریالونو او اغیزې کې هم په واقعي وخت کې عمل کولی شي. Gato also takes inspiration from recent works on multi-embodiment continuous control. used message passing graph networks to build a single locomotor controller for many simulated 2D walker variants. showed that transformers can outperform graph based approaches for incom-patible (i.e. varying embodiment) control, despite not encoding any morphological inductive biases. learn a modular policy for multi-task and multi-robot transfer in simulated 2D manipulation environments. train a universal policy conditioned on a vector representation of robot hardware, showing successful transfer both to simulated held out robot arms, and to a real world sawyer robot arm. Huang et al. (2020) Kurin et al. (2020) Devin et al. (2017) Chen et al. (2018) A variety of earlier generalist models have been developed that, like Gato, operate across highly distinct domains and modalities. NPI trained a single LSTM to execute diverse programs such as sorting an array and adding two numbers, such that the network is able to generalize to larger problem instances than those seen during training. 
developed the MultiModel that trains jointly on 8 distinct speech, image and text processing tasks including classifica-tion, image captioning and translation. Modality-specific encoders were used to process text, images, audio and categorical data, while the rest of the network parameters are shared across tasks. proposed “ ”, describing a method for the incremental training of an increasingly general problem solver. proposed controllable multi-task language models that can be directed according to language domain, subdomain, entities, relationships between entities, dates, and task-specific behavior. (Reed & De Freitas, 2016) (Hochreiter & Schmidhuber, 1997) Kaiser et al. (2017) Schmidhuber (2018) د هر څه لپاره یو لوی شبکې د نندارتون او ال. (2019) In this discussion, it is important to distinguish between one single multi-task network architecture versus one single neural network with the same weights for all tasks. Several poplar RL agents achieve good multi-task RL results within single domains such as Atari57 and DMLab However, it is much more common to use the same policy architecture and hyper-parameters across tasks, but the policy parameters are different in each task This is also true of state-of-the-art RL methods applied to board games Moreover, this choice has been adopted by off-line RL benchmarks and recent works on large sequence neural networks for control, including decision transformers and the Trajectory Transformer of In contrast, in this work we learn a single network with the same weights across a diverse set of tasks. (Espeholt et al., 2018; Song et al., 2020; Hessel et al., 2019). (Mnih et al., 2015; Tassa et al., 2018). (Schrittwieser et al., 2020). (Gulcehre et al., 2020; Fu et al., 2020) (Chen et al., 2021b; Reid et al., 2022; Zheng et al., 2022) Janner et al. (2021) په اړه Recent position papers advocate for highly generalist models, notably proposing one big net for everything, and on foundation models. 
However, to our knowledge there has not yet been reported a single generalist trained on hundreds of vision, language and control tasks using modern transformer networks at scale. Schmidhuber (2018) Bommasani et al. (2021) “Single-brain”-style models have interesting connections to neuroscience. famously stated that “ ”. Mountcastle found that columns of neurons in the cortex behave similarly whether associated with vision, hearing or motor control. This has motivated arguments that we may only need one algorithm or model to build intelligence Mountcastle (1978) the processing function of neocortical modules is qualitatively similar in all neocortical regions. Put shortly, there is nothing intrinsically motor about the motor cortex, nor sensory about the sensory cortex (Hawkins & Blakeslee, 2004). Sensory substitution د یو واحد ماډل لپاره یو بل arguments وړاندې کوي For example, it is possible to build tactile visual aids for blind people as follows. The signal captured by a camera can be sent via an electrode array on the tongue to the brain. The visual cortex learns to process and interpret these tactile signals, endowing the person with some form of “vision”. Suggesting that, no matter the type of input signal, the same network can process it to useful effect. (د بیچ-ی ریتا & Kercel، 2003). زموږ کار د عمیق autoregressive ماډلونو پر بنسټ دی، چې د اوږد تاریخ لري او کولای شي د متن، انځورونه، ویډیو او غږ generative ماډلونو کې ونیسئ. has been of enormous impact in language modelling protein folding vision-language models (T code generation dialogue systems with retrieval capabilities speech recognition neural machine translation and more , Recently researchers have explored task decomposition and grounding with language models aswani او د 2017; Devlin et al., 2018) (Brown et al., 2020; Rae et al., د 2021 کال) (Jumper et al., د 2021 کال) simpoukelli et al., 2021; Wang et al., 2021; Alayrac et al., 2022), (Chen et al., 2021c; لی او ال. 
2022b), (Nakano et al., 2021; Thoppilan et al., 2022), (Pratap et al., 2020), (Johnson et al., 2019) (Bommasani et al. 2021). (Huang et al., 2022; Ahn et al., 2022). construct a control architecture, consisting of a sequence tokenizer, a pretrained language model and a task-specific feed-forward network. They apply it to VirtualHome and BabyAI tasks, and find that the inclusion of the pretrained language model improves generalisation to novel tasks. Similarly, ښيي چې د بصری ماډلونه د ځان د څارنې سره مخکښ شوي، په ځانګړي ډول د زراعت سیسټمونو او د تشناب تناوب کولی شي په اغیزمنه توګه په کنترول سیاستونو کې شامل شي. لی او ال. (2022a) Parisi et al. (2022) (He et al., 2020), As mentioned earlier, transfer in Atari is challenging. researched transfer between ran-domly selected Atari games. They found that Atari is a difficult domain for transfer because of pronounced differences in the visuals, controls and strategy among the different games. Further difficulties that arise when applying behaviour cloning to video games like Atari are discussed by Rusu et al. د 2016 کال Kanervisto et al. (2020). There has been great recent interest in data-driven robotics However, note that in robotics “ ”. Moreover, every time we update the hardware in a robotics lab, we need to collect new data and retrain. We argue that this is precisely why we need a generalist agent that can adapt to new embodiments and learn new tasks with few data. (Cabi et al., 2019; Chen et al., 2021a). د نندارتون او ال. (2021) the key stumbling block is collecting the right data. Unlike language and vision data, robotics data is neither plentiful nor representative of a sufficiently diverse array of embodiments, tasks, and environments د خودکشی ماډل په کارولو سره عملونه تولید کول کولی شي له امله د "سایلو دنده" مخنیویونه ته ورسيږي کله چې د بدلونونو په مخنیوی کې دي. For example, sampling actions can condition the model to solve the wrong task when multiple tasks share similar observation and actions specifications. 
As explained in Section we use prompt engineering in ambiguous tasks, conditioning our model on a successful demon-stration. This screens off confounding variables, reducing self-delusions. Another solution which we did not explore in this work is to use counterfactual teaching, where we train a model online using instantaneous expert feedback. We leave this for future investigation. (د فورمه او نورو، 2021). 2, 7 Broader Impact که څه هم عمومي اجناسونه یوازې د څیړنې په پرمختللي سيمه کې شتون لري، دوی د ټولنیز اغیزې ته اړتيا لري د دوی د خطرونو او فوټانونو په پراخه کچه تبادله کړي. د شفافیت لپاره، موږ د ګټو د غوښتنلیک په نمونوي کارت کې د شفافیت په لټه کې شامل دي. However, the tools for mitigating harms of generalist agents are relatively underdeveloped, and require further research before these agents are deployed. A. لکه څنګه چې زموږ generalist agent کولی شي د بصری لغت ماډل په توګه عمل وکړي، دا د ورته نگرانیونو وارث کوي لکه څنګه چې په بحث شوي دي. In addition, generalist agents can take actions in the the physical world; posing new challenges that may require novel mitigation strategies. For example, physical embodiment could lead to users anthropomorphizing the agent, leading to misplaced trust in the case of a malfunctioning system, or be exploitable by bad actors. Additionally, while cross-domain knowledge transfer is often a goal in ML research, it could create unexpected and undesired outcomes if certain behaviors (e.g. arcade game fighting) are transferred to the wrong context. The ethics and safety considerations of knowledge transfer may require substantial new research as generalist systems advance. (Wei-dinger et al., 2021; Bommasani et al., 2021; Rae et al., د 2021 کال Alayrac et al., 2022). Technical AGI safety may also become more challenging when considering generalist agents that operate in many embodiments. 
For this reason, preference learning, uncertainty modeling and value alignment (R are especially important for the design of human-compatible generalist agents. It may be possible to extend some of the value alignment approaches for language to generalist agents. However, even as technical solutions are developed for value alignment, generalist systems could still have negative societal impacts even with the intervention of well-intentioned designers, due to unforeseen circumstances or limited oversight This limitation underscores the need for a careful design and a deployment process that incorporates multiple disciplines and viewpoints. (Bostrom, 2017 کال ussell, د 2019 کال) (د هانګ او نورو، 2022; Kenton et al., 2021) (Amodei et al., د 2016 کال). Understanding how the models process information, and any emergent capabilities, requires significant ex-perimentation. External retrieval has been shown to improve both interpretability and performance, and hence should be consid-ered in future designs of generalist agents. (Borgeaud et al., 2021; Menick et al., 2022; Nakano et al., 2021; Thoppilan et al., 2022) Although still at the proof-of-concept stage, the recent progress in generalist models suggests that safety researchers, ethicists, and most importantly, the general public, should consider their risks and benefits. We are not currently deploying Gato to any users, and so anticipate no immediate societal impact. However, given their potential impact, generalist models should be developed thoughtfully and deployed in a way that promotes the health and vitality of humanity. 8 Limitations and Future work 8.1 RL data collection Gato is a data-driven approach, as it is derived from imitation learning. While natural language or image datasets are relatively easy to obtain from the web, a web-scale dataset for control tasks is not currently available. This may seem at first to be problematic, especially when scaling Gato to a higher number of parameters. 
That said, this issue has already been investigated extensively. Offline RL aims to leverage existing control datasets, and its growing popularity has already led to more diverse and larger datasets becoming available. Richer environments and simulations are being built (e.g. Metaverse), and a growing number of users already interact with them. The possibility of using large-scale observation-only data to improve agents has also been studied (Baker et al., 2022). Thanks to online video sharing and streaming platforms such as YouTube and Twitch, observation-only datasets are not significantly more difficult to collect than natural language datasets, motivating a future research direction to extend Gato to learn from web data.

While the previous paragraph focuses on alleviating drawbacks of data collection from RL agents, it is important to note that this approach presents a different set of tradeoffs compared to scraping web data and can actually be more practical in some situations. Once the simulation is set up and a near-SOTA agent is trained, it can be used to generate massive amounts of high-quality data. This contrasts with web data, which is notorious for its low quality.

In short, we believe that acquiring suitable data is a research question in its own right, and that it is an active area of research with growing momentum and importance.

8.2 Prompt and short context

Gato is prompted with an expert demonstration, which helps the agent output actions corresponding to the given task. This is particularly useful since there is otherwise no task identifier available to the agent (in contrast to many multi-task RL settings). Gato infers the relevant task from the observations and actions in the prompt. However, the context length of our agent is limited to 1024 tokens, which means the agent sometimes attends to only a few environment timesteps in total.
This is especially the case for environments with image observations, where, depending on the resolution, each observation can result in more than one hundred tokens. Hence for certain environments only a short chunk of a demonstration episode fits in the transformer memory. Due to this limited prompt context, preliminary experiments with different prompt structures resulted in very similar performance. Similarly, early evaluations of the model using prompt-based in-context learning on new environments did not show a significant performance improvement compared to prompt-less evaluation in the same setting.

Context length is therefore a current limitation of our architecture, mainly due to the quadratic scaling of self-attention. Many recently proposed architectures enable a longer context at greater efficiency, and these innovations could potentially improve our agent's performance. We hope to explore these architectures in future work.

9 Conclusions

Transformer sequence models are effective as multi-task multi-embodiment policies, including for real-world text, vision and robotics tasks. They also show promise in few-shot out-of-distribution task learning. In the future, such models could be used as a default starting point via prompting or fine-tuning to learn new behaviors, rather than training from scratch.

Given scaling law trends, the performance across all tasks including dialogue will increase with scale in parameters, data and compute. Better hardware and network architectures will allow training bigger models while maintaining real-time robot control capability. By scaling up and iterating on this same basic approach, we can build a useful general-purpose agent.
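The context-length limitation discussed in Section 8.2 can be made concrete with a little arithmetic. The sketch below is illustrative only: the 1024-token context comes from the text, while the per-timestep token counts (144 observation tokens, 1 action token) and the function names are hypothetical values chosen for the example, not Gato's actual tokenization.

```python
def timesteps_in_context(context_len, obs_tokens, action_tokens):
    """Whole environment timesteps that fit in one context window.

    Assumes each timestep is serialized as obs_tokens + action_tokens
    tokens and ignores any separator tokens (a simplifying assumption).
    """
    tokens_per_step = obs_tokens + action_tokens
    return context_len // tokens_per_step


def attention_score_entries(context_len):
    """Entries in one full self-attention score matrix (per head, per layer)."""
    return context_len * context_len


CONTEXT_LEN = 1024  # context length stated in the text

# Hypothetical image-based environment: 144 observation tokens, 1 action token.
steps = timesteps_in_context(CONTEXT_LEN, obs_tokens=144, action_tokens=1)
print(f"timesteps that fit: {steps}")

# Doubling the context quadruples the attention score matrix.
ratio = attention_score_entries(2 * CONTEXT_LEN) / attention_score_entries(CONTEXT_LEN)
print(f"cost ratio for 2x context: {ratio}")
```

Under these assumed counts only seven timesteps fit in the window (7 × 145 = 1015 tokens), which is consistent with the observation that only a short chunk of a demonstration episode can serve as the prompt. The factor of four for a doubled context shows why simply lengthening the window is costly with standard self-attention.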
Acknowledgments

We would like to thank Dan Horgan, Manuel Kroiss, Mantas Pajarskas, and Thibault Sottiaux for their help with data storage infrastructure; Jean-Baptiste Lespiau and Fan Yang for help on concurrent evaluation; Joel Veness for advising on the model design; Koray Kavukcuoglu for helping inspire the project and facilitating feedback; Tom Erez for advising on the agent design and task selection for continuous control; Igor Babuschkin for helping code the initial prototype; Jack Rae for advising on the transformer language model codebase; Thomas Lampe for building robot infrastructure and advising on real robotics experiments; Boxi Wu for input on ethics and safety considerations; Pedro A. Ortega for advice in regard to causality and self-delusion biases.

Author Contributions

Scott Reed developed the project concept, wrote the initial prototype, and led the project overall.
Konrad Żołna led architecture development for vision and text, built infrastructure for tokenization and prompting, and contributed heavily to overall agent development and evaluation.
Emilio Parisotto led work on optimizing the transformer architecture, ran the largest number of experiments, and analyzed scaling law properties and in-distribution agent performance.
Sergio Gómez Colmenarejo was the technical lead, responsible for creating a scalable data loader and evaluator supporting hundreds of tasks at once, and for the initial robot integration with Gato.
Alexander Novikov developed the model including the sampler for the initial prototype, carried out experiments focusing on robotics, and created visualizations.
Gabriel Barth-Maron built scalable storage infrastructure to provide Gato with SoTA-level agent experience in Atari and other domains.
Mai Giménez conducted large scale agent data collection, built substantial data loading infrastructure, and integrated large scale visual-language datasets into the training of Gato.
Yury Sulsky contributed broadly to the Gato codebase including a bespoke distributed training sequence loader, and led the development of benchmarks for out-of-distribution generalization, and the training of competitive baseline agents.
Jackie Kay supported physical robotics infrastructure, conducted numerous evaluations and experiments to analyze the generalization properties of Gato, and contemplated broader ethical impact.
Jost Tobias Springenberg guided Gato’s deployment to the physical robot, provided strong existing baselines for block stacking, and advised on model development and experimental design.
Tom Eccles developed the Gato dialogue and image captioning demonstrations, allowing users to easily probe the vision and language capacities of agents in development.
Jake Bruce contributed to agent design as well as control datasets and environments with randomized physics and morphology variations.
Ali Razavi helped in exploring vision architectures.
Ashley Edwards contributed to the first prototype of Gato that worked on Atari, in addition to exploring alternative network architectures and training objectives.
Nicolas Heess advised on agent design, experiment design and task selection, especially for continuous control applications.
Yutian Chen advised on model design and experiments, and provided feedback in regular meetings.
Raia Hadsell advised on the design and planning of robotics efforts.
Oriol Vinyals advised on all aspects of the project, especially model architecture, training strategies and benchmark design.
Mahyar Bordbar was the primary project manager; eliciting key goals, tracking progress, facilitating presentations and feedback, and coordinating resource planning.
Nando de Freitas oversaw the project from its inception.

References

Abbas Abdolmaleki, Jost Tobias Springenberg, Yuval Tassa, Remi Munos, Nicolas Heess, and Martin Riedmiller. Maximum a posteriori policy optimisation.
2018. Preprint arXiv:1806.06920.
Samira Abnar and Willem Zuidema. Quantifying attention flow in transformers. 2020. Preprint arXiv:2005.00928.
Michael Ahn, Anthony Brohan, Noah Brown, Yevgen Chebotar, Omar Cortes, Byron David, Chelsea Finn, Keerthana Gopalakrishnan, Karol Hausman, Alex Herzog, et al. Do as I can, not as I say: Grounding language in robotic affordances. 2022. Preprint arXiv:2204.01691.
Jean-Baptiste Alayrac, Jeff Donahue, Pauline Luc, Antoine Miech, Iain Barr, Yana Hasson, Karel Lenc, Arthur Mensch, Katie Millican, Malcolm Reynolds, Roman Ring, Eliza Rutherford, Serkan Cabi, Tengda Han, Zhitao Gong, Sina Samangooei, Marianne Monteiro, Jacob Menick, Sebastian Borgeaud, Andy Brock, Aida Nematzadeh, Sahand Sharifzadeh, Mikolaj Binkowski, Ricardo Barreira, Oriol Vinyals, Andrew Zisserman, and Karen Simonyan. Flamingo: a visual language model for few-shot learning. 2022. Preprint arXiv:2204.14198.
Dario Amodei, Chris Olah, Jacob Steinhardt, Paul F. Christiano, John Schulman, and Dan Mané. Concrete problems in AI safety. 2016. Preprint arXiv:1606.06565.
Stanislaw Antol, Aishwarya Agrawal, Jiasen Lu, Margaret Mitchell, Dhruv Batra, C Lawrence Zitnick, and Devi Parikh. VQA: Visual question answering. In International Conference on Computer Vision, pp. 2425–2433, 2015.
Jimmy Lei Ba, Jamie Ryan Kiros, and Geoffrey E Hinton. Layer normalization. 2016. Preprint arXiv:1607.06450.
Paul Bach-y Rita and Stephen W Kercel. Sensory substitution and the human-machine interface. Trends in Cognitive Sciences, 7(12):541–546, 2003.
Bowen Baker, Ilge Akkaya, Peter Zhokhov, Joost Huizinga, Jie Tang, Adrien Ecoffet, Brandon Houghton, Raul Sampedro, and Jeff Clune. Video pretraining (VPT): Learning to act by watching unlabeled online videos. 2022. Preprint arXiv:2206.11795.
Gabriel Barth-Maron, Matthew W Hoffman, David Budden, Will Dabney, Dan Horgan, Dhruva Tb, Alistair Muldal, Nicolas Heess, and Timothy Lillicrap.
Distributed distributional deterministic policy gradients. 2018. Preprint arXiv:1804.08617.
Charles Beattie, Joel Z Leibo, Denis Teplyashin, Tom Ward, Marcus Wainwright, Heinrich Küttler, Andrew Lefrancq, Simon Green, Víctor Valdés, Amir Sadik, et al. DeepMind lab. 2016. Preprint arXiv:1612.03801.
Marc G Bellemare, Yavar Naddaf, Joel Veness, and Michael Bowling. The arcade learning environment: An evaluation platform for general agents. Journal of Artificial Intelligence Research, 47:253–279, 2013.
Rishi Bommasani, Drew A Hudson, Ehsan Adeli, Russ Altman, Simran Arora, Sydney von Arx, Michael S Bernstein, Jeannette Bohg, Antoine Bosselut, Emma Brunskill, et al. On the opportunities and risks of foundation models. 2021. Preprint arXiv:2108.07258.
Sebastian Borgeaud, Arthur Mensch, Jordan Hoffmann, Trevor Cai, Eliza Rutherford, Katie Millican, George van den Driessche, Jean-Baptiste Lespiau, Bogdan Damoc, Aidan Clark, et al. Improving language models by retrieving from trillions of tokens. 2021. Preprint arXiv:2112.04426.
Nick Bostrom. Superintelligence. Dunod, 2017.
Greg Brockman, Vicki Cheung, Ludwig Pettersson, Jonas Schneider, John Schulman, Jie Tang, and Wojciech Zaremba. OpenAI gym. 2016. Preprint arXiv:1606.01540.
TB Brown, B Mann, N Ryder, M Subbiah, J Kaplan, P Dhariwal, A Neelakantan, P Shyam, G Sastry, A Askell, et al. Language models are few-shot learners. In Advances in Neural Information Processing Systems, pp. 1877–1901, 2020.
Serkan Cabi, Sergio Gómez Colmenarejo, Alexander Novikov, Ksenia Konyushkova, Scott Reed, Rae Jeong, Konrad Zolna, Yusuf Aytar, David Budden, Mel Vecerik, et al. Scaling data-driven robotics with reward sketching and batch reinforcement learning. 2019. Preprint arXiv:1909.12200.
Annie S Chen, Suraj Nair, and Chelsea Finn. Learning generalizable robotic reward functions from “in-the-wild” human videos. 2021a.
Preprint arXiv:2103.16817.
Lili Chen, Kevin Lu, Aravind Rajeswaran, Kimin Lee, Aditya Grover, Misha Laskin, Pieter Abbeel, Aravind Srinivas, and Igor Mordatch. Decision transformer: Reinforcement learning via sequence modeling. Advances in Neural Information Processing Systems, 34, 2021b.
Mark Chen, Jerry Tworek, Heewoo Jun, Qiming Yuan, Henrique Ponde de Oliveira Pinto, Jared Kaplan, Harri Edwards, Yuri Burda, Nicholas Joseph, Greg Brockman, et al. Evaluating large language models trained on code. 2021c. Preprint arXiv:2107.03374.
Tao Chen, Adithyavairavan Murali, and Abhinav Gupta. Hardware conditioned policies for multi-robot transfer learning. Advances in Neural Information Processing Systems, 31, 2018.
Ting Chen, Saurabh Saxena, Lala Li, David J Fleet, and Geoffrey Hinton. Pix2seq: A language modeling framework for object detection. In ICLR, 2022.
Xinlei Chen, Hao Fang, Tsung-Yi Lin, Ramakrishna Vedantam, Saurabh Gupta, Piotr Dollár, and C Lawrence Zitnick. Microsoft COCO captions: Data collection and evaluation server. 2015. Preprint arXiv:1504.00325.
Maxime Chevalier-Boisvert, Dzmitry Bahdanau, Salem Lahlou, Lucas Willems, Chitwan Saharia, Thien Huu Nguyen, and Yoshua Bengio. BabyAI: A platform to study the sample efficiency of grounded language learning. 2018. Preprint arXiv:1810.08272.
Aakanksha Chowdhery, Sharan Narang, Jacob Devlin, Maarten Bosma, Gaurav Mishra, Adam Roberts, Paul Barham, Hyung Won Chung, Charles Sutton, Sebastian Gehrmann, et al. PaLM: Scaling language modeling with pathways. 2022. Preprint arXiv:2204.02311.
Karl Cobbe, Chris Hesse, Jacob Hilton, and John Schulman. Leveraging procedural generation to benchmark reinforcement learning. In International Conference on Machine Learning, pp. 2048–2056, 2020.
Zihang Dai, Zhilin Yang, Yiming Yang, Jaime G Carbonell, Quoc Le, and Ruslan Salakhutdinov. Transformer-XL: Attentive language models beyond a fixed-length context. In Annual Meeting of the Association for Computational Linguistics, pp. 2978–2988, 2019.
Coline Devin, Abhishek Gupta, Trevor Darrell, Pieter Abbeel, and Sergey Levine. Learning modular neural network policies for multi-task and multi-robot transfer. In IEEE International Conference on Robotics & Automation, pp. 2169–2176, 2017.
Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. BERT: Pre-training of deep bidirectional transformers for language understanding. 2018. Preprint arXiv:1810.04805.
Alexey Dosovitskiy, Lucas Beyer, Alexander Kolesnikov, Dirk Weissenborn, Xiaohua Zhai, Thomas Unterthiner, Mostafa Dehghani, Matthias Minderer, Georg Heigold, Sylvain Gelly, et al. An image is worth 16x16 words: Transformers for image recognition at scale. 2020. Preprint arXiv:2010.11929.
Lasse Espeholt, Hubert Soyer, Remi Munos, Karen Simonyan, Vlad Mnih, Tom Ward, Yotam Doron, Vlad Firoiu, Tim Harley, Iain Dunning, et al. IMPALA: Scalable distributed deep-RL with importance weighted actor-learner architectures. In International Conference on Machine Learning, pp. 1407–1416, 2018.
Justin Fu, Aviral Kumar, Ofir Nachum, George Tucker, and Sergey Levine. D4RL: Datasets for deep data-driven reinforcement learning. 2020. Preprint arXiv:2004.07219.
Hiroki Furuta, Yutaka Matsuo, and Shixiang Shane Gu. Generalized decision transformer for offline hindsight information matching. 2021. Preprint arXiv:2111.10364.
Caglar Gulcehre, Ziyu Wang, Alexander Novikov, Thomas Paine, Sergio Gómez, Konrad Zolna, Rishabh Agarwal, Josh S Merel, Daniel J Mankowitz, Cosmin Paduraru, et al. RL unplugged: A suite of benchmarks for offline reinforcement learning. Advances in Neural Information Processing Systems, 33:7248–7259, 2020.
Jeff Hawkins and Sandra Blakeslee. On intelligence. Macmillan, 2004.
Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image recognition. In IEEE Computer Vision and Pattern Recognition, pp. 770–778, 2016a.
Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Identity mappings in deep residual networks. In European Conference on Computer Vision, pp. 630–645, 2016b.
Kaiming He, Haoqi Fan, Yuxin Wu, Saining Xie, and Ross Girshick. Momentum contrast for unsupervised visual representation learning. In IEEE Computer Vision and Pattern Recognition, pp. 9729–9738, 2020.
Dan Hendrycks and Kevin Gimpel. Gaussian error linear units (GELUs). 2016. Preprint arXiv:1606.08415.
Matteo Hessel, Hubert Soyer, Lasse Espeholt, Wojciech Czarnecki, Simon Schmitt, and Hado van Hasselt. Multi-task deep reinforcement learning with PopArt. In AAAI, 2019.
Matteo Hessel, Ivo Danihelka, Fabio Viola, Arthur Guez, Simon Schmitt, Laurent Sifre, Theophane Weber, David Silver, and Hado van Hasselt. Muesli: Combining improvements in policy optimization. 2021. Preprint arXiv:2104.06159.
Sepp Hochreiter and Jürgen Schmidhuber. Long short-term memory. Neural Computation, 9(8):1735–1780, 1997.
Jordan Hoffmann, Sebastian Borgeaud, Arthur Mensch, Elena Buchatskaya, Trevor Cai, Eliza Rutherford, Diego de Las Casas, Lisa Anne Hendricks, Johannes Welbl, Aidan Clark, et al. Training compute-optimal large language models. 2022. Preprint arXiv:2203.15556.
Gao Huang, Yu Sun, Zhuang Liu, Daniel Sedra, and Kilian Weinberger. Deep networks with stochastic depth. 2016. Preprint arXiv:1603.09382.
Wenlong Huang, Igor Mordatch, and Deepak Pathak. One policy to control them all: Shared modular policies for agent-agnostic control. In International Conference on Machine Learning, pp. 4455–4464, 2020.
Wenlong Huang, Pieter Abbeel, Deepak Pathak, and Igor Mordatch. Language models as zero-shot planners: Extracting actionable knowledge for embodied agents. 2022. Preprint arXiv:2201.07207.
David Yu-Tung Hui, Maxime Chevalier-Boisvert, Dzmitry Bahdanau, and Yoshua Bengio. BabyAI 1.1. 2020.
Preprint arXiv:2007.12770.
Andrew Jaegle, Sebastian Borgeaud, Jean-Baptiste Alayrac, Carl Doersch, Catalin Ionescu, David Ding, Skanda Koppula, Daniel Zoran, Andrew Brock, Evan Shelhamer, et al. Perceiver IO: A general architecture for structured inputs & outputs. 2021. Preprint arXiv:2107.14795.
Michael Janner, Qiyang Li, and Sergey Levine. Offline reinforcement learning as one big sequence modeling problem. Advances in Neural Information Processing Systems, 34, 2021.
Chao Jia, Yinfei Yang, Ye Xia, Yi-Ting Chen, Zarana Parekh, Hieu Pham, Quoc Le, Yun-Hsuan Sung, Zhen Li, and Tom Duerig. Scaling up visual and vision-language representation learning with noisy text supervision. In International Conference on Machine Learning, pp. 4904–4916, 2021.
Melvin Johnson, Orhan Firat, and Roee Aharoni. Massively multilingual neural machine translation. In Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pp. 3874–3884, 2019.
John Jumper, Richard Evans, Alexander Pritzel, Tim Green, Michael Figurnov, Olaf Ronneberger, Kathryn Tunyasuvunakool, Russ Bates, Augustin Žídek, Anna Potapenko, et al. Highly accurate protein structure prediction with AlphaFold. Nature, 596(7873):583–589, 2021.
Lukasz Kaiser, Aidan N Gomez, Noam Shazeer, Ashish Vaswani, Niki Parmar, Llion Jones, and Jakob Uszkoreit. One model to learn them all. 2017. Preprint arXiv:1706.05137.
Anssi Kanervisto, Joonas Pussinen, and Ville Hautamäki. Benchmarking end-to-end behavioural cloning on video games. In IEEE Conference on Games (CoG), pp. 558–565, 2020.
Jared Kaplan, Sam McCandlish, Tom Henighan, Tom B Brown, Benjamin Chess, Rewon Child, Scott Gray, Alec Radford, Jeffrey Wu, and Dario Amodei. Scaling laws for neural language models. 2020. Preprint arXiv:2001.08361.
Steven Kapturowski, Georg Ostrovski, John Quan, Remi Munos, and Will Dabney. Recurrent experience replay in distributed reinforcement learning. In International Conference on Learning Representations, 2018.
Zachary Kenton, Tom Everitt, Laura Weidinger, Iason Gabriel, Vladimir Mikulik, and Geoffrey Irving. Alignment of language agents. 2021. Preprint arXiv:2103.14659.
Nitish Shirish Keskar, Bryan McCann, Lav R Varshney, Caiming Xiong, and Richard Socher. CTRL: A conditional transformer language model for controllable generation. 2019. Preprint arXiv:1909.05858.
Diederik P. Kingma and Jimmy Ba. Adam: A method for stochastic optimization. 2014. Preprint arXiv:1412.6980.
Taku Kudo and John Richardson. SentencePiece: A simple and language independent subword tokenizer and detokenizer for neural text processing. In Annual Meeting of the Association for Computational Linguistics, pp. 66–71, 2018.
Vitaly Kurin, Maximilian Igl, Tim Rocktäschel, Wendelin Boehmer, and Shimon Whiteson. My body is a cage: the role of morphology in graph-based incompatible control. 2020. Preprint arXiv:2010.01856.
Alex X Lee, Coline Manon Devin, Yuxiang Zhou, Thomas Lampe, Konstantinos Bousmalis, Jost Tobias Springenberg, Arunkumar Byravan, Abbas Abdolmaleki, Nimrod Gileadi, David Khosid, et al. Beyond pick-and-place: Tackling robotic stacking of diverse shapes. In Conference on Robot Learning, 2021.
Alex X Lee, Coline Manon Devin, Jost Tobias Springenberg, Yuxiang Zhou, Thomas Lampe, Abbas Abdolmaleki, and Konstantinos Bousmalis. How to spend your robot time: Bridging kickstarting and offline reinforcement learning for vision-based robotic manipulation. 2022. Preprint arXiv:2205.03353.
Shuang Li, Xavier Puig, Chris Paxton, Yilun Du, Clinton Wang, Linxi Fan, Tao Chen, De-An Huang, Ekin Akyürek, Anima Anandkumar, Jacob Andreas, Igor Mordatch, Antonio Torralba, and Yuke Zhu. Pre-trained language models for interactive decision-making. 2022a. Preprint arXiv:2202.01771.
Yujia Li, David Choi, Junyoung Chung, Nate Kushman, Julian Schrittwieser, Rémi Leblond, Tom Eccles, James Keeling, Felix Gimeno, Agustin Dal Lago, et al.
د AlphaCode سره د رقابتي کډ تولید. , 2022b. Preprint arXiv:2203.07814 Ilya Loshchilov and Frank Hutter. Decoupled weight decay regularization. , 2017. Preprint arXiv:1711.05101 Kenneth Marino، Mohammad Rastegari، Ali Farhadi، او Roozbeh Mottaghi. Ok-VQA: د ویزیکي پوښتنې په ځواب کې د بېلابېلو پوهې ته اړتيا لري. په ,pp. 3195–3204, 2019. IEEE Computer Vision and Pattern Recognition Jacob Menick, Maja Trebacz, Vladimir Mikulik, John Aslanides, Francis Song, Martin Chadwick, Mia Glaese, Susannah Young, Lucy Campbell-Gillingham, Geoffrey Irving, et al. Teaching language models to support answers with verified quotes. , 2022. Preprint arXiv:2203.11147 Margaret Mitchell, Simone Wu, Andrew Zaldivar, Parker Barnes, Lucy Vasserman, Ben Hutchinson, Elena Spitzer, Inioluwa Deborah Raji, and Timnit Gebru. Model cards for model reporting. In , pp. 220–229, 2019. د عدالت، حسابولو او شفافیت په اړه د کنفرانس پروسه Volodymyr Mnih, Koray Kavukcuoglu, David Silver, Andrei A Rusu, Joel Veness, Marc G Bellemare, Alex Graves, Martin Riedmiller, Andreas K Fidjeland, Georg Ostrovski, et al. Human-level control through deep reinforcement learning. , 518(7540):529–533, 2015. Nature Vernon Mountcastle. د دماغ دنده سازماني اصل: د واحد ماډل او توزیع سیستم. , 1978. د ذهني دماغ Reiichiro Nakano, Jacob Hilton, Suchir Balaji, Jeff Wu, Long Ouyang, Christina Kim, Christopher Hesse, Shantanu Jain, Vineet Kosaraju, William Saunders, et al. WebGPT: Browser-assisted question-answering with human feedback. , 2021. Preprint arXiv:2112.09332 Aaron van den Oord, Sander Dieleman, Heiga Zen, Karen Simonyan, Oriol Vinyals, Alex Graves, Nal Kalchbrenner, Andrew Senior, and Koray Kavukcuoglu. WaveNet: A generative model for raw audio. , 2016. Preprint arXiv:1609.03499 Pedro A Ortega, Markus Kunesch, Grégoire Delétang, Tim Genewein, Jordi Grau-Moya, Joel Veness, Jonas Buchli, Jonas Degrave, Bilal Piot, Julien Perolat, et al. Shaking the foundations: delusions in sequence models for interaction and control. , 2021. 
Preprint arXiv:2110.10819 Long Ouyang, Jeff Wu, Xu Jiang, Diogo Almeida, Carroll L Wainwright, Pamela Mishkin, Chong Zhang, Sandhini Agarwal, Katarina Slama, Alex Ray, et al. Training language models to follow instructions with human feedback. , 2022. Preprint arXiv:2203.02155 Simone Parisi، Aravind Rajeswaran، Senthil Purushwalkam، او Abhinav Gupta. د کنترول لپاره د مخکښ بصری ماډلونو غیرقانوني اغیزمنتیا. , 2022. د پروپیلن arXiv:2203.03580 Vineel Pratap, Anuroop Sriram, Paden Tomasello, Awni Hannun, Vitaliy Liptchinsky, Gabriel Synnaeve, and Ronan Collobert. Massively multilingual ASR: 50 languages, 1 model, 1 billion parameters. , 2020. د پروپیلن arXiv:2007.03001 Sébastien Racanière, Théophane Weber, David Reichert, Lars Buesing, Arthur Guez, Danilo Jimenez Rezende, Adrià Puigdomènech Badia, Oriol Vinyals, Nicolas Heess, Yujia Li, et al. Imagination-augmented agents for deep reinforcement learning. , 30, 2017. Advances in Neural Information Processing Systems Jack W Rae، Sebastian Borgeaud، Trevor Cai، Katie Millican، Jordan Hoffmann، Francis Song، John Aslanides، Sarah Henderson، Roman Ring، Susannah Young، او نورو. د ژرنده لغوي ماډلونه: د روزنې gopher څخه روشونه، تحلیلونه او بصیرتونه. , 2021. د پروپیلن arXiv:2112.11446 Scott Reed and Nando De Freitas. Neural programmer-interpreters. In , 2016. International Conference on Learning Representations Machel Reid, Yutaro Yamada, and Shixiang Shane Gu. Can Wikipedia help offline reinforcement learning? , 2022. Preprint arXiv:2201.12122 Stuart Russell. Penguin، 2019 Andrei A Rusu، Neil C Rabinowitz، Guillaume Desjardins، Hubert Soyer، James Kirkpatrick، Koray Human compatible: Artificial intelligence and the problem of control Kavukcuoglu, Razvan Pascanu, and Raia Hadsell. Progressive neural networks. , 2016. 
Preprint arXiv:1606.04671 Victor Sanh, Albert Webson, Colin Raffel, Stephen Bach, Lintang Sutawika, Zaid Alyafeai, Antoine Chaffin, Arnaud Stiegler, Arun Raja, Manan Dey, M Saiful Bari, Canwen Xu, Urmish Thakker, Shanya Sharma Sharma, Eliza Szczechla, Taewoon Kim, Gunjan Chhablani, Nihal Nayak, Debajyoti Datta, Jonathan Chang, Mike Tian-Jian Jiang, Han Wang, Matteo Manica, Sheng Shen, Zheng Xin Yong, Harshit Pandey, Rachel Bawden, Thomas Wang, Trishala Neeraj, Jos Rozen, Abheesht Sharma, Andrea Santilli, Thibault Fevry, Jason Alan Fries, Ryan Teehan, Teven Le Scao, Stella Biderman, Leo Gao, Thomas Wolf, and Alexander M Rush. Multitask prompted training enables zero-shot task generalization. In , 2022. International Conference on Learning Representations Jürgen Schmidhuber. One big net for everything. , 2018. Preprint arXiv:1802.08864 Julian Schrittwieser، Ioannis Antonoglou، Thomas Hubert، Karen Simonyan، Laurent Sifre، Simon Schmitt، Arthur Guez، Edward Lockhart، Demis Hassabis، Thore Graepel، et al. د زده کړې ماډل له لارې د atari، go، شطرنج او shogi مدیریت. , 588(7839):604–609, 2020. Nature Piyush Sharma, Nan Ding, Sebastian Goodman, and Radu Soricut. Conceptual captions: A cleaned, hyper-nymed, image alt-text dataset for automatic image captioning. In ، pp. 2556–2565, 2018 د کمپیوټریال لغاتیک ایسوسی ایشن د کلني کنفرانس Noam Shazeer. Glu ویروسونه د ترانسپورټر ښه کړي. , 2020. Preprint arXiv::2002.05202 H Francis Song, Abbas Abdolmaleki, Jost Tobias Springenberg, Aidan Clark, Hubert Soyer, Jack W Rae, Seb Noury, Arun Ahuja, Siqi Liu, Dhruva Tirumala, et al. V-mpo: On-policy maximum a posteriori policy optimization for discrete and continuous control. In , 2020. د ICLR Nitish Srivastava, Geoffrey Hinton, Alex Krizhevsky, Ilya Sutskever, and Ruslan Salakhutdinov. Dropout: A simple way to prevent neural networks from overfitting. ، 15(56): 1929-1958، د 2014 کال. Journal of Machine Learning Research Richard Sutton. The bitter lesson. , 13:12, 2019. 
Incomplete Ideas (blog) Yuval Tassa, Yotam Doron, Alistair Muldal, Tom Erez, Yazhe Li, Diego de Las Casas, David Budden, Abbas Abdolmaleki, Josh Merel, Andrew Lefrancq, et al. DeepMind control suite. د 2018 کال Preprint arXiv:1801.00690 Romal Thoppilan, Daniel De Freitas, Jamie Hall, Noam Shazeer, Apoorv Kulshreshtha, Heng-Tze Cheng, Alicia Jin, Taylor Bos, Leslie Baker, Yu Du, et al. LaMDA: Language models for dialog applications. , 2022. د پروپیلن arXiv:2201.08239 Emanuel Todorov, Tom Erez, and Yuval Tassa. Mujoco: A physics engine for model-based control. In ، pp. 5026–5033, 2012 د هوښيار روبوټونو او سیسټمونو په اړه نړیوال کنفرانس Maria Tsimpoukelli, Jacob L Menick, Serkan Cabi, SM Eslami, Oriol Vinyals, and Felix Hill. Multimodal few-shot learning with frozen language models. , pp. 200–212, 2021. Advances in Neural Information Processing Systems Saran Tunyasuvunakool, Alistair Muldal, Yotam Doron, Siqi Liu, Steven Bohez, Josh Merel, Tom Erez, Timothy Lillicrap, Nicolas Heess, and Yuval Tassa. dm_control: Software and tasks for continuous control. , 6:100022, 2020. Software Impacts Ashish Vaswani، Noam Shazeer، Niki Parmar، Jakob Uszkoreit، Llion Jones، Aidan N Gomez، Łukasz Kaiser، او Illia Polosukhin. مراقبت ټول تاسو ته اړتيا لري. , 30, 2017. Advances in Neural Information Processing Systems Zirui Wang, Jiahui Yu, Adams Wei Yu, Zihang Dai, Yulia Tsvetkov, and Yuan Cao. Simvlm: Simple visual language model pretraining with weak supervision. , 2021. Preprint arXiv:2108.10904 Ziyu Wang، Alexander Novikov، Konrad Zolna، Josh S Merel، Jost Tobias Springenberg، Scott E Reed، Bobak Shahriari، Noah Siegel، Caglar Gulcehre، Nicolas Heess، او نورو. د انتقالي منظم ریګریشن. , 33:7768–7778, 2020. د عصري معلوماتو پروسس سیسټمونو پرمختګ Jason Wei, Maarten Bosma, Vincent Y Zhao, Kelvin Guu, Adams Wei Yu, Brian Lester, Nan Du, Andrew M Dai, and Quoc V Le. Finetuned language models are zero-shot learners. , 2021. 
Preprint arXiv:2109.01652 Laura Weidinger, John Mellor, Maribeth Rauh, Conor Griffin, Jonathan Uesato, Po-Sen Huang, Myra Cheng, Mia Glaese, Borja Balle, Atoosa Kasirzadeh, et al. Ethical and social risks of harm from language models. , 2021. Preprint arXiv:2112.04359 Yuxin Wu and Kaiming He. Group normalization. In , pp. 3–19, 2018. د کمپيوټر بصری اروپايي کنفرانس Tianhe Yu, Deirdre Quillen, Zhanpeng He, Ryan Julian, Karol Hausman, Chelsea Finn, and Sergey Levine. Meta-World: A benchmark and evaluation for multi-task and meta reinforcement learning. In , pp. 1094–1100, 2020. Conference on Robot Learning Qinqing Zheng, Amy Zhang, and Aditya Grover. Online decision transformer. په 2022 کې. Preprint arXiv:2202.05607 Konrad Zolna, Alexander Novikov, Ksenia Konyushkova, Caglar Gulcehre, Ziyu Wang, Yusuf Aytar, Misha Denil, Nando de Freitas, and Scott Reed. Offline learning from demonstrations and unlabeled experience. د 2020 کال. د چاپولو لپاره arXiv:2011.13885 Konrad Zolna، Scott Reed، Alexander Novikov، Sergio Gómez Colmenarejo، ډیوډ Budden، Serkan Cabi، Misha Denil، Nando de Freitas، او Ziyu Wang. د کار په اړه د مخالفي تبادلې زده کړې. په , pp. 247–263, 2021. Conference on Robot Learning Supplementary Material A Model card We present a model card for Gato in Table 4. Table 4: We follow the framework proposed in Gato Model Card. (Mitchell et al., 2019). B Agent Data Tokenization Details In this section we provide additional details on our tokenization schemes. Our agent data is sequenced as follows: • are presented to the agent in order of time (timesteps). Episodes • in turn are presented in the following order: Timesteps ([ 1: 1: 1: ]) are ordered lexicographically by key, each item is sequenced as follows: – Observations y k, x m, z n ∗ Text tokens ( 1: ) are in the same order as the raw input text. y k ∗ Image patch tokens ( 1 : ) are in raster order. x m ∗ Tensors ( 1: ) (such as discrete and continuous observations) are in row-major order. 
– Separator ('|'); a designated separator token is provided after observations.
– Actions ($a_{1:A}$) are tokenized as discrete or continuous values and in row-major order.
A full sequence of tokens is thus given as the concatenation of data from T timesteps:
$s_{1:L} = [[y^1_{1:k}, x^1_{1:m}, z^1_{1:n}, '|', a^1_{1:A}], \ldots, [y^T_{1:k}, x^T_{1:m}, z^T_{1:n}, '|', a^T_{1:A}]]$,
where L = T(k + m + n + 1 + A) is the total number of tokens.
Each floating point element of tensors in the observation sequence is mu-law companded as in WaveNet (Oord et al., 2016):
$F(x) = \mathrm{sgn}(x) \frac{\log(|x|\mu + 1.0)}{\log(M\mu + 1.0)}$
with parameters μ = 100 and M = 256. (If the floating-point tensor is in the action set, we do not need to compand its elements, because actions are only defined in the range [−1, 1] for all our environments.) All elements are subsequently clipped so that they fall in the range [−1, 1]. Finally, they are discretized using bins of uniform width on the domain [−1, 1]. We use 1024 bins and shift the resulting integers so that they do not overlap with the ones used for text tokens. The tokenized result is therefore a sequence of integers within the range [32000, 33024).
See Figure 14 and Figure 15 for visualizations of tokenizing and sequencing values (both discrete and continuous) and images. See Section C for details about local position encodings referenced in the figures.

C Model Architecture

C.1 Transformer Hyperparameters
The transformer hyperparameters of Gato are presented in Table 5. We also list the hyperparameters of smaller architecture variants used in Section 5.

C.2 Embedding Function
The ResNet block uses the v2 architecture (He et al., 2016b), contains GroupNorm (Wu & He, 2018) with 32 groups instead of LayerNorm (Ba et al., 2016), and GELU (Hendrycks & Gimpel, 2016) activation functions instead of RELU. The block is diagrammed in Figure 16.

C.3 Position Encodings
After tokens are mapped into token embeddings, two position encodings are added to the token embeddings (when applicable) to provide temporal and spatial information to the model. These are described below.
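The value tokenization described in Section B (companding, clipping, uniform binning, and shifting past the text vocabulary) can be sketched in a few lines of Python. This is a minimal illustration, not the authors' code: it assumes the companding function has the WaveNet-style form F(x) = sgn(x) log(|x|μ + 1) / log(Mμ + 1), and the function names, the `TEXT_VOCAB` offset constant, and the handling of the upper bin edge are hypothetical choices made for the example.

```python
import math

MU = 100.0          # mu-law parameter given in the text
M = 256.0           # scaling parameter given in the text
NUM_BINS = 1024     # uniform bins over [-1, 1]
TEXT_VOCAB = 32000  # assumed offset: value tokens start after the text tokens


def mu_law(x: float) -> float:
    """Compand x with the assumed form sgn(x) * log(|x|*mu + 1) / log(M*mu + 1)."""
    magnitude = math.log(abs(x) * MU + 1.0) / math.log(M * MU + 1.0)
    return math.copysign(magnitude, x)


def tokenize_value(x: float, is_action: bool = False) -> int:
    """Map one float to a token id in [32000, 33024)."""
    # Actions are already defined on [-1, 1], so they skip companding.
    y = x if is_action else mu_law(x)
    # Clip to [-1, 1] before binning.
    y = max(-1.0, min(1.0, y))
    # Uniform-width bins on [-1, 1]; y == 1.0 is folded into the last bin
    # (hypothetical edge handling, not specified in the text).
    bin_index = min(int((y + 1.0) / 2.0 * NUM_BINS), NUM_BINS - 1)
    return TEXT_VOCAB + bin_index


def sequence_length(T: int, k: int, m: int, n: int, A: int) -> int:
    """Total tokens L = T(k + m + n + 1 + A); the +1 is the separator token."""
    return T * (k + m + n + 1 + A)
```

Under these assumptions, an observation value of 0.0 falls into bin 512 and becomes token 32512, while an action of −1.0 becomes token 32000.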
Patch Position Encodings
These position encodings convey information about a patch's global position within the image from which the patch was extracted. First, the relative row and column intervals of the patch are calculated by normalizing the patch's pixel intervals by the image resolution. The row and column normalized intervals are then quantized into a vocabulary size (we use 128) and are used to index a row and column table of learnable position encodings. The method by which the quantized row and column intervals are converted into indices depends on whether we are training or evaluating the model: during training a random index is uniformly sampled from the quantized interval, while during evaluation we deterministically take the (rounded) mean of the interval. Once the row and column position encodings are retrieved from the embedding table, they are added onto the token embedding produced by the ResNet embedding function, as described previously.
To demonstrate this process more concretely, we provide an example in Figure 17. We follow the process for the patch highlighted in red. The image has a resolution of 80 × 64 and each patch is 16 × 16, so there are 5 × 4 = 20 patches in total. The highlighted patch starts at pixel row interval [16, 32] and pixel column interval [32, 48]. Normalized, the row interval is therefore [0.25, 0.5] and the column interval is [0.4, 0.6]. We then separately quantize the intervals into 128 uniformly spaced bins.

Local Observation Position Encodings
The local observation position encoding adds positional information about where observation tokens are positioned within the local time-step they were an element of. First, we reiterate that, during tokenization, for each time-step all elements of the observation set are tokenized into sequences and concatenated into an observation sequence. Each token in this observation sequence is given an index which corresponds to the sequence order, i.e.
the first token is 0 and the last is the length of the observation sequence minus one. After embedding, for any tokens that were a part of an observation set, the corresponding observation token index is used to index an embedding table of learnable position encodings, with one embedding for every possible observation token index (in practice we simply set the table size to a large value like 512). The position encoding is then added onto the observation token embedding to produce the final token embedding. Note that all action tokens are given the same position encoding regardless of their position in the time-step sequence. We illustrate an example of this process in Figure 18.

D Pretraining Setup
Optimizer: For all models we use the AdamW (Loshchilov & Hutter, 2017) optimizer with a linear warm-up and cosine schedule decay. The linear warmup lasts for 15,000 steps, starting from a learning rate of 1e-7 and ending at a different maximum learning rate depending on the model (see Table 6). This learning rate is then cosine decayed by a factor of 10x over 1,000,000 steps. The AdamW optimizer has parameters β1 = 0.9, β2 = 0.95 and ϵ = 1e-8. We use a batch size of 512 and a sequence length of 1024 tokens for all models.
Regularization: We train with an AdamW weight decay parameter of 0.1. Additionally, we use stochastic depth (Huang et al., 2016) during pretraining, where each of the transformer sub-layers (i.e. each Multi-Head Attention and Dense Feedforward layer) is skipped with a probability of 0.1.

E Fine-tuning Setup
Optimizer: For all models we use the Adam (Kingma & Ba, 2014) optimizer with a constant learning rate of 1e-5. The Adam optimizer has parameters β1 = 0.9, β2 = 0.95 and ϵ = 1e-8. We use a batch size of 64 and a sequence length of 1024 tokens for all models. We train for 10,000 gradient steps.
Regularization: We use dropout (Srivastava et al., 2014) with a rate of 0.1.
Evaluation: We evaluate the agent every 100 learning steps. Each evaluation reports the average of 10 runs of a given checkpoint. The moving average of 5 such scores is computed (to gather 50 runs together). The final fine-tuning performance is defined as the maximum of these smoothed scores.
Datasets: We generated data for the fine-tuning tasks the same way we did for the other tasks (see Section 3.1 for details). Instead of using all the data for a fine-tuning task, we discarded all but the 2000 best episodes (those leading to the highest returns). The fine-tuning datasets were created in the following way. We randomly took 1000 episodes (out of the 2000 preselected episodes), then a subset of 100 episodes from the selected episodes, then 10, 5, 3, and finally a single episode. We repeated this procedure 3 times to obtain 3 series of cascading subsets for each task. Each subset is used to conduct one fine-tuning experiment, and each is reported on our plots in Section 5.2 as a separate point.
Task settings: We have not altered any of the tasks and used their canonical versions. As 3 out of 4 tasks are open sourced, they do not need further explanation. For the fourth task, DMLab order_of_apples_forage_simple, the goal is to collect apples in the right order, green ones first followed by the gold one.

F Data Collection Details

F.1 Atari
We collect two separate sets of Atari environments. The first (that we refer to as ALE Atari) consists of 51 canonical games from the Arcade Learning Environment (Bellemare et al., 2013). The second (that we refer to as ALE Atari Extended) is a set of alternative games with their game mode and difficulty randomly set at the beginning of each episode.
For each environment in these sets we collect data by training a Muesli (Hessel et al., 2021) agent for 200M total environment steps. We record approximately 20,000 random episodes generated by the agent during training.

F.2 Sokoban
Sokoban is a planning problem in which the agent has to push boxes to target locations.
Some of the moves are irreversible and consequently mistakes can render the puzzle unsolvable. Planning ahead of time is therefore necessary to succeed at this puzzle (Racanière et al., 2017). We use a Muesli (Hessel et al., 2021) agent to collect training data.

F.3 BabyAI
BabyAI is a gridworld environment whose levels consist of instruction-following tasks described in a synthetic language. We generate data for these levels with the built-in BabyAI bot. The bot has access to extra information which is used to execute optimal solutions; see Chevalier-Boisvert et al. (2018) for more details about the bot. We collect 100,000 episodes for each level.

F.4 DeepMind Control Suite
The DeepMind Control Suite (Tunyasuvunakool et al., 2020; Tassa et al., 2018) is a set of physics-based simulation environments. For each task in the control suite we collect two disjoint sets of data, one using only state features and another using only pixels. We use a D4PG (Barth-Maron et al., 2018) agent to collect data from tasks with state features, and an MPO (Abdolmaleki et al., 2018) based agent to collect data using pixels.
We also collect data for randomized versions of the control suite tasks with a D4PG agent. These versions randomize the actuator gear, joint range, stiffness, and damping, and geom size and density. There are two difficulty settings for the randomized versions. The small setting scales values by a random number sampled from the union of intervals [0.9, 0.95] ∪ [1.05, 1.1]. The large setting scales values by a random number sampled from the union of intervals [0.6, 0.8] ∪ [1.2, 1.4].

F.5 DeepMind Lab
DeepMind Lab (Beattie et al., 2016) is a first-person 3D environment designed to teach agents 3D vision from raw pixel inputs with an egocentric viewpoint, navigation, and planning.
We trained an IMPALA agent jointly on a set of 18 parent DM Lab levels which generate maps procedurally for each new episode.
Data was collected by executing the IMPALA agent (Espeholt et al., 2018) on these 18 levels, as well as on an additional set of 237 levels handcrafted to test a diverse set of skills.
The 18 parent levels are characterized by a high diversity of generated maps. The difference between the levels is rooted in the hyper-parameters used in the generation process. These hyper-parameters control high-level characteristics such as the types of structures spawned, the difficulty of language instructions, or the presence of specific tools. The parent levels were developed to improve the performance of RL agents trained online on them.
In contrast to the parent levels, each of the additional 237 handcrafted levels uses almost the same map, and the main differences between instances of the same level map are aesthetics such as the colors of walls or lighting conditions. The maps are not procedurally generated and were designed to test a diverse set of skills such as walking up stairs or using specific tools. They are similar to the levels presented in Figure 3, Figure 7 and Figure 8 of the aforementioned paper by Beattie et al. (2016).
Additional information on the 18 parent levels (and their relation to the other levels) is presented in detail in the NeurIPS Workshop talk A Methodology for RL Environment Research by Daniel Tanis.
In total, we collected data for 255 levels from the DeepMind Lab (18 parent levels and 237 handcrafted levels), 254 of which were used while training Gato. The remaining level was used for out of distribution evaluation.

F.6 Procgen Benchmark
Procgen (Cobbe et al., 2020) is a suite of 16 procedurally generated Atari-like environments, which was proposed to benchmark sample efficiency and generalization in reinforcement learning. Data collection was done while training an R2D2 (Kapturowski et al., 2018) agent on each of the environments. We used the hard difficulty setting for all environments except for maze and heist, which we set to easy.

F.7 Modular RL
Modular RL (Huang et al., 2020) is a collection of MuJoCo (Todorov et al., 2012) based continuous control environments, composed of three sets of variants of the OpenAI Gym (Brockman et al., 2016) Walker2d-v2, Humanoid-v2, and Hopper-v2. Each variant is a morphological modification of the original body: the set of morphologies is generated by enumerating all possible subsets of limbs, and keeping only those sets that a) contain the torso, and b) still form a connected graph. This results in a set of variants with different input and output sizes, as well as different dynamics than the original morphologies. We collected data by training a single morphology-specific D4PG agent on each variant for a total of 140M actor steps; this was done for 30 random seeds per variant.

F.8 DeepMind Manipulation Playground
The DeepMind Manipulation Playground (Zolna et al., 2021) is a suite of MuJoCo based simulated robot tasks. We collect data for 4 of the Jaco tasks (box, stack banana, insertion, and slide) using a Critic-Regularized Regression (CRR) (Wang et al., 2020) agent trained from images on human demonstrations. The collected data includes the MuJoCo physics state, which we use for training and evaluating Gato.

F.9 Meta-World
Meta-World (Yu et al., 2020) is a suite of environments for benchmarking meta-reinforcement learning and multi-task learning. We collect data from all train and test tasks in the MT50 mode by training an MPO (Abdolmaleki et al., 2018) agent with unlimited environment seeds and with access to the state of the MuJoCo physics engine. The collected data also contains the MuJoCo physics engine state.

G Real robotics evaluation details
In the real world, control is asynchronous; physics does not wait for computations to finish. Thus, inference latency is a concern when evaluating a large model on real world tasks. In robotics, a fast control rate is thought to be critical for reacting to dynamic phenomena. The robot setup for RGB stacking has a 20Hz control rate (0.05 second timestep) by design. In order to reach an acceptable margin of latency, we modified inference at evaluation time by shortening the context length to 1.
We also implemented a parallel sampling scheme where all the action tokens are zeroed out in the input sequences during training, so that we can sample all tokens corresponding to a robot action in a single model inference step instead of autoregressively as is done in other domains. We use the sparse reward function described in Lee et al. (2021) for data filtering. We only select trajectories with final task success; that is, a sparse reward of 1 on the final timestep.

H Skill Mastery architecture
The numbers reported for the Skill Mastery benchmark were collected by executing, zero-shot, a model that used an earlier version of the Gato architecture. Instead of the ResNet patch embedding, a similar architecture using a local transformer was used to embed image patch tokens. The local position embeddings and patch position embeddings were not used. These changes were implemented and found to improve Gato's performance after the pretraining data was changed (as we decided to focus on Skill Generalization instead of the Skill Mastery challenge), which is why they are presented as the final architecture of our full model.

I Additional robotics ablations
We conducted a series of ablations in simulation to better understand the effect of diverse pretraining data in the robotics domain (see Figure 19). We included the same baselines as in Section 5.2, selecting the 364M parameter size variant, as well as an additional baseline trained with control suite data only. The DM Control-only agent is superior to the base Gato at zero-shot transfer and with a lot of fine-tuning data, suggesting that Gato may not be using the representations learned from the text-based datasets when adapting to robotics tasks. The same-domain-only agent performs the best overall, matching the CRR baseline at 1 fine-tuning episode and outperforming it with more data, suggesting that Gato at its current scale can trade its generalization capacity for data-efficient and effective few-shot adaptation.

J Attention visualization
To render the transformer attention weights, we retrieved the attention logits, a tensor with dimension (H, T, T) where H is the number of heads and T is the number of tokens in a sequence. The (h, i, j)th entry of this tensor can be interpreted as the amount that head h attends to token j from token i. Due to Gato's image tokenization scheme, there are multiple tokens per timestep. Therefore, to visualize the attention for a particular timestep, we took the sub-matrix that corresponds to that timestep. We then applied a softmax over the rows of this matrix to normalize the relevant values. Because we are only interested in attention to previous tokens, we excluded the diagonal by setting it to negative infinity before the softmax.
To measure the importance of each patch, we averaged the attention weights over the corresponding column. Because Gato uses a causal transformer, the attention matrix is lower triangular, so the mean was only taken over the sub-column below the diagonal of the matrix. This corresponds to the average attention paid to a particular patch over a whole timestep.
Using this method, we found that the attention maps at the first layer of the transformer are the most interpretable, agreeing with the findings of Abnar & Zuidema (2020). Certain heads clearly track task-specific entities and regions of the image. Figure 20 shows the attention maps for manually selected heads at the first layer for several tasks.

K Detailed results for specialist Meta-World agent
The specialist Meta-World agent described in Section 5.5 achieves a 96.6% success rate averaged over all 50 Meta-World tasks. The detailed success rates are presented in Table 7. We evaluated the agent 500 times for each task.

L Per-domain results for Gato
We describe the performance of Gato for simulated control tasks in Section 4.1. In Table 8, we present normalized per-domain results. We evaluated the agent 50 times for each task.

This paper is available on arxiv under the CC by 4.0 Deed (Attribution 4.0 International) license.
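The attention-map procedure of Section J (take the sub-matrix for one timestep, mask the diagonal and everything above it, apply a row-wise softmax, then average each column below the diagonal) can be sketched as follows. This is a minimal pure-Python illustration for a single head, under stated assumptions: the function name `patch_importance`, the `start`/`end` token range for a timestep, and the convention of returning 0.0 for tokens that no later token can attend to are hypothetical choices, not details from the paper.

```python
import math


def patch_importance(logits, start, end):
    """Average attention received by each token of one timestep.

    logits: T x T nested list of attention logits for a single head,
            where logits[i][j] scores how much token i attends to token j.
    start, end: half-open token range [start, end) of the timestep (assumed).
    """
    n = end - start
    # Sub-matrix of the attention logits for this timestep.
    sub = [[logits[start + i][start + j] for j in range(n)] for i in range(n)]

    weights = []
    for i, row in enumerate(sub):
        # Keep only strictly previous tokens (j < i): the causal mask removes
        # j > i, and the diagonal is excluded as described in the text.
        allowed = [row[j] for j in range(i)]
        if not allowed:
            weights.append([0.0] * n)  # the first token has nothing to attend to
            continue
        peak = max(allowed)  # subtract the max for numerical stability
        exps = [math.exp(v - peak) for v in allowed]
        total = sum(exps)
        weights.append([e / total for e in exps] + [0.0] * (n - i))

    importance = []
    for j in range(n):
        # Mean over the sub-column strictly below the diagonal.
        below = [weights[i][j] for i in range(j + 1, n)]
        importance.append(sum(below) / len(below) if below else 0.0)
    return importance
```

For a tensor of shape (H, T, T), the same function would be applied to each head separately, and the resulting per-token values for the image patch tokens can then be reshaped to the patch grid for display.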