DeepMind's Gato Shows How a Single AI Can Learn Everything at Once

Authors: Scott Reed, Konrad Żołna, Emilio Parisotto, Sergio Gómez Colmenarejo, Alexander Novikov, Gabriel Barth-Maron, Mai Giménez, Yury Sulsky, Jackie Kay, Jost Tobias Springenberg, Tom Eccles, Jake Bruce, Ali Razavi, Ashley Edwards, Nicolas Heess, Yutian Chen, Raia Hadsell, Oriol Vinyals, Mahyar Bordbar, Nando de Freitas

Abstract

Inspired by progress in large-scale language modeling, we apply a similar approach towards building a single generalist agent beyond the realm of text outputs. The agent, which we refer to as Gato, works as a multi-modal, multi-task, multi-embodiment generalist policy. The same network with the same weights can play Atari, caption images, chat, stack blocks with a real robot arm and much more, deciding based on its context whether to output text, joint torques, button presses, or other tokens.

1 Introduction

There are significant benefits to using a single neural sequence model across all tasks. It reduces the need for hand-crafted policy models with appropriate inductive biases for each domain. It increases the amount and diversity of training data, since the sequence model can ingest any data that can be serialized into a flat sequence. Historically, generic models that are better at leveraging computation have also tended to eventually overtake more specialized, domain-specific approaches (Kaplan et al., 2020; Hoffmann et al., 2022; Sutton, 2019).

In this paper, we describe the current iteration of a general-purpose agent which we call Gato, instantiated as a single, large, transformer sequence model. With a single set of weights, Gato can engage in dialogue, caption images, stack blocks with a real robot arm, outperform humans at playing Atari games, navigate in simulated 3D environments, follow instructions, and more.

While no agent can be expected to excel in all imaginable control tasks, especially those far outside of its training distribution, we test here the hypothesis that it is possible to train an agent that is generally capable on a large number of tasks. We hypothesize that such an agent can be obtained through scaling data, compute and model parameters, continually broadening the training distribution while maintaining performance, towards covering any task, behavior and embodiment of interest. In this setting, natural language can act as a common grounding across otherwise incompatible embodiments, unlocking combinatorial generalization to new behaviors.

We focus our training at the operating point of model scale that allows real-time control of real-world robots, currently around 1.2B parameters in the case of Gato. As hardware and model architectures improve, this operating point will naturally increase the feasible model size, pushing generalist models higher up the scaling law curve. For simplicity, Gato was trained offline in a purely supervised manner; in principle, however, there is no reason it could not also be trained with either offline or online reinforcement learning (RL).

2 Model

The guiding design principle of Gato is to train on the widest variety of relevant data possible, including diverse modalities such as images, text, proprioception, joint torques, button presses, and other discrete and continuous observations and actions. To enable processing this multi-modal data, we serialize all data into a flat sequence of tokens.
In this representation, Gato can be trained and sampled from in the same way as a standard large-scale language model. During deployment, sampled tokens are assembled into dialogue responses, captions, button presses, or other actions based on the context.

2.1 Tokenization

There are infinitely many possible ways to transform data into tokens, including directly using the raw underlying byte stream. Below we report the tokenization scheme we found to produce the best results for Gato at the current scale using contemporary hardware and model architectures.

• Text is encoded via SentencePiece (Kudo & Richardson, 2018) with 32000 subwords into the integer range [0, 32000).
• Images are first transformed into sequences of non-overlapping 16 × 16 patches in raster order, as done in ViT (Dosovitskiy et al., 2020). Each pixel in the image patches is then normalized between [−1, 1] and divided by the square root of the patch size (i.e. √16 = 4).
• Discrete values, e.g. Atari button presses, are flattened into sequences of integers in row-major order. The tokenized result is a sequence of integers within the range [0, 1024).
• Continuous values, e.g. proprioceptive inputs or joint torques, are first flattened into sequences of floating point values in row-major order. The values are encoded to the range [−1, 1] if not already there (see Figure 14 for details), then discretized to 1024 uniform bins. The discrete integers are then shifted to the range [32000, 33024).

After converting data into tokens, we use the following canonical sequence ordering.

• Text tokens in the same order as the raw input text.
• Image patch tokens in raster order.
• Tensors in row-major order.
• Nested structures in lexicographical order by key.
• Agent timesteps as observation tokens followed by a separator, then action tokens.
• Agent episodes as timesteps in time order.
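The handling of continuous values in Section 2.1 can be sketched in a few lines. This is a minimal illustration, not the authors' implementation: the function names are hypothetical, the companding curve is a mu-law style function whose constants (`mu=100`, `m=256`) are assumptions made here (the text only points to Figure 14), and the clipping and bin-edge handling are illustrative choices.

```python
import math

TEXT_VOCAB_SIZE = 32000  # SentencePiece subwords occupy [0, 32000)
NUM_BINS = 1024          # uniform bins used for continuous values


def mu_law_encode(x, mu=100.0, m=256.0):
    """Compand a real value towards [-1, 1].

    mu and m are assumed, illustrative constants; the source text only
    says values are "encoded to the range [-1, 1]".
    """
    magnitude = math.log(abs(x) * mu + 1.0) / math.log(m * mu + 1.0)
    return math.copysign(magnitude, x)


def tokenize_continuous(values, already_normalized=False):
    """Map floats to integer tokens in [32000, 33024)."""
    tokens = []
    for v in values:
        if not already_normalized:
            v = mu_law_encode(v)
        v = max(-1.0, min(1.0, v))  # clip anything still outside [-1, 1]
        # Uniform bins over [-1, 1]; the right edge falls into the last bin.
        bin_index = min(int((v + 1.0) / 2.0 * NUM_BINS), NUM_BINS - 1)
        tokens.append(TEXT_VOCAB_SIZE + bin_index)
    return tokens


def detokenize_continuous(tokens):
    """Map tokens back to bin centres in [-1, 1] (before inverse companding)."""
    return [((t - TEXT_VOCAB_SIZE) + 0.5) / NUM_BINS * 2.0 - 1.0 for t in tokens]
```

Shifting by 32000 keeps continuous-value tokens disjoint from the text vocabulary, so a single output head can emit either kind of token. Decoding returns bin centres, so the round trip is exact only up to half a bin width.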
Further details on tokenizing agent data are presented in the supplementary material (Section B).

2.2 Embedding input tokens and setting output targets

After tokenization and sequencing, we apply a parameterized embedding function $f(\cdot\,; \theta_e)$ to each token (i.e. it is applied to both observations and actions) to produce the final model input.

• Tokens belonging to text, discrete- or continuous-valued observations or actions for any time-step are embedded via a lookup table into a learned vector embedding space.
• Tokens belonging to image patches for any time-step are embedded using a single ResNet (He et al., 2016a) block. For image patch token embeddings, we also add a learnable within-image position encoding vector.

We refer to appendix Section C.3 for full details on the embedding function.

As we model the data autoregressively, each token is potentially also a target label given the previous tokens. Text tokens, discrete and continuous values, and actions can be directly set as targets after tokenization. Image tokens and agent observations are not currently predicted in Gato, although that may be an interesting direction for future work. Targets for these non-predicted tokens are set to an unused value and their contribution to the loss is masked out.

2.3 Training

Given a sequence of tokens $s_{1:L}$ and parameters $\theta$, we model the data using the chain rule of probability:

$$\log p_\theta(s_1, \dots, s_L) = \sum_{l=1}^{L} \log p_\theta(s_l \mid s_1, \dots, s_{l-1}).$$

Let $b$ index a sequence in a training batch. We define a masking function $m$ such that $m(b, l) = 1$ if the token at index $l$ is either from text or from the logged action of an agent, and 0 otherwise.

As described above, Gato's network architecture has two main components: the parameterized embedding function, which transforms tokens to token embeddings, and the sequence model, which outputs a distribution over the next discrete token. We chose a transformer (Vaswani et al., 2017) as the sequence model for simplicity and scalability.
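To make the role of the masking function concrete, the sketch below computes a masked negative log-likelihood over a batch. It assumes the loss sums $-m(b, l) \log p_\theta(s_l \mid s_{<l})$ over all sequences $b$ and positions $l$, which is one natural reading of the definitions in Section 2.3. The function name and the nested-list layout of the inputs are hypothetical choices made for illustration.

```python
import math


def masked_nll(log_probs, targets, mask):
    """Masked negative log-likelihood for a batch of token sequences.

    log_probs[b][l][v]: model log-probability of vocabulary entry v at
                        position l of sequence b (given earlier tokens).
    targets[b][l]:      index of the token actually observed at position l.
    mask[b][l]:         1 for text tokens and logged agent actions, else 0.
    Summation (rather than averaging) is an assumption of this sketch.
    """
    total = 0.0
    for seq_log_probs, seq_targets, seq_mask in zip(log_probs, targets, mask):
        for position_log_probs, target, m in zip(seq_log_probs, seq_targets, seq_mask):
            if m:
                total -= position_log_probs[target]
    return total


# Usage: one sequence of two positions; only the second (an action) is a target.
example_log_probs = [[[math.log(0.5), math.log(0.5)],
                      [math.log(0.25), math.log(0.75)]]]
example_loss = masked_nll(example_log_probs, targets=[[0, 1]], mask=[[0, 1]])
```

Positions with a zero mask (for example image patch tokens) still appear in the model's context, but they contribute nothing to the loss.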
Gato uses a 1.2B parameter decoder-only transformer with 24 layers, an embedding size of 2048, and a post-attention feedforward hidden size of 8196 (more details in Section C.1).

Because distinct tasks within a domain can share identical embodiments, observation formats and action specifications, the model sometimes needs further context to disambiguate tasks. We therefore use prompt conditioning (Sanh et al., 2022; Wei et al., 2021; Brown et al., 2020). During training, for 25% of the sequences in each batch, a prompt sequence is prepended, coming from an episode generated by the same source agent on the same task. Half of the prompt sequences are from the end of the episode, acting as a form of goal conditioning for many domains; the other half are uniformly sampled from the episode. During evaluation, the agent can be prompted using a successful demonstration of the desired task, which we do by default in all control results that we present here.

Training of the model is performed on a 16x16 TPU v3 slice for 1M steps with batch size 512 and token sequence length L = 1024, which takes about 4 days. Architecture details can be found in Section C. Because agent episodes and documents can easily contain many more tokens than fit into context, we randomly sample subsequences of L tokens. Each batch mixes subsequences approximately uniformly over domains (e.g. Atari, MassiveWeb, etc.), with some manual upweighting of larger and higher quality datasets (see Table 1 in Section 3 for details).

2.4 Deployment

Deploying Gato as a policy is illustrated in Figure 3. First a prompt, such as a demonstration, is tokenized, forming the initial sequence. By default, we take the first 1024 tokens of the demonstration. Next the environment yields the first observation, which is tokenized and appended to the sequence. Gato samples the action vector autoregressively one token at a time.
Once all tokens comprising the action vector have been sampled (determined by the action specification of the environment), the action is decoded by inverting the tokenization procedure described in Section 2.1. This action is sent to the environment, which steps and yields a new observation. The procedure repeats. The model always sees all previous observations and actions in its context window of 1024 tokens. We found it beneficial to use transformer XL memory during deployment (Dai et al., 2019), although it was not used during training.

3 Datasets

Gato is trained on a large number of datasets comprising agent experience in both simulated and real world environments, as well as a variety of natural language and image datasets. The approximate number of tokens per control dataset is computed assuming the tokenization mechanism described in Section 2.1.

3.1 Simulated control tasks

Our control tasks consist of datasets generated by specialist SoTA or near-SoTA reinforcement learning agents trained on a variety of different environments. For each environment we record a subset of the experience the agent generates (states, actions, and rewards) while it is training.

The simulated environments include Meta-World (Yu et al., 2020), introduced to benchmark meta-reinforcement learning and multi-task learning; Sokoban (Racanière et al., 2017), proposed as a planning problem; BabyAI (Chevalier-Boisvert et al., 2018), for language instruction following in grid-worlds; the DM Control Suite (Tunyasuvunakool et al., 2020), for continuous control; and DM Lab (Beattie et al., 2016), designed to teach agents navigation and 3D vision from raw pixels with an egocentric viewpoint. We also include classic Atari games from the Arcade Learning Environment (Bellemare et al., 2013); we use two sets of games that we call ALE Atari and ALE Atari Extended, see Section F.1 for details.
We additionally include the Procgen Benchmark (Cobbe et al., 2020) and Modular RL (Huang et al., 2020). We also include four tasks using a simulated Kinova Jaco arm from DM Manipulation Playground, as introduced in Zolna et al. (2020). Section F includes a more in-depth description of these control tasks, along with what RL agent was used to generate the data.

We found it effective to train on a filtered set of episodes with returns at least 80% of the expert return for the task. The expert return measures the maximum sustained performance that the expert agent can achieve. We define it as the maximum over the set of all windowed average returns calculated over all the collected episodes for a task:

$$\max_{j \in \{0, 1, \dots, N-W\}} \left( \frac{1}{W} \sum_{i=j}^{j+W-1} R_i \right)$$

where $N$ is the total number of collected episodes for the task, $W$ is the window size, and $R_i$ is the total return for episode $i$. To obtain accurate estimates, in practice, we set $W$ to be 10% of the total data amount or a minimum of 1000 episodes (i.e. $W = \min(1000, 0.1 \times N)$).

3.2 Vision and language

Gato is trained on MassiveText (Rae et al., 2021), a collection of large English-language text datasets from multiple sources: web pages, books, news articles, and code.

We also included several vision-language datasets in Gato's training. ALIGN (Jia et al., 2021) consists of 1.8B images and their alternative text (alt-text) annotations. LTIP (Long Text & Image Pairs) consists of 312 million images with captions (Alayrac et al., 2022). Conceptual Captions (Sharma et al., 2018) and COCO Captions (Chen et al., 2015) are captioning datasets with 3.3M and 120k image-text pairs respectively. The MultiModal MassiveWeb (M3W) dataset (Alayrac et al., 2022) includes 43M webpages where both text and images were extracted. We also included visual question-answering datasets, in particular OKVQA (Marino et al., 2019) and VQAv2 (Antol et al., 2015), with 9K and 443K triplets of images, questions, and answers.
To form a training episode from these, we sample five (image, text) pairs, tokenize them, concatenate, and then pad or randomly crop to the required training sequence length.

3.3 Robotics - RGB Stacking Benchmark (real and sim)

As a testbed for taking physical actions in the real world, we chose the robotic block stacking environment introduced by Lee et al. (2021). The environment consists of a Sawyer robot arm with 3-DoF cartesian velocity control, an additional DoF for velocity, and a discrete gripper action. The robot's workspace contains three plastic blocks colored red, green and blue with varying shapes. The available observations include two 128 × 128 camera images, robot arm and gripper joint angles, as well as the robot's end-effector pose.

In Skill Generalization, for both simulation and real, we use data collected by the best generalist sim2real agent from Lee et al. (2021). We collected data only when interacting with the designated RGB-stacking training objects (this amounts to a total of 387k successful trajectories in simulation and 15k trajectories in real). For Skill Mastery (Section 5.4), we used data from Lee et al. (2021) in simulation and from the best sim2real policy on the real robot (amounting to a total of 219k trajectories).

4 Capabilities of the generalist agent

In this section, we summarize the performance of Gato when trained on the data described above. That is, all results across all tasks are derived from a single pretrained model with a single set of weights.

4.1 Simulated control tasks

Figure 5 shows the number of distinct control tasks for which Gato performs above a given score threshold, relative to expert performance demonstrated in Gato's training data.
We report performance as a percentage, where 100% corresponds to the per-task expert and 0% to a random policy. For each simulated control task we trained our model on, we roll out the Gato policy on the corresponding environment 50 times and average the defined scores. As shown in Figure 5, Gato performs over 450 out of 604 tasks at over a 50% expert score threshold.

In ALE Atari (Bellemare et al., 2013), Gato achieves the average human (or better) score on 23 Atari games, achieving over twice the human score for 11 games. While the single-task online RL agents which generated the data still outperform Gato, this may be overcome by adding capacity or using offline RL training rather than purely supervised training (see Section 5.5, where we present a specialist single-domain ALE Atari agent achieving better than human scores for 44 games).

On BabyAI (Chevalier-Boisvert et al., 2018), Gato achieves over 80% of expert score for nearly all levels. For the most difficult task, called BossLevel, Gato scores 75%. The two other published baselines we could find, BabyAI 1.0 and BabyAI 1.1 (Hui et al., 2020), scored 77% and 90%, respectively, having trained on this single task alone using a million demonstrations.

On Meta-World (Yu et al., 2020), Gato achieves more than 50% on 44 of the 45 tasks that we trained on, over 80% for 35 tasks, and over 90% for 3 tasks. On the canonical DM Control Suite (Tassa et al., 2018), Gato achieves better than 50% of the expert score on 21 out of 30 tasks from state, and more than 80% for 18 tasks.

4.2 Robotics

First person teleoperation enables the collection of expert demonstrations. However, such demonstrations are slow and costly to collect. Data-efficient behavior cloning methods are therefore desirable for training a generalist robot manipulator, and offline pretraining is thus a well-motivated area of research. To that end, we evaluated Gato on the established RGB Stacking benchmark for robotics.
Skill Generalization Performance

The Skill Generalization challenge from the RGB Stacking robotics benchmark tests the agent's ability to stack objects of previously unseen shapes. The agent is trained on a dataset consisting of episodes of the robot stacking objects with a variety of different shapes. Five triplets of object shapes are, however, not included in the training data and serve as test triplets. We evaluated the trained generalist for 200 episodes per test triplet on the real robot. Table 2 shows that our generalist agent's success rate on each test triplet is comparable to the single task BC-IMP (filtered BC) baseline in Lee et al. (2021).

4.3 Text samples

The model demonstrates rudimentary dialogue and image captioning capabilities. Figure 6 contains a representative sample of Gato's image captioning performance. Figure 7 shows some hand-picked examples of plain text dialogue exchange.

5 Analysis

5.1 Scaling Laws Analysis

In Figure 8, we analyze the aggregate in-distribution performance of the pretrained model as a function of the number of parameters, in order to get insight into how performance could improve with increased model capacity. We evaluated 3 different model sizes (measured in parameter count): a 79M model, a 364M model, and a 1.18B model (Gato). We refer to Section C for details on the three model architectures.

Here, for all three model sizes we plot the normalized return as training progresses. To get this single value, for each task we calculate the performance of the model as a percentage of expert score (the same as done in Section 4.1). Then for each domain listed in Table 1 we average the percentage scores across all tasks for that domain. Finally, we mean-aggregate the percentage scores across all domains. We can see that for an equivalent token count, there is a significant performance improvement with increased scale.
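The aggregation described above (a per-task percentage of expert score, averaged within each domain, then averaged across domains) can be sketched as follows. This is a minimal illustration under stated assumptions: the function names and the dictionary layout are hypothetical, and the linear rescaling simply applies the convention that 100% corresponds to the per-task expert and 0% to a random policy.

```python
def percent_of_expert(score, random_score, expert_score):
    """Rescale a raw score so that random policy = 0% and expert = 100%."""
    return 100.0 * (score - random_score) / (expert_score - random_score)


def aggregate_normalized_return(results):
    """Mean over domains of the mean per-task percentage score.

    results maps a domain name to a list of
    (score, random_score, expert_score) tuples, one tuple per task.
    """
    domain_means = []
    for tasks in results.values():
        percentages = [percent_of_expert(*task) for task in tasks]
        domain_means.append(sum(percentages) / len(percentages))
    return sum(domain_means) / len(domain_means)


# Usage with made-up numbers (not results from the paper):
toy_results = {
    "domain_a": [(5.0, 0.0, 10.0), (10.0, 0.0, 10.0)],  # 50% and 100%
    "domain_b": [(2.0, 1.0, 3.0)],                      # 50%
}
toy_aggregate = aggregate_normalized_return(toy_results)
```

Averaging within domains first means that a domain with many tasks does not dominate the final number.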
5.2 Out of distribution tasks

In this section we want to answer the following question: Can our agent be used to solve a completely new task efficiently? For this reason, we held out all data for four tasks from our pretraining set: cartpole.swingup (DM Control Suite domain), assembly-v2 (Meta-World domain), order_of_apples_forage_simple (DM Lab domain), and boxing (ALE Atari domain).

Ideally, the agent could learn to adapt to a new task via conditioning on a prompt including demonstrations of desired behaviour. However, due to accelerator memory constraints and the extremely long sequence lengths of tokenized demonstrations, the maximum context length possible does not allow the agent to attend over an informative-enough context. Therefore, to adapt the agent to new tasks or behaviours, we choose to fine-tune the agent's parameters on a limited number of demonstrations of a single task, and then evaluate the fine-tuned model's performance in the environment. Fine-tuning is very similar to pretraining with minor changes, such as a different learning rate schedule; see Section E for details.

We want to measure how the choice of data used during pretraining influences post-fine-tuning performance. To this end, we compare Gato (trained on all data) to variants trained on ablated datasets:

1. A model pretrained only on data from the same domain as the task to be fine-tuned on (same domain only data).
2. A model pretrained only on non-control data (no control data).
3. A model fine-tuned from scratch, i.e. no pretraining at all (scratch).

Because all these experiments require training a new model from scratch and then also fine-tuning, we present results using the less compute-intensive 364M parameter architecture described in Section 5.1. Results are shown in Figure 9.

Fine-tuning performance on the cartpole.swingup and assembly-v2 tasks, neither of which requires image processing, shows similar trends.
Pretraining on all the datasets yields the best results, followed by pretraining on the same domain only. This difference is smaller for assembly-v2 but consistent for all few-shot datasets. For these non-image-based environments, we see either no benefit (cartpole.swingup) or even negative transfer (assembly-v2) when pretraining on the no control datasets, which only contain images and text data.

Results for DM Lab order_of_apples_forage_simple are slightly different. Pretraining on DM Lab data only is already enough to approach the maximum reward of 19, and hence there is no observable benefit from adding data from different environments. What is different compared to the previously analysed no-vision environments is that pretraining on no control data helps, which can possibly be explained by the fact that agents in the DM Lab environment are fed images which, despite being simulated, look natural.

We were not able to observe any benefit from pretraining on boxing. The randomly initialized model seems to work better than any of the pretrained variants considered. We hypothesise that this is caused by the game's input images being visually very distinct from the other data, suggesting transfer is difficult. We discuss this Atari challenge further in our related work section.

5.3 Fine-tuning on Robotic Stacking Tasks

Section 4.2 demonstrates that the base Gato, capable of a diverse array of tasks, can perform competitively on the RGB Stacking Skill Generalization benchmark. In this section, we would like to answer the following question: How does our agent improve on robotics tasks when allowed to fine-tune similarly to how we fine-tune on new tasks in Section 5.2? We consider different model sizes and analyse the impact of pretraining datasets on the Skill Generalization benchmark, as well as a novel out of distribution task. Further analysis of fine-tuning with dataset ablations is in Appendix I.
Skill Generalization

First, we would like to show that fine-tuning on object-specific data, similarly to what was done by Lee et al. (2022), is beneficial. Therefore, we fine-tuned Gato separately on five subsets of demonstrations from the test dataset. Each subset was obtained by random partitioning of a test dataset consisting of demonstrations gathered by a generalist sim-to-real agent stacking real test objects. We consider this setting, which is comparable to the fine-tuning baselines on RGB stacking tasks from Lee et al. (2022), and use the 5k dataset that their behavior cloning 5k results are obtained with. To best match their experiments, we change our return filtering scheme during training: instead of using only successful stacks, we condition on the normalized return of the episode.

Figure 10 compares the success rate of Gato across different fine-tuning data regimes to the sim-to-real expert and a Critic-Regularized Regression (CRR) (Wang et al., 2020) agent trained on 35k episodes of all test triplets. Gato, in both reality and simulation (red curves on the left and right figure, respectively), recovers the expert's performance with only 10 episodes, and peaks at 100 or 1000 episodes of fine-tuning data, where it exceeds the expert. After this point (at 5000), performance degrades slightly but does not drop far below the expert's performance.

Fine-tuning and Model Size

To better understand the benefit of large models for adaptation in robotics domains, we conducted an ablation on model parameter size. This section focuses on in-simulation evaluation. Figure 10 compares the full 1.18B parameter Gato with the smaller 364M and 79M parameter variants for varying amounts of fine-tuning data. Although the 364M model overfits on one episode, causing performance to drop, there is a clear trend towards better adaptation with fewer episodes as the number of parameters is scaled up. The 79M model performs clearly worse than its bigger counterparts.
The results suggest that the model's greater capacity allows the model to use representations learned from the diverse training data at test time.

Adaptation to Perceptual Variations

While the Skill Generalization task is an effective benchmark for motor Skill Generalization to shape variations, it does not test the agent's ability to adapt to perceptual variations and permutations in the objective specification. To further evaluate Gato's generalization capabilities, we devised a new task in the RGB stacking benchmark where the goal is to stack the blue object on the green object, for test triplet 1 (see Figure 11). First, we used a 3D mouse to collect 500 demonstrations of this task on the real robot, for a total of 2 hours and 45 minutes of demonstration data, and fine-tuned Gato on these episodes. Notably, all of the simulated and real robotics data in the pretraining set shows the robot successfully stacking the red object on the blue object, and the data does not include the object shapes in the test set. We found that additionally adding simulated demonstrations of the stack blue on green task to the fine-tuning dataset improved performance, and 10% was an ideal sampling ratio for this data.

We achieved a final 60% success rate after evaluating fine-tuned Gato on the real robot, while a BC baseline trained from scratch on the blue-on-green data achieved only 0.5% success (1/200 episodes). Qualitatively, the BC baseline would consistently move towards the blue object and occasionally pick it up and place it on top of the green object, but a full, stable stack was almost never achieved.

5.4 Robotics: Skill Mastery

Similarly to the Skill Generalization challenge discussed in Section 4.2, the Skill Mastery challenge consists in training a robotic arm to stack blocks of different shapes. However, the Skill Mastery allows the agent to train on data involving the object shapes used for evaluation, i.e.
the test set in Skill Generalization becomes a part of the Skill Mastery training set. Thus, this challenge serves to measure Gato's performance on in-distribution tasks (possibly with initial conditions not seen in the training demonstrations). Our Skill Mastery results use an earlier version of the Gato architecture described in Appendix H, with no fine-tuning.

Table 3 compares the group-wise success percentage and the average success across object groups for Gato and the established BC-IMP baseline. Gato exceeds or closely matches BC-IMP's performance on all but one training triplet.

5.5 Specialist single-domain multi-task agents

In this section we show results obtained with two specialist (rather than generalist) agents. Both of them were trained on data from a single domain only and rolled out 500 times for each training task without any per-task fine-tuning.

Meta-World

The first agent uses the smallest architecture introduced in Section 5.1, i.e. 79M parameters, and is trained on all 50 Meta-World tasks. While Gato has access to the state of the MuJoCo physics engine and unlimited task seeds, the agent presented here has no access to any extra features or tasks and uses the canonical API as in (Yu et al., 2020). This experiment is to show that the architecture proposed in our paper can be used to obtain state-of-the-art agents also at small scale. The training procedure was to train single-task MPO (Abdolmaleki et al., 2018) experts on each of the MT-50 tasks individually, recording the trajectories produced while training. This experience is then combined, or distilled, into a single agent, which achieves a 96.6% success rate averaged over all 50 tasks. To the best of our knowledge, this agent is the first one to accomplish a nearly 100% average success rate simultaneously (multi-task) for this benchmark. See Table 7 in the supplementary material (Section K) for the full list of tasks and the corresponding success rates of our agent.
ALE Atari

We also trained a specialist agent on all 51 ALE Atari tasks. As the Atari domain is much more challenging than Meta-World, we used the Gato architecture with 1.18B parameters.

The resulting agent performs better than the average human for 44 games (see Section 4.1 for details on our evaluation and scoring). We want to note that the performance of the online experts used to generate training data for the other 7 games was also below the average human.

The specialist Atari agent outperforms our generalist agent Gato, which achieved super-human performance on 23 games. It suggests that scaling Gato may result in even better performance. We, however, purposely restricted Gato's size such that it can be run in real-time on the real robot.

5.6 Attention Analysis

We rendered the transformer attention weights over the image observations for various tasks, to gain a qualitative sense of how Gato attends to different regions of the image across tasks (see Figure 12). Further details and visualizations for more tasks can be found in Appendix J. These visualizations clearly show that attention tracks the task-relevant objects and regions.

5.7 Embedding Visualization

To understand how Gato encodes information differently per task, we visualized per-task embeddings. We analysed 11 tasks. For each task, we randomly sample 100 episodes and tokenize each of them. Then, from each episode we take a subsequence of 128 tokens, compute their embeddings (at layer 12, which is half the total depth of the transformer layers) and average them over the sequence. The averaged embeddings for all tasks are used as input to PCA, which reduces their dimensionality to 50. Then, T-SNE is used to get the final 2D embeddings.

Figure 13 shows the final T-SNE embeddings plotted in 2D, colorized by task. Embeddings from the same tasks are clearly clustered together, and task clusters from the same domain and modality are also located close to each other.
Even the held-out task (cartpole.swingup) is correctly clustered and lies next to another task from DM Control Suite Pixels.

6 Related Work

The most closely related architectures to that of Gato are Decision Transformers (Chen et al., 2021b; Reid et al., 2022; Zheng et al., 2022; Furuta et al., 2021) and Trajectory Transformer (Janner et al., 2021), which showed the usefulness of highly generic LM-like architectures for a variety of control problems. Gato also uses an LM-like architecture for control, but with design differences chosen to support multi-modality, multi-embodiment, large scale and general purpose deployment. Pix2Seq (Chen et al., 2022) also uses an LM-based architecture for object detection. Perceiver IO (Jaegle et al., 2021) uses a transformer-derived architecture specialized for very long sequences, to model any modality as a sequence of bytes. This and similar architectures could be used to expand the range of modalities supported by future generalist models.

Gato was inspired by works such as GPT-3 (Brown et al., 2020) and Gopher (Rae et al., 2021), which push the limits of generalist language models, and more recently the Flamingo (Alayrac et al., 2022) generalist visual language model. Chowdhery et al. (2022) developed the 540B parameter Pathways Language Model (PaLM) explicitly as a generalist few-shot learner for hundreds of text tasks.

Future work should consider how to unify these text capabilities into one fully generalist agent that can also act in real time in the real world, in diverse environments and embodiments.

Gato also takes inspiration from recent works on multi-embodiment continuous control. Huang et al. (2020) built a single locomotor controller for many simulated 2D walker variants. Kurin et al. (2020) showed that transformers can outperform graph based approaches for incompatible (i.e. varying embodiment) control, despite not encoding any morphological inductive biases.
learn a modular policy for multi-task and multi-robot transfer in simulated 2D manipulation environments. train a universal policy conditioned on a vector representation of robot hardware, showing successful transfer both to simulated held out robot arms, and to a real world sawyer robot arm. Huang et al. (2020) Kurin et al. (2020) Devin et al. (2017) Chen et al. (2018) A variety of earlier generalist models have been developed that, like Gato, operate across highly distinct domains and modalities. NPI trained a single LSTM to execute diverse programs such as sorting an array and adding two numbers, such that the network is able to generalize to larger problem instances than those seen during training. developed the MultiModel that trains jointly on 8 distinct speech, image and text processing tasks including classifica-tion, image captioning and translation. Modality-specific encoders were used to process text, images, audio and categorical data, while the rest of the network parameters are shared across tasks. proposed “ ”, describing a method for the incremental training of an increasingly general problem solver. proposed controllable multi-task language models that can be directed according to language domain, subdomain, entities, relationships between entities, dates, and task-specific behavior. (Reed & De Freitas, 2016) (Hochreiter & Schmidhuber, 1997) Kráľovstvo et al. (2017) Schmidhuber (2018) one big net for everything Keskar et al. (2019) V tejto diskusii je dôležité rozlišovať medzi jednou architektúrou siete s viacerými úlohami a jednou neurálnou sieťou s rovnakou váhou pre všetky úlohy.Niekoľko agentov poplar RL dosahuje dobré výsledky RL s viacerými úlohami v rámci jednotlivých domén, ako sú Atari57 a DMLab Je však oveľa bežnejšie používať rovnakú architektúru politiky a hyperparametre v rámci úloh, ale parametre politiky sa v každej úlohe líšia. 
This is also true of state-of-the-art RL methods applied to board games Moreover, this choice has been adopted by off-line RL benchmarks and recent works on large sequence neural networks for control, including decision transformers and the Trajectory Transformer of In contrast, in this work we learn a single network with the same weights across a diverse set of tasks. (Espeholt et al., 2018; Song et al., 2020; Hessel et al., 2019). (Mnih et al., v roku 2015; Tassa et al., 2018). (Schrittwieser et al., 2020). (Gulcehre et al., 2020; Fú a al. v roku 2020) (Chen et al., 2021b; Reid et al., 2022; Zheng et al., 2022) Ján a al. (2021). Recent position papers advocate for highly generalist models, notably proposing one big net for everything, and on foundation models. However, to our knowledge there has not yet been reported a single generalist trained on hundreds of vision, language and control tasks using modern transformer networks at scale. Schmidhuber (2018) Bommasani et al. (2021) “Single-brain”-style models have interesting connections to neuroscience. famously stated that “ ”. Mountcastle found that columns of neurons in the cortex behave similarly whether associated with vision, hearing or motor control. This has motivated arguments that we may only need one algorithm or model to build intelligence Mountcastle (1978) the processing function of neocortical modules is qualitatively similar in all neocortical regions. Put shortly, there is nothing intrinsically motor about the motor cortex, nor sensory about the sensory cortex (Hawkins & Blakeslee, 2004). Sensory substitution provides another argument for a single model For example, it is possible to build tactile visual aids for blind people as follows. The signal captured by a camera can be sent via an electrode array on the tongue to the brain. The visual cortex learns to process and interpret these tactile signals, endowing the person with some form of “vision”. 
This suggests that, no matter the type of input signal, the same network can process it to useful effect (Bach-y Rita & Kercel, 2003).

Our work is based on deep autoregressive models, which have a long history and can be found in generative models of text, images, video and audio. Combining autoregressive generation with transformers (Vaswani et al., 2017; Devlin et al., 2018) has been of enormous impact in language modelling (Brown et al., 2020; Rae et al., 2021), protein folding (Jumper et al., 2021), vision-language models (Tsimpoukelli et al., 2021; Wang et al., 2021; Alayrac et al., 2022), code generation (Chen et al., 2021c; Li et al., 2022b), dialogue systems with retrieval capabilities (Nakano et al., 2021; Thoppilan et al., 2022), speech recognition (Pratap et al., 2020), neural machine translation (Johnson et al., 2019) and more (Bommasani et al., 2021). Recently researchers have explored task decomposition and grounding with language models (Huang et al., 2022; Ahn et al., 2022).

Li et al. (2022a) construct a control architecture consisting of a sequence tokenizer, a pretrained language model and a task-specific feed-forward network. They apply it to VirtualHome and BabyAI tasks, and find that the inclusion of the pretrained language model improves generalisation to novel tasks. Similarly, Parisi et al. (2022) demonstrate that vision models pretrained with self-supervised learning, especially crop segmentations and momentum contrast (He et al., 2020), can be effectively incorporated into control policies.

As mentioned earlier, transfer in Atari is challenging. Rusu et al. (2016) researched transfer between randomly selected Atari games. They found that Atari is a difficult domain for transfer because of pronounced differences in the visuals, controls and strategy among the different games. Further difficulties that arise when applying behaviour cloning to video games like Atari are discussed by Kanervisto et al. (2020).
There has been great recent interest in data-driven robotics (Cabi et al., 2019; Chen et al., 2021a). However, Bommasani et al. (2021) note that in robotics “the key stumbling block is collecting the right data. Unlike language and vision data, robotics data is neither plentiful nor representative of a sufficiently diverse array of embodiments, tasks, and environments”. Moreover, every time we update the hardware in a robotics lab, we need to collect new data and retrain. We argue that this is precisely why we need a generalist agent that can adapt to new embodiments and learn new tasks with little data.

Generating actions using an autoregressive model can lead to causal “self-delusion” biases when there are confounding variables (Ortega et al., 2021). For example, sampling actions can condition the model to solve the wrong task when multiple tasks share similar observation and action specifications. As explained in Section 2, we use prompt engineering in ambiguous tasks, conditioning our model on a successful demonstration. This screens off confounding variables, reducing self-delusions. Another solution, which we did not explore in this work, is to use counterfactual teaching, where we train a model online using instantaneous expert feedback. We leave this for future investigation.

7 Broader Impact

Although generalist agents are still only an emerging area of research, their potential impact on society calls for a thorough interdisciplinary analysis of their risks and benefits. For the sake of transparency, we document the intended use cases of Gato in the model card in Appendix A. However, the tools for mitigating harms of generalist agents are relatively underdeveloped, and require further research before these agents are deployed.
Since our generalist agent can act as a vision-language model, it inherits similar concerns as discussed in (Weidinger et al., 2021; Bommasani et al., 2021; Rae et al., 2021; Alayrac et al., 2022). In addition, generalist agents can take actions in the physical world, posing new challenges that may require novel mitigation strategies. For example, physical embodiment could lead to users anthropomorphizing the agent, leading to misplaced trust in the case of a malfunctioning system, or be exploitable by bad actors. Additionally, while cross-domain knowledge transfer is often a goal in ML research, it could create unexpected and undesired outcomes if certain behaviors (e.g. arcade game fighting) are transferred to the wrong context. The ethics and safety considerations of knowledge transfer may require substantial new research as generalist systems advance.

Technical AGI safety (Bostrom, 2017) may also become more challenging when considering generalist agents that operate in many embodiments. For this reason, preference learning, uncertainty modeling and value alignment (Russell, 2019) are especially important for the design of human-compatible generalist agents. It may be possible to extend some of the value alignment approaches for language (Ouyang et al., 2022; Kenton et al., 2021) to generalist agents. However, even as technical solutions are developed for value alignment, generalist systems could still have negative societal impacts even with the intervention of well-intentioned designers, due to unforeseen circumstances or limited oversight (Amodei et al., 2016). This limitation underscores the need for a careful design and deployment process that incorporates multiple disciplines and viewpoints.

Understanding how the models process information, and any emergent capabilities, requires significant experimentation.
External retrieval (Borgeaud et al., 2021; Menick et al., 2022; Nakano et al., 2021; Thoppilan et al., 2022) has been shown to improve both interpretability and performance, and hence should be considered in future designs of generalist agents.

Although still at the proof-of-concept stage, the recent progress in generalist models suggests that safety researchers, ethicists, and most importantly, the general public, should consider their risks and benefits. We are not currently deploying Gato to any users, and so anticipate no immediate societal impact. However, given their potential impact, generalist models should be developed thoughtfully and deployed in a way that promotes the health and vitality of humanity.

8 Limitations and Future Work

8.1 RL data collection

Gato is a data-driven approach, as it is derived from imitation learning. While natural language or image datasets are relatively easy to obtain from the web, a web-scale dataset for control tasks is not currently available. This may seem at first to be problematic, especially when scaling Gato to a higher number of parameters.

Offline RL aims at leveraging existing control datasets, and its increasing popularity has already led to the availability of more diverse and larger datasets. Richer environments and simulations are being built (e.g. Metaverse), and increasing numbers of users already interact with them among thousands of already deployed online games (e.g. there exists a large dataset of Starcraft 2 games). Real-life data has also already been stored for ML research purposes; for example, data for training self-driving cars is acquired from recordings of human drivers. Thanks to online video sharing and streaming platforms such as YouTube and Twitch, observation-only datasets are not significantly more difficult to collect than natural language datasets, motivating a future research direction to extend Gato to learn from web data (Baker et al., 2022).
While the previous paragraph focuses on alleviating drawbacks of data collection from RL agents, it is important to note that this approach presents a different set of tradeoffs compared to scraping web data and can actually be more practical in some situations. Once the simulation is set up and a near-SOTA agent is trained, it can be used to generate massive amounts of high quality data. This is in contrast to web data, which is notorious for its low quality.

In short, we believe that acquiring suitable data is a research question in its own right, and that it is an active area of research with growing momentum and importance.

8.2 Prompt and short context

Gato is prompted with an expert demonstration, which helps the agent output actions corresponding to the given task. This is particularly useful since there is otherwise no task identifier available to the agent (in contrast to many multi-task RL settings). Gato infers the relevant task from the observations and actions in the prompt.

However, the context length of our agent is limited to 1024 tokens, which means that the agent sometimes attends to only a few environment timesteps in total. This is especially the case for environments with image observations, where depending on the resolution each observation can result in more than one hundred tokens. Hence for certain environments only a short chunk of a demonstration episode fits in the transformer memory.

Due to this limited prompt context, preliminary experiments with different prompt structures resulted in very similar performance. Similarly, early evaluations of the model using prompt-based in-context learning on new environments did not show a significant performance improvement compared to prompt-less evaluation in the same setting.
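To make the context-length constraint concrete, the following back-of-the-envelope sketch counts how many complete environment timesteps fit in a 1024-token context. Only the 1024-token limit and the "more than one hundred tokens" per image observation come from the text; the specific token counts per observation and action used below (144 + 1 and 20 + 4) are hypothetical values chosen purely for illustration.

```python
CONTEXT_LEN = 1024  # context length stated in the text


def timesteps_that_fit(context_len, obs_tokens, action_tokens):
    """Number of complete (observation, action) timesteps that fit in the context."""
    per_step = obs_tokens + action_tokens
    if per_step <= 0:
        raise ValueError("a timestep must use at least one token")
    return context_len // per_step


# Hypothetical image-based environment: 144 observation tokens, 1 action token.
image_env_steps = timesteps_that_fit(CONTEXT_LEN, obs_tokens=144, action_tokens=1)

# Hypothetical low-dimensional environment: 20 observation tokens, 4 action tokens.
proprio_env_steps = timesteps_that_fit(CONTEXT_LEN, obs_tokens=20, action_tokens=4)

print(image_env_steps)    # 7
print(proprio_env_steps)  # 42
```

Under these assumed numbers, an image-based prompt covers only a handful of timesteps, which is consistent with the observation that only a short chunk of a demonstration episode fits in the transformer memory.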
Context length is therefore a current limitation of our architecture, mainly due to the quadratic scaling of self-attention. Many recently proposed architectures enable a longer context at greater efficiency, and these innovations could potentially improve our agent's performance.

9 Conclusions

Transformer sequence models are effective as multi-task multi-embodiment policies, including for real-world text, vision and robotics tasks. They also show promise in few-shot out-of-distribution task learning. In the future, such models could be used as a default starting point, via prompting or fine-tuning, to learn new behaviors rather than training from scratch.

Given scaling law trends, the performance across all tasks including dialogue will increase with scale in parameters, data and compute. Better hardware and network architectures will allow training bigger models while maintaining real-time robot control capability. By scaling up and iterating on this same basic approach, we can build a useful general-purpose agent.

Acknowledgments

We would like to thank Dan Horgan, Manuel Kroiss, Mantas Pajarskas, and Thibault Sottiaux for their help with data storage infrastructure; Jean-Baptiste Lespiau and Fan Yang for help on concurrent evaluation; Joel Veness for advising on the model design; Koray Kavukcuoglu for helping inspire the project and facilitating feedback; Tom Erez for advising on the agent design and task selection for continuous control; Igor Babuschkin for helping code the initial prototype; Jack Rae for advising on the transformer language model codebase; and Thomas Lampe for building robot infrastructure.

Author Contributions

Scott Reed developed the project concept, wrote the initial prototype, and led the project overall.
Konrad Żołna led architecture development for vision and text, built infrastructure for tokenization and prompting, and contributed heavily to overall agent development and evaluation.
Emilio Parisotto led work on optimizing the transformer architecture, ran the largest number of experiments, and analyzed scaling law properties and in-distribution agent performance.
Sergio Gómez Colmenarejo was the technical lead, responsible for creating a scalable data loader and evaluator supporting hundreds of tasks at once, and for the initial robot integration with Gato.
Alexander Novikov developed the model including the sampler for the initial prototype, carried out experiments focusing on robotics, and created visualizations.
Gabriel Barth-Maron built scalable storage infrastructure to provide Gato with SoTA-level agent experience in Atari and other domains.
Mai Giménez conducted large scale agent data collection, built substantial data loading infrastructure, and integrated large scale visual-language datasets into the training of Gato.
Yury Sulsky contributed broadly to the Gato codebase including a bespoke distributed training sequence loader, and led the development of benchmarks for out-of-distribution generalization, and the training of competitive baseline agents.
Jackie Kay supported physical robotics infrastructure, conducted numerous evaluations and experiments to analyze the generalization properties of Gato, and contemplated broader ethical impact.
Jost Tobias Springenberg guided Gato’s deployment to the physical robot, provided strong existing baselines for block stacking, and advised on model development and experimental design.
Tom Eccles developed the Gato dialogue and image captioning demonstrations, allowing users to easily probe the vision and language capacities of agents in development.
Jake Bruce contributed to agent design as well as control datasets and environments with randomized physics and morphology variations.
Ali Razavi helped in exploring vision architectures.
Ashley Edwards contributed to the first prototype of Gato that worked on Atari, in addition to exploring alternative network architectures and training objectives.
Nicolas Heess advised on agent design, experiment design and task selection, especially for continuous control applications.
Yutian Chen advised on model design and experiments, and provided feedback in regular meetings.
Raia Hadsell advised on the design and planning of robotics efforts.
Oriol Vinyals advised on all aspects of the project, especially model architecture, training strategies and benchmark design.
Mahyar Bordbar was the primary project manager; eliciting key goals, tracking progress, facilitating presentations and feedback, and coordinating resource planning.
Nando de Freitas oversaw the project from its inception.

References

Abbas Abdolmaleki, Jost Tobias Springenberg, Yuval Tassa, Remi Munos, Nicolas Heess, and Martin Riedmiller. Maximum a posteriori policy optimisation. Preprint arXiv:1806.06920, 2018.
Samira Abnar and Willem Zuidema. Quantifying attention flow in transformers. Preprint arXiv:2005.00928, 2020.
Michael Ahn, Anthony Brohan, Noah Brown, Yevgen Chebotar, Omar Cortes, Byron David, Chelsea Finn, Keerthana Gopalakrishnan, Karol Hausman, Alex Herzog, et al. Do as I can, not as I say: Grounding language in robotic affordances. Preprint arXiv:2204.01691, 2022.
Jean-Baptiste Alayrac, Jeff Donahue, Pauline Luc, Antoine Miech, Iain Barr, Yana Hasson, Karel Lenc, Arthur Mensch, Katie Millican, Malcolm Reynolds, Roman Ring, Eliza Rutherford, Serkan Cabi, Tengda Han, Zhitao Gong, Sina Samangooei, Marianne Monteiro, Jacob Menick, Sebastian Borgeaud, Andy Brock, Aida Nematzadeh, Sahand Sharifzadeh, Mikolaj Binkowski, Ricardo Barreira, Oriol Vinyals, Andrew Zisserman, and Karen Simonyan. Flamingo: a visual language model for few-shot learning. Preprint arXiv:2204.14198, 2022.
Dario Amodei, Chris Olah, Jacob Steinhardt, Paul F. Christiano, John Schulman, and Dan Mané. Concrete problems in AI safety. Preprint arXiv:1606.06565, 2016.
Stanislaw Antol, Aishwarya Agrawal, Jiasen Lu, Margaret Mitchell, Dhruv Batra, C Lawrence Zitnick, and Devi Parikh. VQA: Visual question answering.
In International Conference on Computer Vision, pp. 2425–2433, 2015.
Jimmy Lei Ba, Jamie Ryan Kiros, and Geoffrey E Hinton. Layer normalization. Preprint arXiv:1607.06450, 2016.
Paul Bach-y Rita and Stephen W Kercel. Sensory substitution and the human-machine interface. Trends in Cognitive Sciences, 7(12):541–546, 2003.
Bowen Baker, Ilge Akkaya, Peter Zhokhov, Joost Huizinga, Jie Tang, Adrien Ecoffet, Brandon Houghton, Raul Sampedro, and Jeff Clune. Video pretraining (VPT): Learning to act by watching unlabeled online videos. Preprint arXiv:2206.11795, 2022.
Gabriel Barth-Maron, Matthew W Hoffman, David Budden, Will Dabney, Dan Horgan, Dhruva Tb, Alistair Muldal, Nicolas Heess, and Timothy Lillicrap. Distributed distributional deterministic policy gradients. Preprint arXiv:1804.08617, 2018.
Charles Beattie, Joel Z Leibo, Denis Teplyashin, Tom Ward, Marcus Wainwright, Heinrich Küttler, Andrew Lefrancq, Simon Green, Víctor Valdés, Amir Sadik, et al. DeepMind Lab. Preprint arXiv:1612.03801, 2016.
Marc G Bellemare, Yavar Naddaf, Joel Veness, and Michael Bowling. The arcade learning environment: An evaluation platform for general agents. Journal of Artificial Intelligence Research, 47:253–279, 2013.
Rishi Bommasani, Drew A Hudson, Ehsan Adeli, Russ Altman, Simran Arora, Sydney von Arx, Michael S Bernstein, Jeannette Bohg, Antoine Bosselut, Emma Brunskill, et al. On the opportunities and risks of foundation models. Preprint arXiv:2108.07258, 2021.
Sebastian Borgeaud, Arthur Mensch, Jordan Hoffmann, Trevor Cai, Eliza Rutherford, Katie Millican, George van den Driessche, Jean-Baptiste Lespiau, Bogdan Damoc, Aidan Clark, et al. Improving language models by retrieving from trillions of tokens. Preprint arXiv:2112.04426, 2021.
Nick Bostrom. Superintelligence. Dunod, 2017.
Greg Brockman, Vicki Cheung, Ludwig Pettersson, Jonas Schneider, John Schulman, Jie Tang, and Wojciech Zaremba. OpenAI Gym, 2016.
Preprint arXiv:1606.01540.
TB Brown, B Mann, N Ryder, M Subbiah, J Kaplan, P Dhariwal, A Neelakantan, P Shyam, G Sastry, A Askell, et al. Language models are few-shot learners. In Advances in Neural Information Processing Systems, pp. 1877–1901, 2020.
Serkan Cabi, Sergio Gómez Colmenarejo, Alexander Novikov, Ksenia Konyushkova, Scott Reed, Rae Jeong, Konrad Zolna, Yusuf Aytar, David Budden, Mel Vecerik, et al. Scaling data-driven robotics with reward sketching and batch reinforcement learning. Preprint arXiv:1909.12200, 2019.
Annie S Chen, Suraj Nair, and Chelsea Finn. Learning generalizable robotic reward functions from “in-the-wild” human videos. Preprint arXiv:2103.16817, 2021a.
Lili Chen, Kevin Lu, Aravind Rajeswaran, Kimin Lee, Aditya Grover, Misha Laskin, Pieter Abbeel, Aravind Srinivas, and Igor Mordatch. Decision transformer: Reinforcement learning via sequence modeling. Advances in Neural Information Processing Systems, 34, 2021b.
Mark Chen, Jerry Tworek, Heewoo Jun, Qiming Yuan, Henrique Ponde de Oliveira Pinto, Jared Kaplan, Harri Edwards, Yuri Burda, Nicholas Joseph, Greg Brockman, et al. Evaluating large language models trained on code. Preprint arXiv:2107.03374, 2021c.
Tao Chen, Adithyavairavan Murali, and Abhinav Gupta. Hardware conditioned policies for multi-robot transfer learning. Advances in Neural Information Processing Systems, 31, 2018.
Ting Chen, Saurabh Saxena, Lala Li, David J Fleet, and Geoffrey Hinton. Pix2seq: A language modeling framework for object detection. In ICLR, 2022.
Xinlei Chen, Hao Fang, Tsung-Yi Lin, Ramakrishna Vedantam, Saurabh Gupta, Piotr Dollár, and C Lawrence Zitnick. Microsoft COCO captions: Data collection and evaluation server. Preprint arXiv:1504.00325, 2015.
Maxime Chevalier-Boisvert, Dzmitry Bahdanau, Salem Lahlou, Lucas Willems, Chitwan Saharia, Thien Huu Nguyen, and Yoshua Bengio. BabyAI: A platform to study the sample efficiency of grounded language learning, 2018.
Preprint arXiv:1810.08272.
Aakanksha Chowdhery, Sharan Narang, Jacob Devlin, Maarten Bosma, Gaurav Mishra, Adam Roberts, Paul Barham, Hyung Won Chung, Charles Sutton, Sebastian Gehrmann, et al. PaLM: Scaling language modeling with pathways. Preprint arXiv:2204.02311, 2022.
Karl Cobbe, Chris Hesse, Jacob Hilton, and John Schulman. Leveraging procedural generation to benchmark reinforcement learning. In International Conference on Machine Learning, pp. 2048–2056, 2020.
Zihang Dai, Zhilin Yang, Yiming Yang, Jaime G Carbonell, Quoc Le, and Ruslan Salakhutdinov. Transformer-XL: Attentive language models beyond a fixed-length context. In Annual Meeting of the Association for Computational Linguistics, pp. 2978–2988, 2019.
Coline Devin, Abhishek Gupta, Trevor Darrell, Pieter Abbeel, and Sergey Levine. Learning modular neural network policies for multi-task and multi-robot transfer. In IEEE International Conference on Robotics & Automation, pp. 2169–2176, 2017.
Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. BERT: Pre-training of deep bidirectional transformers for language understanding. Preprint arXiv:1810.04805, 2018.
Alexey Dosovitskiy, Lucas Beyer, Alexander Kolesnikov, Dirk Weissenborn, Xiaohua Zhai, Thomas Unterthiner, Mostafa Dehghani, Matthias Minderer, Georg Heigold, Sylvain Gelly, et al. An image is worth 16x16 words: Transformers for image recognition at scale. Preprint arXiv:2010.11929, 2020.
Lasse Espeholt, Hubert Soyer, Remi Munos, Karen Simonyan, Vlad Mnih, Tom Ward, Yotam Doron, Vlad Firoiu, Tim Harley, Iain Dunning, et al. IMPALA: Scalable distributed deep-RL with importance weighted actor-learner architectures. In International Conference on Machine Learning, pp. 1407–1416, 2018.
Justin Fu, Aviral Kumar, Ofir Nachum, George Tucker, and Sergey Levine. D4RL: Datasets for deep data-driven reinforcement learning. Preprint arXiv:2004.07219, 2020.
Hiroki Furuta, Yutaka Matsuo, and Shixiang Shane Gu.
Generalized decision transformer for offline hindsight information matching. Preprint arXiv:2111.10364, 2021.
Caglar Gulcehre, Ziyu Wang, Alexander Novikov, Thomas Paine, Sergio Gómez, Konrad Zolna, Rishabh Agarwal, Josh S Merel, Daniel J Mankowitz, Cosmin Paduraru, et al. RL unplugged: A suite of benchmarks for offline reinforcement learning. Advances in Neural Information Processing Systems, 33:7248–7259, 2020.
Jeff Hawkins and Sandra Blakeslee. On Intelligence. Macmillan, 2004.
Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image recognition. In IEEE Computer Vision and Pattern Recognition, pp. 770–778, 2016a.
Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Identity mappings in deep residual networks. In European Conference on Computer Vision, pp. 630–645, 2016b.
Kaiming He, Haoqi Fan, Yuxin Wu, Saining Xie, and Ross Girshick. Momentum contrast for unsupervised visual representation learning. In IEEE Computer Vision and Pattern Recognition, pp. 9729–9738, 2020.
Dan Hendrycks and Kevin Gimpel. Gaussian error linear units (GELUs). Preprint arXiv:1606.08415, 2016.
Matteo Hessel, Hubert Soyer, Lasse Espeholt, Wojciech Czarnecki, Simon Schmitt, and Hado van Hasselt. Multi-task deep reinforcement learning with PopArt. In AAAI, 2019.
Matteo Hessel, Ivo Danihelka, Fabio Viola, Arthur Guez, Simon Schmitt, Laurent Sifre, Theophane Weber, David Silver, and Hado van Hasselt. Muesli: Combining improvements in policy optimization. Preprint arXiv:2104.06159, 2021.
Sepp Hochreiter and Jürgen Schmidhuber. Long short-term memory. Neural Computation, 9(8):1735–1780, 1997.
Jordan Hoffmann, Sebastian Borgeaud, Arthur Mensch, Elena Buchatskaya, Trevor Cai, Eliza Rutherford, Diego de Las Casas, Lisa Anne Hendricks, Johannes Welbl, Aidan Clark, et al. Training compute-optimal large language models. Preprint arXiv:2203.15556, 2022.
Gao Huang, Yu Sun, Zhuang Liu, Daniel Sedra, and Kilian Weinberger.
Deep networks with stochastic depth. Preprint arXiv:1603.09382, 2016.
Wenlong Huang, Igor Mordatch, and Deepak Pathak. One policy to control them all: Shared modular policies for agent-agnostic control. In International Conference on Machine Learning, pp. 4455–4464, 2020.
Wenlong Huang, Pieter Abbeel, Deepak Pathak, and Igor Mordatch. Language models as zero-shot planners: Extracting actionable knowledge for embodied agents. Preprint arXiv:2201.07207, 2022.
David Yu-Tung Hui, Maxime Chevalier-Boisvert, Dzmitry Bahdanau, and Yoshua Bengio. BabyAI 1.1. Preprint arXiv:2007.12770, 2020.
Andrew Jaegle, Sebastian Borgeaud, Jean-Baptiste Alayrac, Carl Doersch, Catalin Ionescu, David Ding, Skanda Koppula, Daniel Zoran, Andrew Brock, Evan Shelhamer, et al. Perceiver IO: A general architecture for structured inputs & outputs. Preprint arXiv:2107.14795, 2021.
Michael Janner, Qiyang Li, and Sergey Levine. Offline reinforcement learning as one big sequence modeling problem. Advances in Neural Information Processing Systems, 34, 2021.
Chao Jia, Yinfei Yang, Ye Xia, Yi-Ting Chen, Zarana Parekh, Hieu Pham, Quoc Le, Yun-Hsuan Sung, Zhen Li, and Tom Duerig. Scaling up visual and vision-language representation learning with noisy text supervision. In International Conference on Machine Learning, pp. 4904–4916, 2021.
Melvin Johnson, Orhan Firat, and Roee Aharoni. Massively multilingual neural machine translation. In Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pp. 3874–3884, 2019.
John Jumper, Richard Evans, Alexander Pritzel, Tim Green, Michael Figurnov, Olaf Ronneberger, Kathryn Tunyasuvunakool, Russ Bates, Augustin Žídek, Anna Potapenko, et al. Highly accurate protein structure prediction with AlphaFold. Nature, 596(7873):583–589, 2021.
Lukasz Kaiser, Aidan N Gomez, Noam Shazeer, Ashish Vaswani, Niki Parmar, Llion Jones, and Jakob Uszkoreit. One model to learn them all, 2017.
Preprint arXiv:1706.05137.
Anssi Kanervisto, Joonas Pussinen, and Ville Hautamäki. Benchmarking end-to-end behavioural cloning on video games. In IEEE Conference on Games (CoG), pp. 558–565, 2020.
Jared Kaplan, Sam McCandlish, Tom Henighan, Tom B Brown, Benjamin Chess, Rewon Child, Scott Gray, Alec Radford, Jeffrey Wu, and Dario Amodei. Scaling laws for neural language models. Preprint arXiv:2001.08361, 2020.
Steven Kapturowski, Georg Ostrovski, John Quan, Remi Munos, and Will Dabney. Recurrent experience replay in distributed reinforcement learning. In International Conference on Learning Representations, 2018.
Zachary Kenton, Tom Everitt, Laura Weidinger, Iason Gabriel, Vladimir Mikulik, and Geoffrey Irving. Alignment of language agents. Preprint arXiv:2103.14659, 2021.
Nitish Shirish Keskar, Bryan McCann, Lav R Varshney, Caiming Xiong, and Richard Socher. CTRL: A conditional transformer language model for controllable generation. Preprint arXiv:1909.05858, 2019.
Diederik P. Kingma and Jimmy Ba. Adam: A method for stochastic optimization. Preprint arXiv:1412.6980, 2014.
Taku Kudo and John Richardson. SentencePiece: A simple and language independent subword tokenizer and detokenizer for neural text processing. In Annual Meeting of the Association for Computational Linguistics, pp. 66–71, 2018.
Vitaly Kurin, Maximilian Igl, Tim Rocktäschel, Wendelin Boehmer, and Shimon Whiteson. My body is a cage: the role of morphology in graph-based incompatible control. Preprint arXiv:2010.01856, 2020.
Alex X Lee, Coline Manon Devin, Yuxiang Zhou, Thomas Lampe, Konstantinos Bousmalis, Jost Tobias Springenberg, Arunkumar Byravan, Abbas Abdolmaleki, Nimrod Gileadi, David Khosid, et al. Beyond pick-and-place: Tackling robotic stacking of diverse shapes. In Conference on Robot Learning, 2021.
Alex X Lee, Coline Manon Devin, Jost Tobias Springenberg, Yuxiang Zhou, Thomas Lampe, Abbas Abdolmaleki, and Konstantinos Bousmalis.
How to spend your robot time: Bridging kickstarting and offline reinforcement learning for vision-based robotic manipulation. Preprint arXiv:2205.03353, 2022.
Shuang Li, Xavier Puig, Chris Paxton, Yilun Du, Clinton Wang, Linxi Fan, Tao Chen, De-An Huang, Ekin Akyürek, Anima Anandkumar, Jacob Andreas, Igor Mordatch, Antonio Torralba, and Yuke Zhu. Pre-trained language models for interactive decision-making. Preprint arXiv:2202.01771, 2022a.
Yujia Li, David Choi, Junyoung Chung, Nate Kushman, Julian Schrittwieser, Rémi Leblond, Tom Eccles, James Keeling, Felix Gimeno, Agustin Dal Lago, et al. Competition-level code generation with AlphaCode. Preprint arXiv:2203.07814, 2022b.
Ilya Loshchilov and Frank Hutter. Decoupled weight decay regularization. Preprint arXiv:1711.05101, 2017.
Kenneth Marino, Mohammad Rastegari, Ali Farhadi, and Roozbeh Mottaghi. OK-VQA: A visual question answering benchmark requiring external knowledge. In IEEE Computer Vision and Pattern Recognition, pp. 3195–3204, 2019.
Jacob Menick, Maja Trebacz, Vladimir Mikulik, John Aslanides, Francis Song, Martin Chadwick, Mia Glaese, Susannah Young, Lucy Campbell-Gillingham, Geoffrey Irving, et al. Teaching language models to support answers with verified quotes. Preprint arXiv:2203.11147, 2022.
Margaret Mitchell, Simone Wu, Andrew Zaldivar, Parker Barnes, Lucy Vasserman, Ben Hutchinson, Elena Spitzer, Inioluwa Deborah Raji, and Timnit Gebru. Model cards for model reporting. In Proceedings of the Conference on Fairness, Accountability, and Transparency, pp. 220–229, 2019.
Volodymyr Mnih, Koray Kavukcuoglu, David Silver, Andrei A Rusu, Joel Veness, Marc G Bellemare, Alex Graves, Martin Riedmiller, Andreas K Fidjeland, Georg Ostrovski, et al. Human-level control through deep reinforcement learning. Nature, 518(7540):529–533, 2015.
Vernon Mountcastle. An organizing principle for cerebral function: the unit module and the distributed system, 1978.
The mindful brain Reiichiro Nakano, Jacob Hilton, Suchir Balaji, Jeff Wu, Long Ouyang, Christina Kim, Christopher Hesse, Shantanu Jain, Vineet Kosaraju, William Saunders, et al. WebGPT: Browser-assisted question-answering with human feedback. , 2021. Preprint arXiv:2112.09332 Aaron van den Oord, Sander Dieleman, Heiga Zen, Karen Simonyan, Oriol Vinyals, Alex Graves, Nal Kalchbrenner, Andrew Senior, and Koray Kavukcuoglu. WaveNet: A generative model for raw audio. , 2016. Preprint arXiv:1609.03499 Pedro A Ortega, Markus Kunesch, Grégoire Delétang, Tim Genewein, Jordi Grau-Moya, Joel Veness, Jonas Buchli, Jonas Degrave, Bilal Piot, Julien Perolat, et al. Shaking the foundations: delusions in sequence models for interaction and control. , 2021. Preprint arXiv:2110.10819 Long Ouyang, Jeff Wu, Xu Jiang, Diogo Almeida, Carroll L Wainwright, Pamela Mishkin, Chong Zhang, Sandhini Agarwal, Katarina Slama, Alex Ray a ďalší. , 2022. Preprint arXiv:2203.02155 Simone Parisi, Aravind Rajeswaran, Senthil Purushwalkam, and Abhinav Gupta. The unsurprising effec-tiveness of pre-trained vision models for control. , 2022. Preprint arXiv:2203.03580 Vineel Pratap, Anuroop Sriram, Paden Tomasello, Awni Hannun, Vitaliy Liptchinsky, Gabriel Synnaeve, and Ronan Collobert. Massively multilingual ASR: 50 languages, 1 model, 1 billion parameters. , 2020. Preprint arXiv:2007.03001 Sébastien Racanière, Théophane Weber, David Reichert, Lars Buesing, Arthur Guez, Danilo Jimenez Rezende, Adrià Puigdomènech Badia, Oriol Vinyals, Nicolas Heess, Yujia Li, et al. Imagination-augmented agents for deep reinforcement learning. , 30, 2017. Advances in Neural Information Processing Systems Jack W Rae, Sebastian Borgeaud, Trevor Cai, Katie Millican, Jordan Hoffmann, Francis Song, John Aslanides, Sarah Henderson, Roman Ring, Susannah Young, et al. Scaling language models: Methods, analysis & insights from training gopher. , 2021. Preprint arXiv:2112.11446 Scott Reed and Nando De Freitas. 
Neural programmer-interpreters. In , 2016. International Conference on Learning Representations Machel Reid, Yutaro Yamada, and Shixiang Shane Gu. Can Wikipedia help offline reinforcement learning? v roku 2022. Preprint arXiv:2201.12122 Stuart Russell. . Penguin, 2019. Andrei A Rusu, Neil C Rabinowitz, Guillaume Desjardins, Hubert Soyer, James Kirkpatrick, Koray Ľudská kompatibilita: umelá inteligencia a problém kontroly Kavukcuoglu, Razvan Pascanu, and Raia Hadsell. Progressive neural networks. , 2016. Preprint arXiv:1606.04671 Victor Sanh, Albert Webson, Colin Raffel, Stephen Bach, Lintang Sutawika, Zaid Alyafeai, Antoine Chaffin, Arnaud Stiegler, Arun Raja, Manan Dey, M Saiful Bari, Canwen Xu, Urmish Thakker, Shanya Sharma Sharma, Eliza Szczechla, Taewoon Kim, Gunjan Chhablani, Nihal Nayak, Debajyoti Datta, Jonathan Chang, Mike Tian-Jian Jiang, Han Wang, Matteo Manica, Sheng Shen, Zheng Xin Yong, Harshit Pandey, Rachel Bawden, Thomas Wang, Trishala Neeraj, Jos Rozen, Abheesht Sharma, Andrea Santilli, Thibault Fevry, Jason Alan Fries, Ryan Teehan, Teven Le Scao, Stella Biderman, Leo Gao, Thomas Wolf, and Alexander M Rush. Multitask prompted training enables zero-shot task generalization. In , 2022. International Conference on Learning Representations Jürgen Schmidhuber. One big net for everything. v roku 2018. Preprint arXiv:1802.08864 Julian Schrittwieser, Ioannis Antonoglou, Thomas Hubert, Karen Simonyan, Laurent Sifre, Simon Schmitt, Arthur Guez, Edward Lockhart, Demis Hassabis, Thore Graepel, et al. Mastering atari, go, chess and shogi by planning with a learned model. , 588(7839):604–609, 2020. Nature Piyush Sharma, Nan Ding, Sebastian Goodman, and Radu Soricut. Conceptual captions: A cleaned, hyper-nymed, image alt-text dataset for automatic image captioning. In , pp. 2556–2565, 2018. Annual Meeting of the Association for Computational Linguistics Noam Shazeer. Glu variants improve transformer. , 2020. 
Preprint arXiv::2002.05202 H Francis Song, Abbas Abdolmaleki, Jost Tobias Springenberg, Aidan Clark, Hubert Soyer, Jack W Rae, Seb Noury, Arun Ahuja, Siqi Liu, Dhruva Tirumala, et al. V-mpo: On-policy maximum a posteriori optimalizácia politiky pre diskrétnu a nepretržitú kontrolu. , 2020. ICLR Nitish Srivastava, Geoffrey Hinton, Alex Krizhevsky, Ilya Sutskever, and Ruslan Salakhutdinov. Dropout: A simple way to prevent neural networks from overfitting. , 15(56): 1929–1958, 2014. Journal of Machine Learning Research Richard Sutton. The bitter lesson. , 13:12 , 2019 Incomplete Ideas (blog) Yuval Tassa, Yotam Doron, Alistair Muldal, Tom Erez, Yazhe Li, Diego de Las Casas, David Budden, Abbas Abdolmaleki, Josh Merel, Andrew Lefrancq, et al. DeepMind control suite. v roku 2018. Predbežná tlač arXiv:1801.00690 Romal Thoppilan, Daniel De Freitas, Jamie Hall, Noam Shazeer, Apoorv Kulshreshtha, Heng-Tze Cheng, Alicia Jin, Taylor Bos, Leslie Baker, Yu Du, et al. LaMDA: Language models for dialog applications. v roku 2022. Preprint arXiv:2201.08239 Emanuel Todorov, Tom Erez, and Yuval Tassa. Mujoco: A physics engine for model-based control. In , str. 5026 až 5033, 2012 International Conference on Intelligent Robots and Systems Maria Tsimpoukelli, Jacob L Menick, Serkan Cabi, SM Eslami, Oriol Vinyals a Felix Hill. , pp. 200–212, 2021. Advances in Neural Information Processing Systems Saran Tunyasuvunakool, Alistair Muldal, Yotam Doron, Siqi Liu, Steven Bohez, Josh Merel, Tom Erez, Timothy Lillicrap, Nicolas Heess a Yuval Tassa. dm_control: Softvér a úlohy pre kontinuálnu kontrolu. , 6:100022, 2020. Software Impacts Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N Gomez, Łukasz Kaiser, and Illia Polosukhin. Attention is all you need. , 30, 2017. Advances in Neural Information Processing Systems Zirui Wang, Jiahui Yu, Adams Wei Yu, Zihang Dai, Yulia Tsvetkov, and Yuan Cao. Simvlm: Simple visual language model pretraining with weak supervision. 
Preprint arXiv:2108.10904, 2021.
Ziyu Wang, Alexander Novikov, Konrad Zolna, Josh S Merel, Jost Tobias Springenberg, Scott E Reed, Bobak Shahriari, Noah Siegel, Caglar Gulcehre, Nicolas Heess, et al. Critic regularized regression. Advances in Neural Information Processing Systems, 33:7768–7778, 2020.
Jason Wei, Maarten Bosma, Vincent Y Zhao, Kelvin Guu, Adams Wei Yu, Brian Lester, Nan Du, Andrew M Dai, and Quoc V Le. Finetuned language models are zero-shot learners. Preprint arXiv:2109.01652, 2021.
Laura Weidinger, John Mellor, Maribeth Rauh, Conor Griffin, Jonathan Uesato, Po-Sen Huang, Myra Cheng, Mia Glaese, Borja Balle, Atoosa Kasirzadeh, et al. Ethical and social risks of harm from language models. Preprint arXiv:2112.04359, 2021.
Yuxin Wu and Kaiming He. Group normalization. In European Conference on Computer Vision, pp. 3–19, 2018.
Tianhe Yu, Deirdre Quillen, Zhanpeng He, Ryan Julian, Karol Hausman, Chelsea Finn, and Sergey Levine. Meta-World: A benchmark and evaluation for multi-task and meta reinforcement learning. In Conference on Robot Learning, pp. 1094–1100, 2020.
Qinqing Zheng, Amy Zhang, and Aditya Grover. Online decision transformer. Preprint arXiv:2202.05607, 2022.
Konrad Zolna, Alexander Novikov, Ksenia Konyushkova, Caglar Gulcehre, Ziyu Wang, Yusuf Aytar, Misha Denil, Nando de Freitas, and Scott Reed. Offline learning from demonstrations and unlabeled experience. Preprint arXiv:2011.13885, 2020.
Konrad Zolna, Scott Reed, Alexander Novikov, Sergio Gómez Colmenarejo, David Budden, Serkan Cabi, Misha Denil, Nando de Freitas, and Ziyu Wang. Task-relevant adversarial imitation learning. In Conference on Robot Learning, pp. 247–263, 2021.

Supplementary Material

A Model card

We present a model card for Gato in Table 4.

Table 4: Gato Model Card. We follow the framework proposed in (Mitchell et al., 2019).

B Agent Data Tokenization Details

In this section we provide additional details on our tokenization schemes. Our agent data is sequenced as follows:

• Episodes are presented to the agent in order of time (timesteps).
• Timesteps in turn are presented in the following order:
  – Observations ([y_{1:k}, x_{1:m}, z_{1:n}]) are ordered lexicographically by key, and each item is sequenced as follows:
    ∗ Text tokens (y_{1:k}) are in the same order as the raw input text.
    ∗ Image patch tokens (x_{1:m}) are in raster order.
    ∗ Tensors (z_{1:n}) (such as discrete and continuous observations) are in row-major order.
  – Separator ('|'): a designated separator token is provided after observations.
  – Actions (a_{1:A}) are tokenized as discrete or continuous values and in row-major order.

A full sequence of tokens is thus given as the concatenation of data from T timesteps:

s_{1:L} = [[y^1_{1:k}, x^1_{1:m}, z^1_{1:n}, '|', a^1_{1:A}], …, [y^T_{1:k}, x^T_{1:m}, z^T_{1:n}, '|', a^T_{1:A}]],

where L = T(k + m + n + 1 + A) is the total number of tokens.

Each floating point element of tensors in the observation sequence is mu-law companded as in WaveNet (Oord et al., 2016):

F(x) = sgn(x) · log(|x|µ + 1.0) / log(Mµ + 1.0)

with parameters µ = 100 and M = 256. (If the floating-point tensor is in the action set, we do not need to compand the elements in the sequence because actions are only defined in the range [−1, 1] for all our environments.) All the elements are subsequently clipped so that they fall in the set [−1, 1]. Finally, they are discretized using bins of uniform width on the domain [−1, 1]. We use 1024 bins and shift the resulting integers so they do not overlap with the ones used for text tokens. The tokenized result is therefore a sequence of integers within the range of [32000, 33024).

See Figure 14 and Figure 15 for visualizations of tokenizing and sequencing values (both discrete and continuous) and images. See Section C for details about local position encodings referenced in the figures.

C Model Architecture

C.1 Transformer Hyperparameters

The transformer hyperparameters of Gato are presented in Table 5. We also list the hyperparameters of smaller architecture variants used in Section 5.
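The continuous-value tokenization described in Section B can be illustrated with a short sketch. The constants (µ = 100, M = 256, 1024 bins, offset 32000) come from the text; the function names, the `is_action` flag, and the exact binning rule (flooring, with the right edge folded into the last bin) are assumptions made for illustration and are not taken from the Gato implementation.

```python
import numpy as np

MU = 100.0               # mu-law parameter from the text
M = 256.0                # mu-law parameter from the text
NUM_BINS = 1024          # number of uniform bins on [-1, 1]
TEXT_VOCAB_SIZE = 32000  # offset so tokens do not overlap with text tokens


def mu_law_encode(x):
    """Compand values as F(x) = sgn(x) * log(|x| * mu + 1) / log(M * mu + 1)."""
    x = np.asarray(x, dtype=np.float64)
    return np.sign(x) * np.log(np.abs(x) * MU + 1.0) / np.log(M * MU + 1.0)


def tokenize_continuous(x, is_action=False):
    """Map floats to integer tokens in [32000, 33024).

    Observations are companded first; actions are assumed to lie in [-1, 1]
    already and are only clipped and discretized.
    """
    x = np.asarray(x, dtype=np.float64)
    if not is_action:
        x = mu_law_encode(x)
    x = np.clip(x, -1.0, 1.0)
    # Assumed binning rule: uniform-width bins, value 1.0 goes to the last bin.
    bins = np.floor((x + 1.0) / 2.0 * NUM_BINS).astype(np.int64)
    bins = np.minimum(bins, NUM_BINS - 1)
    return bins + TEXT_VOCAB_SIZE


def sequence_length(T, k, m, n, A):
    """Total tokens L = T * (k + m + n + 1 + A); the +1 is the separator."""
    return T * (k + m + n + 1 + A)
```

Under these assumptions a zero observation maps to the middle bin, and very large magnitudes saturate at the first or last bin after clipping.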
C.2 Embedding Function

The ResNet block uses the v2 architecture (He et al., 2016b), contains GroupNorm (Wu & He, 2018) with 32 groups instead of LayerNorm (Ba et al., 2016), and uses GELU (Hendrycks & Gimpel, 2016) activation functions instead of ReLU. The block is diagrammed in Figure 16.

C.3 Position Encodings

After tokens are mapped into token embeddings, two position encodings are added to the token embeddings (when applicable) to provide temporal and spatial information to the model.

Patch Position Encodings

These position encodings convey information about a patch's global position within the image from which the patch was extracted. First, the relative row and column intervals of the patch are calculated by normalizing the patch's pixel intervals by the image resolution. The normalized row and column intervals are then quantized into a vocabulary size (we use 128) and are used to index a row and a column table of learnable position encodings. The way the quantized row and column intervals are converted into indices depends on whether we are training or evaluating the model: during training a random index is uniformly sampled from the quantized interval, while during evaluation the mean of the interval is used.

To more concretely demonstrate this process, we provide an example in Figure 17. We will follow the process with the patch highlighted in red on the left of the subfigure. The image is of resolution 80 × 64 and each patch is 16 × 16, meaning there are 5 × 4 = 20 patches in total. The highlighted patch covers pixel row interval [16, 32] and pixel column interval [32, 48]. Normalized, the row interval is therefore [0.25, 0.5] and the column interval is [0.4, 0.6]. We then separately quantize the intervals into 128 uniformly spaced bins, with the resulting quantized row interval being [32, 64] and the quantized column interval being [51, 77].
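The arithmetic of this example can be checked with a minimal sketch. It takes the highlighted patch to span pixel rows 16 to 32 and pixel columns 32 to 48, which is what the 16-pixel patch width and the normalized interval [0.4, 0.6] imply. The function names, the use of rounding to quantize, the clipping to the last table index, and the inclusive sampling range are assumptions for illustration, not details confirmed by the text.

```python
import random

VOCAB_SIZE = 128  # size of each position-encoding table, from the text


def quantize_interval(start_px, end_px, size_px, vocab_size=VOCAB_SIZE):
    """Normalize a pixel interval by the image size and quantize it.

    Rounding to the nearest bin and clipping to vocab_size - 1 are
    assumptions; the text only says the interval is quantized.
    """
    lo = round(start_px / size_px * vocab_size)
    hi = round(end_px / size_px * vocab_size)
    return min(lo, vocab_size - 1), min(hi, vocab_size - 1)


def position_index(interval, training, rng=random):
    """Pick a table index: random within the interval when training,
    the rounded mean of the interval when evaluating."""
    lo, hi = interval
    if training:
        return rng.randint(lo, hi)  # inclusive on both ends (assumed)
    return round((lo + hi) / 2)


# Image is 80 pixels wide and 64 pixels high; patch rows 16..32, columns 32..48.
row_interval = quantize_interval(16, 32, size_px=64)
col_interval = quantize_interval(32, 48, size_px=80)
row_index = position_index(row_interval, training=False)
col_index = position_index(col_interval, training=False)
```

The two indices would then be used to look up one learnable embedding from the row table and one from the column table.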
During training, we uniformly sample integers within the quantized intervals, whereas during testing we would use the means, which are index 48 for row position and index 64 for column position. The row and column positions are finally used to index separate row and column position encoding tables to produce learnable embeddings which are added onto the corresponding patch token embedding.

Local Observation Position Encodings

First, we reiterate that, during tokenization, for each time-step all elements of the observation set are tokenized into sequences and concatenated into an observation sequence. Each token in this observation sequence is given an index which corresponds to the sequence order, i.e. the first token is 0 and the last is the length of the observation sequence minus one. After embedding, for any tokens that were a part of an observation set, the corresponding observation token index is used to index an embedding table of learnable position encodings, with one embedding for every possible observation token index (in practice we simply set the table size to a large value like 512). The position encoding is then added onto the observation token embedding to produce the final token embedding. Note that all action tokens are given the same position encoding regardless of their position in the time-step sequence (see Figure 18).

D Pretraining Setup

Optimizer: For all models we use the AdamW (Loshchilov & Hutter, 2017) optimizer with a linear warm-up and cosine schedule decay. The linear warm-up lasts for 15,000 steps, starting from a learning rate of 1e-7 and ending at a different maximum learning rate depending on the model (see Table 6). This learning rate is then cosine decayed by a factor of 10x over 1,000,000 steps. The AdamW optimizer has parameters β1 = 0.9, β2 = 0.95 and ϵ = 1e-8. We use a batch size of 512 and a sequence length of 1024 tokens for all models.

Regularization: We train with an AdamW weight decay parameter of 0.1.
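The learning-rate schedule of the pretraining setup (linear warm-up followed by cosine decay) can be written down as a small function. This is a sketch under stated assumptions: the decay is taken to start right after the warm-up and to end at one tenth of the maximum learning rate, and `max_lr` is a free parameter because the per-model values live in Table 6. The function name is hypothetical.

```python
import math

WARMUP_STEPS = 15_000    # from the text
DECAY_STEPS = 1_000_000  # from the text
INIT_LR = 1e-7           # learning rate at step 0, from the text
DECAY_FACTOR = 10.0      # "cosine decayed by a factor of 10x"


def learning_rate(step, max_lr):
    """Learning rate at a given step for a model-specific max_lr."""
    if step < WARMUP_STEPS:
        # Linear warm-up from INIT_LR to max_lr.
        return INIT_LR + (max_lr - INIT_LR) * step / WARMUP_STEPS
    # Assumption: the 1,000,000 decay steps are counted after the warm-up.
    progress = min((step - WARMUP_STEPS) / DECAY_STEPS, 1.0)
    final_lr = max_lr / DECAY_FACTOR
    return final_lr + 0.5 * (max_lr - final_lr) * (1.0 + math.cos(math.pi * progress))
```

For a hypothetical `max_lr` of 1e-4, the schedule starts at 1e-7, peaks at 1e-4 after 15,000 steps, and settles at 1e-5 once the decay is complete.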
We additionally use stochastic depth (Huang et al., 2016) during pretraining, where each of the transformer sub-layers (i.e. each Multi-Head Attention and Dense Feedforward layer) is skipped with a probability of 0.1.

E Fine-tuning Setup

Optimizer: For all models we use the Adam (Kingma & Ba, 2014) optimizer with a constant learning rate of 1e-5. The Adam optimizer has parameters β1 = 0.9, β2 = 0.95 and ϵ = 1e-8. We use a batch size of 64 and a sequence length of 1024 tokens for all models. We train for 10,000 gradient steps.

Regularization: We use dropout (Srivastava et al., 2014) with a rate of 0.1.

Evaluation: We evaluate the agent every 100 learning steps. Each evaluation reports the average of 10 runs of a given checkpoint. The moving average of 5 such scores is computed (to gather 50 runs together). The final fine-tuning performance is defined as the maximum of these smoothed scores.

Datasets: We generated data for the fine-tuning tasks the same way we did for the other tasks (see Section 3.1 for details). Instead of using all the data for a fine-tuning task, we discarded all but the 2000 best episodes (those with the highest returns). The fine-tuning datasets were created in the following way. We randomly took 1000 episodes (out of the 2000 preselected episodes), then a subset of 100 episodes from the selected episodes, then 10, 5, 3, and finally a single episode. We repeated this procedure 3 times to obtain 3 series of cascading subsets for each task. Each subset is used to conduct one fine-tuning experiment, and each is reported on our plots in Section 5.2 as a separate point.

Task settings: We have not altered any of the tasks and used their canonical versions. As 3 out of 4 tasks are open sourced, they do not need further explanation. For the fourth task, DMLab order_of_apples_forage_simple, the goal is to collect apples in the right order, green ones first followed by the gold one.
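Two of the fine-tuning procedures above lend themselves to a short illustration: building one series of cascading subsets, and turning a list of evaluation scores into the final smoothed score. The sketch below assumes episodes are held in a plain Python sequence and that each evaluation score is already the average of 10 runs; the function names and the `seed` argument are hypothetical.

```python
import random


def cascading_subsets(episodes, sizes=(1000, 100, 10, 5, 3, 1), seed=0):
    """Return nested random subsets: 1000 of the input, 100 of those, and so on."""
    rng = random.Random(seed)
    pool = list(episodes)
    subsets = []
    for size in sizes:
        pool = rng.sample(pool, size)  # each subset is drawn from the previous one
        subsets.append(pool)
    return subsets


def final_finetuning_score(eval_scores, window=5):
    """Maximum over the moving average of `window` consecutive evaluation scores."""
    if len(eval_scores) < window:
        raise ValueError("need at least `window` evaluation scores")
    smoothed = [
        sum(eval_scores[i:i + window]) / window
        for i in range(len(eval_scores) - window + 1)
    ]
    return max(smoothed)


# Three series of cascading subsets for one task, one per seed.
series = [cascading_subsets(range(2000), seed=s) for s in range(3)]
```

Taking the maximum of the smoothed curve rather than of the raw scores means a single lucky evaluation cannot determine the reported result.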
F Data Collection Details

F.1 Atari

We collect two separate sets of Atari environments. The first (that we refer to as ALE Atari) consists of 51 canonical games from the Arcade Learning Environment (Bellemare et al., 2013). The second (that we refer to as ALE Atari Extended) is a set of alternative games with their game mode and difficulty randomly set at the beginning of each episode.

For each environment in these sets we collect data by training a Muesli (Hessel et al., 2021) agent for 200M total environment steps. We record approximately 20,000 random episodes generated by the agent during training.

F.2 Sokoban

Sokoban is a planning problem (Racanière et al., 2017) in which the agent has to push boxes to target locations. Some of the moves are irreversible and consequently mistakes can render the puzzle unsolvable. Planning ahead of time is therefore necessary to succeed at this puzzle. We use a Muesli (Hessel et al., 2021) agent to collect training data.

F.3 BabyAI

BabyAI is a gridworld environment whose levels consist of instruction-following tasks that are described by a synthetic language. We generate data for these levels with the built-in BabyAI bot. The bot has access to extra information which is used to execute optimal solutions; see Section C in the appendix of (Chevalier-Boisvert et al., 2018) for more details about the bot. We collect 100,000 episodes for each level.

F.4 DeepMind Control Suite

The DeepMind Control Suite (Tunyasuvunakool et al., 2020; Tassa et al., 2018) is a set of physics-based simulation environments. For each task in the control suite we collect two disjoint sets of data, one using only state features and another using only pixels. We use a D4PG (Barth-Maron et al., 2018) agent to collect data from tasks with state features, and an MPO (Abdolmaleki et al., 2018) based agent to collect data using pixels.

We also collect data for randomized versions of the control suite tasks with a D4PG agent.
These versions randomize the actuator gear, joint range, stiffness, and damping, and geom size and density. There are two difficulty settings for the randomized versions. The small setting scales values by a random number sampled from the union of intervals [0.9, 0.95] ∪ [1.05, 1.1]. The large setting scales values by a random number sampled from the union of intervals [0.6, 0.8] ∪ [1.2, 1.4].

F.5 DeepMind Lab

DeepMind Lab (Beattie et al., 2016) is a first-person 3D environment designed to teach agents 3D vision from raw pixel inputs with an egocentric viewpoint, navigation, and planning.

We trained an IMPALA (Espeholt et al., 2018) agent jointly on a set of 18 parent DM Lab levels which generate maps procedurally for each new episode. Data was collected by executing the agent on these 18 levels, as well as on an additional set of 237 levels handcrafted to test a diverse set of skills.

The 18 parent levels are characterized by a high diversity of generated maps. The difference between the levels is rooted in the hyper-parameters used in the generation process. These hyper-parameters control high-level characteristics such as the types of structures spawned, the difficulty of language instructions, or the presence of specific tools.

In contrast to the parent levels, each of the additional 237 handcrafted levels uses almost the same map, and the main differences between instances of the same level map are aesthetics such as colors of walls or lighting conditions. The maps are not procedurally generated and were designed to test a diverse set of skills such as walking up stairs or using specific tools. They are similar to the levels presented in Figure 3, Figure 7 and Figure 8 in the aforementioned paper by Beattie et al. (2016).

Additional information on the 18 parent levels (and their relation to the other levels) is presented in detail in the NeurIPS workshop talk by Daniel Tanis,
A Methodology for RL Environment Research.

In total, we collected data for 255 levels from the DeepMind Lab (18 parent levels and 237 handcrafted levels), 254 of which were used while training Gato.

F.6 Procgen Benchmark

Procgen (Cobbe et al., 2020) is a suite of 16 procedurally generated Atari-like environments, which was proposed to benchmark sample efficiency and generalization in reinforcement learning. Data was collected while training an R2D2 agent (Kapturowski et al., 2018) on each of the environments. We used the hard difficulty setting for all environments except for maze and heist, which we set to easy.

F.7 Modular RL

Modular RL (Huang et al., 2020) is a collection of MuJoCo (Todorov et al., 2012) based continuous control environments, composed of three sets of variants of the OpenAI Gym (Brockman et al., 2016) Walker2d-v2, Humanoid-v2, and Hopper-v2. Each variant is a morphological modification of the original body: the set of morphologies is generated by enumerating all possible subsets of limbs, and keeping only those sets that a) contain the torso, and b) still form a connected graph. This results in a set of variants with different input and output sizes, as well as different dynamics than the original morphologies. We collected data by training a single morphology-specific D4PG agent on each variant for a total of 140M actor steps; this was done for 30 random seeds per variant.

F.8 DeepMind Manipulation Playground

The DeepMind Manipulation Playground (Zolna et al., 2021) is a suite of MuJoCo based simulated robot tasks. We collect data for 4 of the Jaco tasks (box, stack banana, insertion, and slide) using a Critic-Regularized Regression (CRR) agent (Wang et al., 2020) trained from images on human demonstrations. The collected data includes the MuJoCo physics state, which we use for training and evaluating Gato.

F.9 Meta-World

Meta-World (Yu et al., 2020) is a suite of environments for benchmarking meta-reinforcement learning and multi-task learning. We collect data from all train and test tasks in the MT50 mode by training an MPO agent (Abdolmaleki et al., 2018) with unlimited environment seeds and with access to the state of the MuJoCo physics engine. The collected data also contains the MuJoCo physics engine state.

G Real robotics evaluation details

In the real world, control is asynchronous; physics does not wait for computations to finish. Thus, inference latency is a concern when evaluating a large model on real-world tasks. In robotics, a fast control rate is thought to be critical for reacting to dynamic phenomena. The robot setup for RGB stacking has a 20Hz control rate (0.05 second timestep) by design. In order to reach an acceptable margin of latency, we modified inference at evaluation time by shortening the context length to 1. We also implemented a parallel sampling scheme where all the action tokens are zeroed out in the input sequences during training, so we can sample all tokens corresponding to a robot action in a single model inference step instead of autoregressively as is done in other domains. We found that the 1.18B parameter model was able to run on the hardware accelerators in our robots (NVidia GeForce RTX 3090s), but still overran the 20Hz control rate by a small amount (~0.01 seconds).

We use the sparse reward function described in Lee et al. (2021) for data filtering. We only select trajectories with final task success; that is, a sparse reward of 1 on the final timestep.

H Skill Mastery architecture

The numbers reported for the Skill Mastery benchmark were collected by executing a model zero-shot that used an earlier version of the Gato architecture. Instead of the ResNet patch embedding, a similar architecture using a local transformer was used to embed image patch tokens. The local position embeddings and patch position embeddings were not used.
These changes were implemented and found to improve Gato's performance after the pretraining data was changed (as we decided to focus on Skill Generalization instead of the Skill Mastery challenge), which is why they are presented as the final architecture of our full model.

I Additional robotics ablations

We conducted a series of ablations in simulation to better understand the effect of diverse pretraining data in the robotics domain (see Figure 19). We included the same baselines as in Section 5.2, selecting the 364M parameter size variant, as well as an additional baseline trained with control suite data only. The DM Control-only agent is superior to the base Gato at zero-shot transfer and with a lot of fine-tuning data, suggesting that Gato may not be using the representations learned from the text-based datasets when adapting to robotics tasks. The same-domain-only agent performs the best overall, matching the CRR baseline at 1 fine-tuning episode and outperforming it with more data, suggesting that Gato at current scale can trade its generalization capacity for data-efficient and effective few-shot adaptation.

J Attention visualization

To render the transformer attention weights, we retrieved the cross-attention logits, a tensor with dimension (H, T, T) where H is the number of heads and T is the number of tokens in a sequence. The (h, i, j)th entry of this matrix can be interpreted as the amount that head h attends to token j from token i. Due to Gato's image tokenization scheme, there are multiple tokens per timestep. Therefore, to render the attention for a particular timestep, we took the sub-matrix that corresponds to that timestep. We then applied a softmax over the rows of this matrix to normalize the relevant values. Because we are only interested in attention to previous tokens, we excluded the diagonal by setting it to negative infinity before the softmax.
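The sub-matrix extraction and row normalization described above can be sketched in a few lines of NumPy. The function name and the `start`/`end` token indices that delimit a timestep are hypothetical, the logits are taken as given with no further causal masking applied, and the sketch assumes the timestep contains at least two tokens so that every row keeps a finite entry after the diagonal is masked.

```python
import numpy as np


def timestep_attention(logits, start, end):
    """Row-normalized attention for the tokens [start, end) of one timestep.

    logits: array of shape (H, T, T), where entry (h, i, j) is the logit of
    head h attending to token j from token i.
    Returns an array of shape (H, n, n) with n = end - start.
    """
    # Copy the sub-matrix belonging to the timestep so the input is untouched.
    sub = np.array(logits[:, start:end, start:end], dtype=np.float64)
    n = sub.shape[-1]
    idx = np.arange(n)
    # Exclude self-attention by masking the diagonal before the softmax.
    sub[:, idx, idx] = -np.inf
    # Numerically stable softmax over each row.
    sub = sub - sub.max(axis=-1, keepdims=True)
    weights = np.exp(sub)
    return weights / weights.sum(axis=-1, keepdims=True)
```

Each row of the result sums to one and has a zero on the diagonal, so it can be rendered directly as a per-head attention map for that timestep.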
To measure the importance of each patch, we averaged the attention weights over the corresponding column. Because Gato uses a causal transformer, the attention matrix is lower triangular, so the mean was only considered over the sub-column below the diagonal of the matrix.

Using this method, we found that the attention maps at the first layer of the transformer are the most interpretable, agreeing with the findings of Abnar & Zuidema (2020). Certain heads clearly track specific entities and regions of the image. Figure 20 shows the attention maps for manually selected heads at the first layer for several tasks.

K Detailed results for specialist Meta-World agent

The specialist Meta-World agent described in Section 5.5 achieves a 96.6% success rate averaged over all 50 Meta-World tasks. The detailed success rates are presented in Table 7. We evaluated the agent 500 times for each task.

L Per-domain results for Gato

We describe the performance of Gato for simulated control tasks in Section 4.1. In Table 8, we present normalized per-domain results. We evaluated the agent 50 times for each task.

This paper is available on arXiv under the CC BY 4.0 Deed (Attribution 4.0 International) license.