DeepMind's Gato Shows How One AI Can Learn Everything at Once

Authors: Scott Reed, Konrad Żołna, Emilio Parisotto, Sergio Gómez Colmenarejo, Alexander Novikov, Gabriel Barth-Maron, Mai Giménez, Yury Sulsky, Jackie Kay, Jost Tobias Springenberg, Tom Eccles, Jake Bruce, Ali Razavi, Ashley Edwards, Nicolas Heess, Yutian Chen, Raia Hadsell, Oriol Vinyals, Mahyar Bordbar, Nando de Freitas

Abstract

Inspired by progress in large-scale language modeling, we apply a similar approach towards building a single generalist agent beyond the realm of text outputs. The agent, which we refer to as Gato, works as a multi-modal, multi-task, multi-embodiment generalist policy. The same network with the same weights can play Atari, caption images, chat, stack blocks with a real robot arm and much more, deciding based on its context whether to output text, joint torques, button presses, or other tokens.

1 Introduction

There are significant benefits to using a single neural sequence model across all tasks. It reduces the need for hand crafting policy models with appropriate inductive biases for each domain. It increases the amount and diversity of training data since the sequence model can ingest any data that can be serialized into a flat sequence. Furthermore, its performance continues to improve even at the frontier of data, compute and model scale (Kaplan et al., 2020; Hoffmann et al., 2022). Historically, generic models that are better at leveraging computation have also tended to overtake more specialized domain-specific approaches, eventually
(Sutton, 2019).

In this paper, we describe the current iteration of a general-purpose agent which we call Gato, instantiated as a single, large, transformer sequence model. With a single set of weights, Gato can engage in dialogue, caption images, stack blocks with a real robot arm, outperform humans at playing Atari games, navigate in simulated 3D environments, follow instructions, and more.

While no agent can be expected to excel in all imaginable control tasks, especially those far outside of its training distribution, we here test the hypothesis that training an agent which is generally capable on a large number of tasks is possible. We hypothesize that such an agent can be obtained through scaling data, compute and model parameters, continually broadening the training distribution while maintaining performance, towards covering any task, behavior and embodiment of interest.

We focus our training at the operating point of model scale that allows real-time control of real-world robots, currently around 1.2B parameters in the case of Gato. As hardware and model architectures improve, this operating point will naturally increase the feasible model size, pushing generalist models higher up the scaling law curve. For simplicity Gato was trained offline in a purely supervised manner; however, in principle, there is no reason it could not also be trained with either offline or online reinforcement learning (RL).

2 Model

The guiding design principle of Gato is to train on the widest variety of relevant data possible, including diverse modalities such as images, text, proprioception, joint torques, button presses, and other discrete and continuous observations and actions. To enable processing this multi-modal data, we serialize all data into a flat sequence of tokens.
In this representation, Gato can be trained and sampled from akin to a standard large-scale language model. During deployment, sampled tokens are assembled into dialogue responses, captions, button presses, or other actions based on the context.

2.1 Tokenization

There are infinite possible ways to transform data into tokens, including directly using the raw underlying byte stream. Below we report the tokenization scheme we found to produce the best results for Gato at the current scale using contemporary hardware and model architectures.

• Text is encoded via SentencePiece (Kudo & Richardson, 2018) with 32000 subwords into the integer range [0, 32000).
• Images are first transformed into sequences of non-overlapping 16 × 16 patches in raster order, as done in ViT (Dosovitskiy et al., 2020). Each pixel in the image patches is then normalized between [−1, 1] and divided by the square root of the patch size (i.e. √16 = 4).
• Discrete values, e.g. Atari button presses, are flattened into sequences of integers in row-major order.
• Continuous values, e.g. proprioceptive inputs or joint torques, are first flattened into sequences of floating point values in row-major order. The values are mu-law encoded to the range [−1, 1] if not already there (see Figure 14 for details), then discretized to 1024 uniform bins.

After converting data into tokens, we use the following canonical sequence ordering.

• Text tokens in the same order as the raw input text.
• Image patch tokens in raster order.
• Tensors in row-major order.
• Nested structures in lexicographical order by key.
• Agent timesteps as observation tokens followed by a separator, then action tokens.
• Agent episodes as timesteps in time order.
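The continuous-value scheme described above can be sketched in a few lines of Python. This is a minimal illustration, not the authors' implementation: the mu-law constants `mu=100` and `m=256`, the choice to shift the resulting bins past the 32000-entry text vocabulary, and all function names are assumptions made for the example.

```python
import math

TEXT_VOCAB_SIZE = 32000  # text tokens occupy [0, 32000)
NUM_BINS = 1024          # uniform bins for continuous values


def mu_law_encode(x, mu=100.0, m=256.0):
    """Squash a float toward [-1, 1]; mu and m are assumed constants."""
    y = math.copysign(math.log(abs(x) * mu + 1.0) / math.log(m * mu + 1.0), x)
    return max(-1.0, min(1.0, y))


def discretize(y, num_bins=NUM_BINS):
    """Map y in [-1, 1] to an integer bin index in [0, num_bins)."""
    index = int((y + 1.0) / 2.0 * num_bins)
    return min(index, num_bins - 1)  # y == 1.0 falls into the last bin


def flatten_row_major(nested):
    """Flatten nested lists of numbers in row-major order."""
    if isinstance(nested, (int, float)):
        return [float(nested)]
    flat = []
    for item in nested:
        flat.extend(flatten_row_major(item))
    return flat


def tokenize_continuous(values, offset=TEXT_VOCAB_SIZE):
    """Flatten, mu-law encode, discretize, then apply the (assumed) offset."""
    return [offset + discretize(mu_law_encode(v)) for v in flatten_row_major(values)]
```

For example, `tokenize_continuous([[0.0, 256.0], [-256.0, 1.0]])` yields four tokens, one per scalar, read row by row. Decoding an action at deployment time would invert these steps: subtract the offset, map the bin back to a value in [−1, 1], and apply the inverse mu-law transform.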
Further details on tokenizing agent data are presented in the supplementary material (Section B).

2.2 Embedding input tokens and setting output targets

After tokenization and sequencing, we apply a parameterized embedding function *f*(· ; *θe*) to each token (i.e. it is applied to both observations and actions) to produce the final model input.

• Tokens belonging to text, discrete- or continuous-valued observations or actions for any time-step are embedded via a lookup table into a learned vector embedding space.
• Tokens belonging to image patches for any time-step are embedded using a single ResNet (He et al., 2016a) block. For image patch token embeddings, we also add a learnable within-image position encoding vector.

We refer to appendix Section C.3 for full details on the embedding function.

As we model the data autoregressively, each token is potentially also a target label given the previous tokens. Text tokens, discrete and continuous values, and actions can be directly set as targets after tokenization. Image tokens and agent non-textual observations are not currently predicted in Gato, although that may be an interesting direction for future work. Targets for these non-predicted tokens are set to an unused value and their contribution to the loss is masked out.

2.3 Training

Given a sequence of tokens *s*1:*L* and parameters *θ*, we model the data using the chain rule of probability:

$$\log p_\theta(s_1, \ldots, s_L) = \sum_{l=1}^{L} \log p_\theta(s_l \mid s_1, \ldots, s_{l-1})$$

Let *b* index a training batch of sequences *B*. We define a masking function *m* such that *m*(*b, l*) = 1 if the token at index *l* is either from text or from the logged action of an agent, and 0 otherwise. The training loss for a batch *B* can then be written as

$$\mathcal{L}(\theta, \mathcal{B}) = -\sum_{b=1}^{|\mathcal{B}|} \sum_{l=1}^{L} m(b, l) \, \log p_\theta\left(s_l^{(b)} \mid s_1^{(b)}, \ldots, s_{l-1}^{(b)}\right)$$

As described above, Gato's network architecture has two main components: the parameterized embedding function which transforms tokens to token embeddings, and the sequence model which outputs a distribution over the next discrete token.
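To make the masked objective of Section 2.3 concrete, here is a small pure-Python sketch of a negative log-likelihood that only counts positions where the mask *m*(*b, l*) is 1 (text tokens and logged agent actions). The function names and the list-of-lists data layout are illustrative assumptions; a real training setup would use an array library and a transformer to produce the logits.

```python
import math


def log_softmax(logits):
    """Numerically stable log-softmax over a list of floats."""
    peak = max(logits)
    log_norm = peak + math.log(sum(math.exp(x - peak) for x in logits))
    return [x - log_norm for x in logits]


def masked_nll(batch_logits, batch_targets, batch_mask):
    """Sum of -m(b, l) * log p(s_l | s_<l) over every sequence in a batch.

    batch_logits[b][l] holds the model's logits for position l given the
    preceding tokens; batch_mask[b][l] is 1 for text or action tokens and
    0 for tokens that are not predicted (e.g. image patches).
    """
    loss = 0.0
    for logits_seq, targets, mask in zip(batch_logits, batch_targets, batch_mask):
        for logits, target, m in zip(logits_seq, targets, mask):
            if m:  # masked-out positions never index their (unused) target
                loss -= log_softmax(logits)[target]
    return loss
```

Because masked positions are skipped before the target is looked up, their targets can hold any placeholder value, which mirrors the "unused value" convention described in Section 2.2.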
For simplicity and scalability, Gato uses a 1.2B parameter decoder-only transformer (Vaswani et al., 2017) with 24 layers, an embedding size of 2048, and a post-attention feedforward hidden size of 8196 (more details in Section C.1).

Because distinct tasks within a domain can share identical embodiments, observation formats and action specifications, the model sometimes needs further context to disambiguate tasks. We therefore use prompt conditioning, taking inspiration from (Sanh et al., 2022; Wei et al., 2021; Brown et al., 2020). During training, for 25% of the sequences in each batch, a prompt sequence is prepended, coming from an episode generated by the same source agent on the same task. Half of the prompt sequences are from the end of the episode, acting as a form of goal conditioning for many domains; and the other half are uniformly sampled from the episode. During evaluation, the agent can be prompted using a successful demonstration of the desired task, which we do by default in all control results that we present here.

Training of the model is performed on a 16x16 TPU v3 slice for 1M steps with batch size 512 and token sequence length *L* = 1024, which takes about 4 days. Architecture details can be found in Section C. Because agent episodes and documents can easily contain many more tokens than fit into context, we randomly sample subsequences of *L* tokens from the available episodes. Each batch mixes subsequences approximately uniformly over domains (e.g. Atari, MassiveWeb, etc.), with some manual upweighting of larger and higher quality datasets (see Table 1 in Section 3 for details).

2.4 Deployment

Deploying Gato as a policy is illustrated in Figure 3. First a prompt, such as a demonstration, is tokenized, forming the initial sequence. By default, we take the first 1024 tokens of the demonstration. Next the environment yields the first observation which is tokenized and appended to the sequence. Gato samples the action vector autoregressively one token at a time.
Once all tokens comprising the action vector have been sampled (determined by the action specification of the environment), the action is decoded by inverting the tokenization procedure described in Section 2.1. This action is sent to the environment which steps and yields a new observation. The procedure repeats. The model always sees all previous observations and actions in its context window of 1024 tokens. We found it beneficial to use transformer XL memory during deployment (Dai et al., 2019), although it was not used during training.

3 Datasets

Gato is trained on a large number of datasets comprising agent experience in both simulated and real world environments, as well as a variety of natural language and image datasets. The approximate number of tokens per control dataset is computed assuming the tokenization mechanism described in Section 2.1.

3.1 Simulated control tasks

Our control tasks consist of datasets generated by specialist SoTA or near-SoTA reinforcement learning agents trained on a variety of different environments. The simulated environments include Meta-World (Yu et al., 2020) introduced to benchmark meta-reinforcement learning and multi-task learning, Sokoban (Racanière et al., 2017) proposed as a planning problem, BabyAI (Chevalier-Boisvert et al., 2018) for language instruction following in grid-worlds, the DM Control Suite (Tunyasuvunakool et al., 2020) for continuous control, as well as DM Lab (Beattie et al., 2016) designed to teach agents navigation and 3D vision from raw pixels with an egocentric viewpoint. We also use the Arcade Learning Environment (Bellemare et al., 2013) with classic Atari games (we use two sets of games that we call ALE Atari and ALE Atari Extended, see Section F.1 for details).
We also include the Procgen Benchmark (Cobbe et al., 2020) and Modular RL (Huang et al., 2020). We also include four tasks using a simulated Kinova Jaco arm from DM Manipulation Playground, as introduced in Zolna et al. (2020). Section F includes a more in-depth description of these control tasks, along with what RL agent was used to generate the data.

We found it effective to train on a filtered set of episodes with returns at least 80% of the expert return for the task. The expert return measures the maximum sustained performance that the expert agent can achieve. We define it as the maximum over the set of all windowed average returns calculated over all the collected episodes for a task:

$$\max_{j \in [0, 1, \ldots, N - W]} \left( \sum_{i=j}^{j+W-1} \frac{R_i}{W} \right)$$

where *N* is the total number of collected episodes for the task, *W* is the window size, and *Ri* is the total return for episode *i*. To obtain accurate estimates, in practice, we set *W* to be 10% of the total data amount or a minimum of 1000 episodes (i.e. *W* = min(1000, 0.1 × *N*)).

3.2 Vision and language

Gato is trained on MassiveText (Rae et al., 2021), a collection of large English-language text datasets from multiple sources: web pages, books, news articles, and code.

We also included several vision-language datasets in Gato's training. ALIGN (Jia et al., 2021) consists of 1.8B images and their alternative text (alt-text) annotations. LTIP (Long Text & Image Pairs) consists of 312 million images with captions (Alayrac et al., 2022). Conceptual Captions (Sharma et al., 2018) and COCO Captions (Chen et al., 2015) are captioning datasets with 3.3M and 120k image-text pairs respectively. A further dataset (Alayrac et al., 2022) includes 43M webpages where both text and images were extracted. We also included visual question-answering datasets, in particular OKVQA (Marino et al., 2019) and VQAv2 (Antol et al., 2015) with 9K and 443K triplets of images, questions, and answers. To form a training episode from these, we sample five (image, text) pairs, tokenize them, concatenate, and then pad or randomly crop to the required training sequence length.
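The episode-forming step for vision-language data can be sketched as follows. This is only an illustration under stated assumptions: the padding id `PAD_ID`, sampling without replacement, padding on the right, and the `tokenize_image` / `tokenize_text` callables are all hypothetical stand-ins rather than details given in the text.

```python
import random

PAD_ID = 0  # hypothetical padding token id


def build_episode(pairs, tokenize_image, tokenize_text, seq_len, rng, num_pairs=5):
    """Sample (image, text) pairs, tokenize, concatenate, then pad or crop.

    Pairs are drawn without replacement and padding is appended on the
    right; both choices are assumptions made for this sketch.
    """
    tokens = []
    for image, text in rng.sample(pairs, num_pairs):
        tokens.extend(tokenize_image(image))
        tokens.extend(tokenize_text(text))
    if len(tokens) <= seq_len:
        # Too short: pad up to the required training sequence length.
        return tokens + [PAD_ID] * (seq_len - len(tokens))
    # Too long: take a random contiguous crop of exactly seq_len tokens.
    start = rng.randrange(len(tokens) - seq_len + 1)
    return tokens[start:start + seq_len]
```

Passing an explicit `random.Random` instance as `rng` keeps the sampling reproducible, which is convenient when checking that every returned episode has exactly `seq_len` tokens.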
3.3 Robotics - RGB Stacking Benchmark (real and sim)

As a testbed for taking physical actions in the real world, we chose the robotic block stacking environment introduced by Lee et al. (2021). The environment consists of a Sawyer robot arm with 3-DoF cartesian velocity control, an additional DoF for velocity, and a discrete gripper action. The robot's workspace contains three plastic blocks colored red, green and blue with varying shapes. The available observations include two 128 × 128 camera images, robot arm and gripper joint angles as well as the robot's end-effector pose. Notably, ground truth state information for the three objects in the basket is not observed by the agent.

In Skill Generalization, for both simulation and real, we use data collected by the best generalist sim2real agent from Lee et al. (2021). We collected data only when interacting with the designated RGB-stacking training objects (this amounts to a total of 387k successful trajectories in simulation and 15k trajectories in real). For Skill Mastery we used data from the best per group experts from Lee et al. (2021) in simulation and from the best sim2real policy on the real robot (amounting to 219k trajectories in total). Note that this data is only included for specific Skill Mastery experiments in Section 5.4.

4 Capabilities of the generalist agent

In this section, we summarize the performance of Gato when trained on the above described data. That is, all results across all tasks are derived from a single pretrained model with a single set of weights. Results with fine-tuning will be presented in Section 5.
4.1 Simulated control tasks

Figure 5 shows the number of distinct control tasks for which Gato performs above a given score threshold, relative to expert performance demonstrated in Gato's training data.

We report performance as a percentage, where 100% corresponds to the per-task expert and 0% to a random policy. For each simulated control task we trained our model on, we roll out the Gato policy on the corresponding environment 50 times and average the defined scores. As shown in Figure 5, Gato performs over 450 out of 604 tasks at over a 50% expert score threshold.

In ALE Atari (Bellemare et al., 2013) Gato achieves the average human (or better) scores for 23 Atari games. While the single-task online RL agents which generated the data still outperform Gato, this may be overcome by adding capacity or using offline RL training rather than purely supervised training (see Section 5.5, where we present a specialist single-domain ALE Atari agent achieving better than human scores for 44 games).

On BabyAI (Chevalier-Boisvert et al., 2018) Gato achieves over 80% of expert score for nearly all levels. For the most difficult task, called BossLevel, Gato scores 75%. The two other published baselines we could find, BabyAI 1.0 and BabyAI 1.1 (Hui et al., 2020), scored 77% and 90%, respectively, having trained on this single task alone using a million demonstrations.

On Meta-World (Yu et al., 2020) Gato achieves more than 50% for 44 out of the 45 tasks that we trained on, over 80% for 35 tasks, and over 90% for 3 tasks. On canonical DM Control Suite (Tassa et al., 2018), Gato achieves better than 50% of the expert score on 21 out of 30 tasks from state, and more than 80% for 18 tasks.

4.2 Robotics

First person teleoperation enables the collection of expert demonstrations. However, such demonstrations are slow and costly to collect.
Data-efficient behavior cloning methods are therefore desirable for training a generalist robot manipulator, and offline pretraining is thus a well-motivated area of research.

Skill Generalization Performance

The Skill Generalization challenge from the RGB Stacking robotics benchmark tests the agent's ability to stack objects of previously unseen shapes. The agent is trained on a dataset consisting of episodes of the robot stacking objects with a variety of different shapes. Five triplets of object shapes are, however, not included in the training data and serve as test triplets. We evaluated the trained generalist for 200 episodes per test triplet on the real robot. Table 2 shows that our generalist agent's success rate on each test triplet is comparable to the single task BC-IMP (filtered BC) baseline in Lee et al. (2021).

4.3 Text samples

The model demonstrates rudimentary dialogue and image captioning capabilities. Figure 6 contains a representative sample of Gato's image captioning performance. Figure 7 shows some hand-picked examples of plain text dialogue exchange.

5 Analysis

5.1 Scaling Laws Analysis

In Figure 8, we analyze the aggregate in-distribution performance of the pretrained model as a function of the number of parameters in order to get insight into how performance could improve with increased model capacity. We evaluated 3 different model sizes (measured in parameter count): a 79M model, a 364M model, and a 1.18B model (Gato). We refer to Section C for details on the three model architectures.

Here, for all three model sizes we plot the normalized return as training progresses. To get this single value, for each task we calculate the performance of the model as a percentage of expert score (the same as done in Section 4.1).
Then for each domain listed in Table 1 we average the percentage scores across all tasks for that domain. Finally, we mean-aggregate the percentage scores across all domains. We can see that for an equivalent token count, there is a significant performance improvement with increased scale.

5.2 Out of distribution tasks

In this section we want to answer the following question: *Can our agent be used to solve a completely new task efficiently?* For this reason, we held out all data for four tasks from our pre-training set: cartpole.swingup (DM Control Suite domain), assembly-v2 (Meta-World domain), order_of_apples_forage_simple (DM Lab domain), and boxing (ALE Atari domain). These four tasks will serve as testbeds for evaluating the out-of-distribution capabilities of Gato.

Ideally, the agent could potentially learn to adapt to a new task via conditioning on a prompt including demonstrations of desired behaviour. However, due to accelerator memory constraints and the extremely long sequence lengths of tokenized demonstrations, the maximum context length possible does not allow the agent to attend over an informative-enough context. Therefore, to adapt the agent to new tasks or behaviours, we choose to fine-tune the agent's parameters on a limited number of demonstrations of a single task, and then evaluate the fine-tuned model's performance in the environment. Fine-tuning is very similar to pretraining with minor changes, such as a different learning rate schedule; see Section E for details.

We want to measure how the choice of data used during pretraining influences post-fine-tuning performance. To this end, we compare Gato (trained on *all data*) to variants trained on ablated datasets:

1. A model pretrained only on data from the same domain as the task to be fine-tuned on, *same domain only data*.
2. A model pretrained only on non-control data, *no control data*.
3. A model fine-tuned from scratch, i.e. no pretraining at all, *scratch*.
Considering that all these experiments require training a new model from scratch and then also fine-tuning, we present results using the less compute-intensive 364M parameter architecture described in Section 5.1. Results are shown in Figure 9.

Fine-tuning performance on both cartpole.swingup and assembly-v2 tasks, both of which do not require image processing, presents similar trends. Pretraining on all the datasets yields the best results, followed by pretraining on the same domain only. This difference is smaller for assembly-v2 but consistent for all few-shot datasets. For these non-image-based environments, we see either no benefit (cartpole.swingup) or even negative transfer (assembly-v2) when pretraining on *no control* datasets, which only contain images and text data.

Results for DM Lab order_of_apples_forage_simple are slightly different. Pretraining on DM Lab data only is already enough to approach the maximum reward of 19 and hence there is no observable benefit of adding data from different environments. What is different when compared to previously analysed no-vision environments is that pretraining on *no control* data helps, which can possibly be explained by the fact that agents in the DM Lab environment are fed images which, despite being simulated, are natural looking. Therefore, transfer from image captioning or visual grounded question answering tasks is possible.

We were not able to observe any benefit from pretraining on boxing. The randomly initialized model seems to work better than any of the pretrained variants considered. We hypothesise that this is caused by the game's input images being visually very distinct from the other data, suggesting transfer is difficult. We discuss this Atari challenge further in our related work section.
5.3 Fine-tuning on Robotic Stacking Tasks

Section 4.2 demonstrates that the base Gato, capable of a diverse array of tasks, can perform competitively on the RGB Stacking Skill Generalization benchmark. In this section, we would like to answer the following question: *How does our agent improve on robotics tasks when allowed to fine-tune similarly to how we fine-tune on new tasks in Section 5.2?* We consider different model sizes and analyse the impact of pretraining datasets on the Skill Generalization benchmark, as well as a novel out of distribution task. Further analysis of fine-tuning with dataset ablations is in Appendix I.

Skill Generalization

First, we would like to show that fine-tuning on object-specific data, similarly to what was done by Lee et al. (2022), is beneficial. Therefore, we fine-tuned Gato separately on five subsets of demonstrations from the *test* dataset. Each subset was obtained by random partitioning of a test dataset consisting of demonstrations gathered by a generalist sim-to-real agent stacking real test objects. We consider this setting, which is comparable to the fine-tuning baselines on RGB stacking tasks from (Lee et al., 2022), and use the 5k dataset that their behavior cloning 5k results are obtained with. To best match their experiments, we change our return filtering scheme during training: instead of using only successful stacks, we condition on the normalized return of the episode.

Figure 10 compares the success rate of Gato across different fine-tuning data regimes to the sim-to-real expert and a Critic-Regularized Regression (CRR) (Wang et al., 2020) agent. Gato, in both reality and simulation (red curves on the left and right figure, respectively), recovers the expert's performance with only 10 episodes, and peaks at 100 or 1000 episodes of fine-tuning data, where it exceeds the expert.
Fine-tuning and Model Size

To better understand the benefit of large models for few-shot adaptation in robotics domains, we conducted an ablation on model parameter size. This section focuses on in-simulation evaluation. Figure 10 compares the full 1.18B parameter Gato with the smaller 364M and 79M parameter variants for varying amounts of fine-tuning data. Although the 364M model overfits on one episode, causing performance to drop, there is a clear trend towards better adaptation with fewer episodes as the number of parameters is scaled up. The 79M model performs clearly worse than its bigger counterparts. The results suggest that the model's greater capacity allows the model to use representations learned from the diverse training data at test time.

Adaptation to Perceptual Variations

While the Skill Generalization task is an effective benchmark for motor Skill Generalization to shape variations, it does not test the agent's ability to adapt to perceptual variations and permutations in the objective specification. To further evaluate Gato's generalization capabilities, we devised a new task in the RGB stacking benchmark where the goal is to stack the blue object on the green object, for test triplet 1 (see Figure 11). First, we used a 3D mouse to collect 500 demonstrations of this task on the real robot, for a total of 2 hours and 45 minutes of demonstration data, and fine-tuned Gato on these episodes. Notably, all of the simulated and real robotics data in the pretraining set shows the robot successfully stacking the red object on the blue object, and the data does not include the object shapes in the test set. We found that additionally adding simulated demonstrations of the stack blue on green task to the fine-tuning dataset improved performance, and 10% was an ideal sampling ratio for this data.
We achieved a final 60% success rate after evaluating the fine-tuned Gato on the real robot, while a BC baseline trained from scratch on the blue-on-green data achieved only 0.5% success (1/200 episodes).

5.4 Robotics: Skill Mastery

Similarly to the Skill Generalization challenge discussed in Section 4.2, the Skill Mastery challenge consists in training a robotic arm to stack blocks of different shapes. However, Skill Mastery allows the agent to train on data involving the object shapes used for evaluation, i.e. the *test* set in Skill Generalization becomes a part of the Skill Mastery *training* set. Thus, this challenge serves to measure Gato's performance on in-distribution tasks (possibly with initial conditions not seen in the training demonstrations). Our Skill Mastery results use an earlier version of the Gato architecture described in Appendix H, with no fine-tuning.

Table 3 compares the group-wise success percentage and the average success across object groups for Gato and the established BC-IMP baseline. Gato exceeds or closely matches BC-IMP's performance on all but one training triplet.

5.5 Specialist single-domain multi-task agents

In this section we show results obtained with two specialist (rather than generalist) agents. Both of them were trained on data from a single domain only and rolled out 500 times for each training task without any per-task fine-tuning.

Meta-World

The first agent uses the smallest architecture introduced in Section 5.1, i.e. 79M parameters, and is trained on all 50 Meta-World tasks. While Gato has access to the state of the MuJoCo physics engine and unlimited task seeds, the agent presented here has no access to any extra features or tasks and uses the canonical API as in (Yu et al., 2020). This experiment is to show that the architecture proposed in our paper can be used to obtain state-of-the-art agents also at small scale.
The training procedure was to train single-task MPO (Abdolmaleki et al., 2018) experts on each of the MT-50 tasks individually, recording the trajectories produced while training. This experience is then combined, or distilled, into a single agent, which achieves 96.6% success rate averaged over all 50 tasks. To the best of our knowledge this agent is the first one to accomplish nearly 100% average success rate simultaneously (multi-task) for this benchmark. See Table 7 in the supplementary material (Section K) for the full list of tasks and corresponding success rates of our agent.

ALE Atari

We also trained a specialist agent on all 51 ALE Atari tasks. As the Atari domain is much more challenging than Meta-World, we used the Gato architecture with 1.18B parameters.

The resulting agent performs better than the average human for 44 games (see Section 4.1 for details on our evaluation and scoring). We want to note that the performance of the online experts used to generate training data for the other 7 games was also below the average human. Hence, the specialist Atari agent achieved better than human performance for all games where data contained super-human episodes.

The specialist Atari agent outperforms our generalist agent Gato, which achieved super-human performance on 23 games. It suggests that scaling Gato may result in even better performance. We, however, purposely restricted Gato's size such that it can be run in real-time on the real robot.

5.6 Attention Analysis

We rendered the transformer attention weights over the image observations for various tasks, to gain a qualitative sense of how Gato attends to different regions of the image across tasks (see Figure 12). Further details and visualizations for more tasks can be found in Appendix J. These visualizations clearly show that attention tracks the task-relevant objects and regions.

5.7 Embedding Visualization

To understand how Gato encodes information differently per task, we visualized per-task embeddings. We analysed 11 tasks.
For each task, we randomly sample 100 episodes and tokenize each of them. Then, from each episode we take a subsequence of 128 tokens, compute their embeddings (at layer 12, which is half the total depth of the transformer layers) and average them over the sequence. The averaged embeddings for all tasks are used as input to PCA, which reduces their dimensionality to 50. Then, T-SNE is used to get the final 2D embeddings.

Figure 13 shows the final T-SNE embeddings plotted in 2D, colorized by task. Embeddings from the same tasks are clearly clustered together, and task clusters from the same domain and modality are also located close to each other.

6 Related Work

The most closely related architectures to that of Gato are Decision Transformers (Chen et al., 2021b; Reid et al., 2022; Zheng et al., 2022; Furuta et al., 2021) and Trajectory Transformer (Janner et al., 2021), which showed the usefulness of highly generic LM-like architectures for a variety of control problems. Gato also uses an LM-like architecture for control, but with design differences chosen to support multi-modality, multi-embodiment, large scale and general purpose deployment. Pix2Seq (Chen et al., 2022) also uses an LM-based architecture for object detection. Perceiver IO (Jaegle et al., 2021) uses a transformer-derived architecture specialized for very long sequences, to model any modality as a sequence of bytes.

Gato was inspired by works such as GPT-3 (Brown et al., 2020) and Gopher (Rae et al., 2021), pushing the limits of generalist language models; and more recently the Flamingo (Alayrac et al., 2022) generalist visual language model. Chowdhery et al. (2022) developed the 540B parameter Pathways Language Model (PalM) explicitly as a generalist few-shot learner for hundreds of text tasks.
(2022) Future work should consider how to unify these text capabilities into one fully generalist agent that can also act in real time in the real world, in diverse environments and embodiments. Gato also takes inspiration from recent works on multi-embodiment continuous control. used message passing graph networks to build a single locomotor controller for many simulated 2D walker variants. showed that transformers can outperform graph based approaches for incom-patible (i.e. varying embodiment) control, despite not encoding any morphological inductive biases. learn a modular policy for multi-task and multi-robot transfer in simulated 2D manipulation environments. train a universal policy conditioned on a vector representation of robot hardware, showing successful transfer both to simulated held out robot arms, and to a real world sawyer robot arm. Huang et al. (2020) Kurin et al. (2020) Devin et al. (2017) Chen et al. (2018) A variety of earlier generalist models have been developed that, like Gato, operate across highly distinct domains and modalities. NPI Szkolenie z pojedynczym LSTM to execute diverse programs such as sorting an array and adding two numbers, such that the network is able to generalize to larger problem instances than those seen during training. developed the MultiModel that trains jointly on 8 distinct speech, image and text processing tasks including classifica-tion, image captioning and translation. Modality-specific encoders were used to process text, images, audio and categorical data, while the rest of the network parameters are shared across tasks. proposed “ ”, describing a method for the incremental training of an increasingly general problem solver. proposed controllable multi-task language models that can be directed according to language domain, subdomain, entities, relationships between entities, dates, and task-specific behavior. (Reed & De Freitas, 2016) (Hochreiter & Schmidhuber, 1997 roku) Ksiądz i al. 
2017 roku Schmidhuber (2018) one big net for everything Keskar et al. (2019) W tej dyskusji ważne jest rozróżnienie między jedną architekturą sieci wielozadaniowej a jedną siecią neuronalną o tych samych obciążeniach dla wszystkich zadań. However, it is much more common to use the same policy architecture and hyper-parameters across tasks, but the policy parameters are different in each task This is also true of state-of-the-art RL methods applied to board games Moreover, this choice has been adopted by off-line RL benchmarks i ostatnie prace nad sieciami neuronowymi o dużej sekwencji do kontroli, w tym transformatorami decyzji and the Trajectory Transformer of In contrast, in this work we learn a single network with the same weights across a diverse set of tasks. (Espeholt et al., 2018; Song et al., w 2020 roku; Hessel et al., 2019 roku). (Mnih et al., 2015; Tassa et al., 2018 roku). (Schrittwieser et al., 2020). (Gulcehre et al., 2020; Fu et al., 2020) (Chen et al., 2021b; Reid et al., 2022; Zbigniew et al., 2022) Janner et al. (2021). Recent position papers advocate for highly generalist models, notably proposing one big net for everything, and Jednakże, do naszej wiedzy, nie ma jeszcze jednego generalisty przeszkolonego w setkach zadań wizji, języka i sterowania przy użyciu nowoczesnych sieci transformatorów na skalę. Schmidhuber (2018) jako Bommasani et al. (2021) “Single-brain”-style models have interesting connections to neuroscience. famously stated that “ ”. Mountcastle found that columns of neurons in the cortex behave similarly whether associated with vision, hearing or motor control. This has motivated arguments that we may only need one algorithm or model to build intelligence Mountcastle (1978) funkcja przetwarzania modułów neokortykalnych jest jakościowo podobna we wszystkich regionach neokortykalnych. Krótko mówiąc, nie ma nic wewnętrznie motorycznego na temat kory ruchowej, ani sensorycznego na temat kory zmysłowej (Hawkins & Blakeslee, 2004). 
Sensory substitution provides another argument for a single model (Bach-y Rita & Kercel, 2003). For example, it is possible to build tactile visual aids for blind people as follows. The signal captured by a camera can be sent via an electrode array on the tongue to the brain. The visual cortex learns to process and interpret these tactile signals, endowing the person with some form of “vision”. This suggests that, no matter the type of input signal, the same network can process it to useful effect.

Our work is based on deep autoregressive models, which have a long history and can be found in generative models of text, images, video and audio. Combining autoregressive generation with transformers (Vaswani et al., 2017; Devlin et al., 2018) has been of enormous impact in language modelling (Brown et al., 2020; Rae et al., 2021), protein folding (Jumper et al., 2021), vision-language models (Tsimpoukelli et al., 2021; Wang et al., 2021; Alayrac et al., 2022), code generation (Chen et al., 2021c; Li et al., 2022b), dialogue systems with retrieval capabilities (Nakano et al., 2021; Thoppilan et al., 2022), speech recognition (Pratap et al., 2020), neural machine translation (Johnson et al., 2019) and more (Bommasani et al., 2021). Recently researchers have explored task decomposition and grounding with language models (Huang et al., 2022; Ahn et al., 2022).

Li et al. (2022a) construct a control architecture, consisting of a sequence tokenizer, a pretrained language model and a task-specific feed-forward network. They apply it to VirtualHome and BabyAI tasks, and find that the inclusion of the pretrained language model improves generalisation to novel tasks. Similarly, Parisi et al. (2022) demonstrate that vision models pretrained with self-supervised learning, especially crop segmentations and momentum contrast (He et al., 2020), can be effectively incorporated into control policies.

As mentioned earlier, transfer in Atari is challenging.
Rusu et al. (2016) researched transfer between randomly selected Atari games. They found that Atari is a difficult domain for transfer because of pronounced differences in the visuals, controls and strategy among the different games. Further difficulties that arise when applying behaviour cloning to video games like Atari are discussed by Kanervisto et al. (2020).

There has been great recent interest in data-driven robotics (Cabi et al., 2019; Chen et al., 2021a). However, Bommasani et al. (2021) note that in robotics “unlike language and vision data, robotics data is neither plentiful nor representative of a sufficiently diverse array of embodiments, tasks, and environments”. Moreover, every time we update the hardware in a robotics lab, we need to collect new data and retrain. We argue that this is precisely why we need a generalist agent that can adapt to new embodiments and learn new tasks with little data.

Generating actions using an autoregressive model can lead to causal “self-delusion” biases when there are confounding variables (Ortega et al., 2021). For example, sampling actions can condition the model to solve the wrong task when multiple tasks share similar observation and action specifications. As explained in Section 2, we use prompt engineering in ambiguous tasks, conditioning our model on a successful demonstration. This screens off confounding variables, reducing self-delusions. Another solution which we did not explore in this work is to use counterfactual teaching, where we train a model online using instantaneous expert feedback. We leave this for future investigation.

7 Broader Impact

Although generalist agents are still only an emerging area of research, their potential impact on society calls for a thorough interdisciplinary analysis of their risks and benefits.
For the sake of transparency, we document the intended use cases of Gato in the model card in Appendix A. However, the tools for mitigating harms of generalist agents are relatively underdeveloped, and require further research before these agents are deployed.

Since our generalist agent can act as a vision-language model, it inherits similar concerns as discussed in (Weidinger et al., 2021; Bommasani et al., 2021; Rae et al., 2021; Alayrac et al., 2022). In addition, generalist agents can take actions in the physical world, posing new challenges that may require novel mitigation strategies. For example, physical embodiment could lead to users anthropomorphizing the agent, leading to misplaced trust in the case of a malfunctioning system, or be exploitable by bad actors. Additionally, while cross-domain knowledge transfer is often a goal in ML research, it could create unexpected and undesired outcomes if certain behaviors (e.g. arcade game fighting) are transferred to the wrong context. The ethics and safety considerations of knowledge transfer may require substantial new research as generalist systems advance.

Technical AGI safety (Bostrom, 2017) may also become more challenging when considering generalist agents that operate in many embodiments. For this reason, preference learning, uncertainty modeling and value alignment (Russell, 2019) are especially important for the design of human-compatible generalist agents. It may be possible to extend some of the value alignment approaches for language (Ouyang et al., 2022; Kenton et al., 2021) to generalist agents. However, even as technical solutions are developed for value alignment, generalist systems could still have negative societal impacts even with the intervention of well-intentioned designers, due to unforeseen circumstances or limited oversight (Amodei et al., 2016). This limitation underscores the need for a careful design and a deployment process that incorporates multiple disciplines and viewpoints.

Understanding how the models process information, and any emergent capabilities, requires significant experimentation. External retrieval (Borgeaud et al., 2021; Menick et al., 2022; Nakano et al., 2021; Thoppilan et al., 2022) has been shown to improve both interpretability and performance, and hence should be considered in future designs of generalist agents.

Although still at the proof-of-concept stage, the recent progress in generalist models suggests that safety researchers, ethicists, and most importantly, the general public, should consider their risks and benefits. We are not currently deploying Gato to any users, and so anticipate no immediate societal impact. However, given their potential impact, generalist models should be developed thoughtfully and deployed in a way that promotes the health and vitality of humanity.

8 Limitations and Future work

8.1 RL data collection

Gato is a data-driven approach, as it is derived from imitation learning. While natural language or image datasets are relatively easy to obtain from the web, a web-scale dataset for control tasks is not currently available. This may seem at first to be problematic, especially when scaling Gato to a higher number of parameters.

That being said, there has already been extensive investigation into this issue. Offline RL aims at leveraging existing control datasets, and its increasing popularity has already resulted in the availability of more diverse and larger datasets. Richer environments and simulations are being built (e.g. Metaverse), and increasing numbers of users already interact with them among thousands of already deployed online games (e.g. there exists a large dataset of Starcraft 2 games). Real-life data has also already been stored for ML research purposes; for example, data for training self-driving cars is acquired by recording human drivers.
Finally, while Gato uses data consisting of both observations and corresponding actions, the possibility of using large scale observation-only data to enhance agents has already been studied (Baker et al., 2022). Thanks to online video sharing and streaming platforms such as YouTube and Twitch, observation-only datasets are not significantly more difficult to collect than natural language datasets, motivating a future research direction to extend Gato to learn from web data.

While the previous paragraph focuses on alleviating drawbacks of data collection from RL agents, it is important to note that this approach presents a different set of tradeoffs compared to scraping web data and can actually be more practical in some situations. Once the simulation is set up and a near-SOTA agent is trained, it can be used to generate massive amounts of high quality data. This contrasts with web data, which is notorious for its low quality.

In short, we believe that acquiring suitable data is a research question in its own right, and this is an active area of research with growing momentum and importance.

8.2 Prompt and short context

Gato is prompted with an expert demonstration, which helps the agent output actions corresponding to the given task. This is particularly useful since there is otherwise no task identifier available to the agent (in contrast to many multi-task RL settings). Gato infers the relevant task from the observations and actions in the prompt.

However, the context length of our agent is limited to 1024 tokens, which means the agent sometimes attends to only a few environment timesteps in total. This is especially the case for environments with image observations, where depending on the resolution each observation can result in more than one hundred tokens. Hence for certain environments only a short chunk of a demonstration episode fits in the transformer memory.
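The arithmetic behind this limitation can be sketched as follows. This is a minimal illustration and not Gato's code: the function name `timesteps_in_context` and the per-timestep token counts are assumptions chosen for illustration; only the 1024-token context length comes from the text.

```python
CONTEXT_LENGTH = 1024  # context length in tokens, as stated in the text


def timesteps_in_context(tokens_per_timestep: int,
                         context_length: int = CONTEXT_LENGTH) -> int:
    """Return how many whole environment timesteps fit in one context window."""
    if tokens_per_timestep <= 0:
        raise ValueError("tokens_per_timestep must be positive")
    return context_length // tokens_per_timestep


# Hypothetical token budgets per timestep (illustrative values, not from the paper).
hypothetical_budgets = {
    "low-dimensional control": 20,
    "image observation": 100,
    "higher-resolution image observation": 300,
}

for name, tokens in hypothetical_budgets.items():
    print(f"{name}: {tokens} tokens/step -> {timesteps_in_context(tokens)} timesteps")
```

Under these assumed budgets, an environment costing 100 tokens per timestep fits about ten timesteps in the window, and one costing 300 tokens fits only three, which is why a prompt can hold just a short chunk of a demonstration episode.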
Similarly, early evaluations of the model using prompt-based in-context learning on new environments did not show a significant performance boost compared to prompt-less evaluation in the same setting.

Context length is therefore a current limitation of our architecture, mainly due to the quadratic scaling of self-attention. Many recently proposed architectures enable a longer context at greater efficiency, and these innovations could potentially improve our agent's performance.

9 Conclusions

Transformer sequence models are effective as multi-task multi-embodiment policies, including for real-world text, vision and robotics tasks. They show promise as well in few-shot out-of-distribution task learning. In the future, such models could be used as a default starting point via prompting or fine-tuning to learn new behaviors, rather than training from scratch.

Given scaling law trends, the performance across all tasks including dialogue will increase with scale in parameters, data and compute. Better hardware and network architectures will allow training bigger models while maintaining real-time robot control capability. By scaling up and iterating on this same basic approach, we can build a useful general-purpose agent.

Acknowledgments

We would like to thank Dan Horgan, Manuel Kroiss, Mantas Pajarskas, and Thibault Sottiaux for their help with data storage infrastructure; Jean-Baptiste Lespiau and Fan Yang for help on concurrent evaluation; Joel Veness for advising on the model design; Koray Kavukcuoglu for helping inspire the project and facilitating feedback; Tom Erez for advising on the agent design and task selection for continuous control; Igor Babuschkin for helping code the initial prototype; Jack Rae for advising on the transformer language model codebase; Thomas Lampe for building robot infrastructure and advising on real robotics experiments; Boxi Wu for input on ethics and safety considerations; Pedro A.
Ortega for advice in regard to causality and self-delusion biases.

Author Contributions

Scott Reed developed the project concept, wrote the initial prototype, and led the project overall.
Konrad Żołna led architecture development for vision and text, built infrastructure for tokenization and prompting, and contributed heavily to overall agent development and evaluation.
Emilio Parisotto led work on optimizing the transformer architecture, ran the largest number of experiments, and analyzed scaling law properties and in-distribution agent performance.
Sergio Gómez Colmenarejo was the technical lead, responsible for creating a scalable data loader and evaluator supporting hundreds of tasks at once, and for the initial robot integration with Gato.
Alexander Novikov developed the model including the sampler for the initial prototype, carried out experiments focusing on robotics, and created visualizations.
Gabriel Barth-Maron built scalable storage infrastructure to provide Gato with SoTA-level agent experience in Atari and other domains.
Mai Giménez conducted large scale agent data collection, built substantial data loading infrastructure, and integrated large scale visual-language datasets into the training of Gato.
Yury Sulsky contributed broadly to the Gato codebase, including a bespoke distributed training sequence loader, and led the development of benchmarks for out-of-distribution generalization and the training of competitive baseline agents.
Jackie Kay supported physical robotics infrastructure, conducted numerous evaluations and experiments to analyze the generalization properties of Gato, and contemplated broader ethical impact.
Jost Tobias Springenberg guided Gato's deployment to the physical robot, provided strong existing baselines for block stacking, and advised on model development and experimental design.
Tom Eccles developed the Gato dialogue and image captioning demonstrations, allowing users to easily probe the vision and language capacities of agents in development.
Jake Bruce contributed to agent design as well as control datasets and environments with randomized physics and morphology variations.
Ali Razavi helped in exploring vision architectures.
Ashley Edwards contributed to the first prototype of Gato that worked on Atari, in addition to exploring alternative network architectures and training objectives.
Nicolas Heess advised on agent design, experiment design and task selection, especially for continuous control applications.
Yutian Chen advised on model design and experiments, and provided feedback in regular meetings.
Raia Hadsell advised on the design and planning of robotics efforts.
Oriol Vinyals advised on all aspects of the project, especially model architecture, training strategies and benchmark design.
Mahyar Bordbar was the primary project manager; eliciting key goals, tracking progress, facilitating presentations and feedback, and coordinating resource planning.
Nando de Freitas oversaw the project from its inception.

References

Abbas Abdolmaleki, Jost Tobias Springenberg, Yuval Tassa, Remi Munos, Nicolas Heess, and Martin Riedmiller. Maximum a posteriori policy optimisation. Preprint arXiv:1806.06920, 2018.
Samira Abnar and Willem Zuidema. Quantifying attention flow in transformers. Preprint arXiv:2005.00928, 2020.
Michael Ahn, Anthony Brohan, Noah Brown, Yevgen Chebotar, Omar Cortes, Byron David, Chelsea Finn, Keerthana Gopalakrishnan, Karol Hausman, Alex Herzog, et al. Do as I can, not as I say: Grounding language in robotic affordances. Preprint arXiv:2204.01691, 2022.
Jean-Baptiste Alayrac, Jeff Donahue, Pauline Luc, Antoine Miech, Iain Barr, Yana Hasson, Karel Lenc, Arthur Mensch, Katie Millican, Malcolm Reynolds, Roman Ring, Eliza Rutherford, Serkan Cabi, Tengda Han, Zhitao Gong, Sina Samangooei, Marianne Monteiro, Jacob Menick, Sebastian Borgeaud, Andy Brock, Aida Nematzadeh, Sahand Sharifzadeh, Mikolaj Binkowski, Ricardo Barreira, Oriol Vinyals, Andrew Zisserman, and Karen Simonyan. Flamingo: a visual language model for few-shot learning. Preprint arXiv:2204.14198, 2022.
Dario Amodei, Chris Olah, Jacob Steinhardt, Paul F. Christiano, John Schulman, and Dan Mané. Concrete problems in AI safety. Preprint arXiv:1606.06565, 2016.
Stanislaw Antol, Aishwarya Agrawal, Jiasen Lu, Margaret Mitchell, Dhruv Batra, C Lawrence Zitnick, and Devi Parikh. VQA: Visual question answering. In International Conference on Computer Vision, pp. 2425–2433, 2015.
Jimmy Lei Ba, Jamie Ryan Kiros, and Geoffrey E. Hinton. Layer normalization. Preprint arXiv:1607.06450, 2016.
Paul Bach-y Rita and Stephen W Kercel. Sensory substitution and the human-machine interface. Trends in Cognitive Sciences, 7(12):541–546, 2003.
Bowen Baker, Ilge Akkaya, Peter Zhokhov, Joost Huizinga, Jie Tang, Adrien Ecoffet, Brandon Houghton, Raul Sampedro, and Jeff Clune. Video pretraining (VPT): Learning to act by watching unlabeled online videos. Preprint arXiv:2206.11795, 2022.
Gabriel Barth-Maron, Matthew W Hoffman, David Budden, Will Dabney, Dan Horgan, Dhruva Tb, Alistair Muldal, Nicolas Heess, and Timothy Lillicrap. Distributed distributional deterministic policy gradients. Preprint arXiv:1804.08617, 2018.
Charles Beattie, Joel Z Leibo, Denis Teplyashin, Tom Ward, Marcus Wainwright, Heinrich Küttler, Andrew Lefrancq, Simon Green, Víctor Valdés, Amir Sadik, et al. DeepMind lab. Preprint arXiv:1612.03801, 2016.
Marc G Bellemare, Yavar Naddaf, Joel Veness, and Michael Bowling. The arcade learning environment: An evaluation platform for general agents. Journal of Artificial Intelligence Research, 47:253–279, 2013.
Rishi Bommasani, Drew A Hudson, Ehsan Adeli, Russ Altman, Simran Arora, Sydney von Arx, Michael S Bernstein, Jeannette Bohg, Antoine Bosselut, Emma Brunskill, et al. On the opportunities and risks of foundation models. Preprint arXiv:2108.07258, 2021.
Sebastian Borgeaud, Arthur Mensch, Jordan Hoffmann, Trevor Cai, Eliza Rutherford, Katie Millican, George van den Driessche, Jean-Baptiste Lespiau, Bogdan Damoc, Aidan Clark, et al. Improving language models by retrieving from trillions of tokens. Preprint arXiv:2112.04426, 2021.
Nick Bostrom. Superintelligence. Dunod, 2017.
Greg Brockman, Vicki Cheung, Ludwig Pettersson, Jonas Schneider, John Schulman, Jie Tang, and Wojciech Zaremba. OpenAI Gym. Preprint arXiv:1606.01540, 2016.
TB Brown, B Mann, N Ryder, M Subbiah, J Kaplan, P Dhariwal, A Neelakantan, P Shyam, G Sastry, A Askell, et al. Language models are few-shot learners. In Advances in Neural Information Processing Systems, pp. 1877–1901, 2020.
Serkan Cabi, Sergio Gómez Colmenarejo, Alexander Novikov, Ksenia Konyushkova, Scott Reed, Rae Jeong, Konrad Zolna, Yusuf Aytar, David Budden, Mel Vecerik, et al. Scaling data-driven robotics with reward sketching and batch reinforcement learning. Preprint arXiv:1909.12200, 2019.
Annie S Chen, Suraj Nair, and Chelsea Finn. Learning generalizable robotic reward functions from “in-the-wild” human videos. Preprint arXiv:2103.16817, 2021a.
Lili Chen, Kevin Lu, Aravind Rajeswaran, Kimin Lee, Aditya Grover, Misha Laskin, Pieter Abbeel, Aravind Srinivas, and Igor Mordatch. Decision transformer: Reinforcement learning via sequence modeling. Advances in Neural Information Processing Systems, 34, 2021b.
Mark Chen, Jerry Tworek, Heewoo Jun, Qiming Yuan, Henrique Ponde de Oliveira Pinto, Jared Kaplan, Harri Edwards, Yuri Burda, Nicholas Joseph, Greg Brockman, et al. Evaluating large language models trained on code. Preprint arXiv:2107.03374, 2021c.
Tao Chen, Adithyavairavan Murali, and Abhinav Gupta. Hardware conditioned policies for multi-robot transfer learning. Advances in Neural Information Processing Systems, 31, 2018.
Ting Chen, Saurabh Saxena, Lala Li, David J Fleet, and Geoffrey Hinton. Pix2seq: A language modeling framework for object detection. In ICLR, 2022.
Xinlei Chen, Hao Fang, Tsung-Yi Lin, Ramakrishna Vedantam, Saurabh Gupta, Piotr Dollár, and C Lawrence Zitnick. Microsoft COCO captions: Data collection and evaluation server. Preprint arXiv:1504.00325, 2015.
Maxime Chevalier-Boisvert, Dzmitry Bahdanau, Salem Lahlou, Lucas Willems, Chitwan Saharia, Thien Huu Nguyen, and Yoshua Bengio. BabyAI: A platform to study the sample efficiency of grounded language learning. Preprint arXiv:1810.08272, 2018.
Aakanksha Chowdhery, Sharan Narang, Jacob Devlin, Maarten Bosma, Gaurav Mishra, Adam Roberts, Paul Barham, Hyung Won Chung, Charles Sutton, Sebastian Gehrmann, et al. PaLM: Scaling language modeling with pathways. Preprint arXiv:2204.02311, 2022.
Karl Cobbe, Chris Hesse, Jacob Hilton, and John Schulman. Leveraging procedural generation to benchmark reinforcement learning. In International Conference on Machine Learning, pp. 2048–2056, 2020.
Zihang Dai, Zhilin Yang, Yiming Yang, Jaime G Carbonell, Quoc Le, and Ruslan Salakhutdinov. Transformer-XL: Attentive language models beyond a fixed-length context. In Annual Meeting of the Association for Computational Linguistics, pp. 2978–2988, 2019.
Coline Devin, Abhishek Gupta, Trevor Darrell, Pieter Abbeel, and Sergey Levine. Learning modular neural network policies for multi-task and multi-robot transfer. In IEEE International Conference on Robotics & Automation, pp. 2169–2176, 2017.
Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. BERT: Pre-training of deep bidirectional transformers for language understanding. Preprint arXiv:1810.04805, 2018.
Alexey Dosovitskiy, Lucas Beyer, Alexander Kolesnikov, Dirk Weissenborn, Xiaohua Zhai, Thomas Unterthiner, Mostafa Dehghani, Matthias Minderer, Georg Heigold, Sylvain Gelly, et al. An image is worth 16x16 words: Transformers for image recognition at scale. Preprint arXiv:2010.11929, 2020.
Lasse Espeholt, Hubert Soyer, Remi Munos, Karen Simonyan, Vlad Mnih, Tom Ward, Yotam Doron, Vlad Firoiu, Tim Harley, Iain Dunning, et al. Impala: Scalable distributed deep-RL with importance weighted actor-learner architectures. In International Conference on Machine Learning, pp. 1407–1416, 2018.
Justin Fu, Aviral Kumar, Ofir Nachum, George Tucker, and Sergey Levine. D4RL: Datasets for deep data-driven reinforcement learning. Preprint arXiv:2004.07219, 2020.
Hiroki Furuta, Yutaka Matsuo, and Shixiang Shane Gu. Generalized decision transformer for offline hindsight information matching. Preprint arXiv:2111.10364, 2021.
Caglar Gulcehre, Ziyu Wang, Alexander Novikov, Thomas Paine, Sergio Gómez, Konrad Zolna, Rishabh Agarwal, Josh S Merel, Daniel J Mankowitz, Cosmin Paduraru, et al. RL unplugged: A suite of benchmarks for offline reinforcement learning. Advances in Neural Information Processing Systems, 33:7248–7259, 2020.
Jeff Hawkins and Sandra Blakeslee. On intelligence. Macmillan, 2004.
Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image recognition. In IEEE Computer Vision and Pattern Recognition, pp. 770–778, 2016a.
Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Identity mappings in deep residual networks. In European Conference on Computer Vision, pp. 630–645, 2016b.
Kaiming He, Haoqi Fan, Yuxin Wu, Saining Xie, and Ross Girshick. Momentum contrast for unsupervised visual representation learning. In IEEE Computer Vision and Pattern Recognition, pp. 9729–9738, 2020.
Dan Hendrycks and Kevin Gimpel. Gaussian error linear units (GELUs). Preprint arXiv:1606.08415, 2016.
Matteo Hessel, Hubert Soyer, Lasse Espeholt, Wojciech Czarnecki, Simon Schmitt, and Hado van Hasselt. Multi-task deep reinforcement learning with popart. In AAAI, 2019.
Matteo Hessel, Ivo Danihelka, Fabio Viola, Arthur Guez, Simon Schmitt, Laurent Sifre, Theophane Weber, David Silver, and Hado van Hasselt. Muesli: Combining improvements in policy optimization. Preprint arXiv:2104.06159, 2021.
Sepp Hochreiter and Jürgen Schmidhuber. Long short-term memory. Neural computation, 9(8):1735–1780, 1997.
Jordan Hoffmann, Sebastian Borgeaud, Arthur Mensch, Elena Buchatskaya, Trevor Cai, Eliza Rutherford, Diego de Las Casas, Lisa Anne Hendricks, Johannes Welbl, Aidan Clark, et al. Training compute-optimal large language models. Preprint arXiv:2203.15556, 2022.
Gao Huang, Yu Sun, Zhuang Liu, Daniel Sedra, and Kilian Weinberger. Deep networks with stochastic depth. Preprint arXiv:1603.09382, 2016.
Wenlong Huang, Igor Mordatch, and Deepak Pathak. One policy to control them all: Shared modular policies for agent-agnostic control. In International Conference on Machine Learning, pp. 4455–4464, 2020.
Wenlong Huang, Pieter Abbeel, Deepak Pathak, and Igor Mordatch. Language models as zero-shot planners: Extracting actionable knowledge for embodied agents. Preprint arXiv:2201.07207, 2022.
David Yu-Tung Hui, Maxime Chevalier-Boisvert, Dzmitry Bahdanau, and Yoshua Bengio. BabyAI 1.1. Preprint arXiv:2007.12770, 2020.
Andrew Jaegle, Sebastian Borgeaud, Jean-Baptiste Alayrac, Carl Doersch, Catalin Ionescu, David Ding, Skanda Koppula, Daniel Zoran, Andrew Brock, Evan Shelhamer, et al. Perceiver IO: A general architecture for structured inputs & outputs. Preprint arXiv:2107.14795, 2021.
Michael Janner, Qiyang Li, and Sergey Levine. Offline reinforcement learning as one big sequence modeling problem. Advances in Neural Information Processing Systems, 34, 2021.
Chao Jia, Yinfei Yang, Ye Xia, Yi-Ting Chen, Zarana Parekh, Hieu Pham, Quoc Le, Yun-Hsuan Sung, Zhen Li, and Tom Duerig. Scaling up visual and vision-language representation learning with noisy text supervision. In International Conference on Machine Learning, pp. 4904–4916, 2021.
Melvin Johnson, Orhan Firat, and Roee Aharoni. Massively multilingual neural machine translation. In Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pp. 3874–3884, 2019.
John Jumper, Richard Evans, Alexander Pritzel, Tim Green, Michael Figurnov, Olaf Ronneberger, Kathryn Tunyasuvunakool, Russ Bates, Augustin Žídek, Anna Potapenko, et al. Highly accurate protein structure prediction with AlphaFold. Nature, 596(7873):583–589, 2021.
Lukasz Kaiser, Aidan N Gomez, Noam Shazeer, Ashish Vaswani, Niki Parmar, Llion Jones, and Jakob Uszkoreit. One model to learn them all. Preprint arXiv:1706.05137, 2017.
Anssi Kanervisto, Joonas Pussinen, and Ville Hautamäki. Benchmarking end-to-end behavioural cloning on video games. In IEEE Conference on Games (CoG), pp. 558–565, 2020.
Jared Kaplan, Sam McCandlish, Tom Henighan, Tom B Brown, Benjamin Chess, Rewon Child, Scott Gray, Alec Radford, Jeffrey Wu, and Dario Amodei. Scaling laws for neural language models. Preprint arXiv:2001.08361, 2020.
Steven Kapturowski, Georg Ostrovski, John Quan, Remi Munos, and Will Dabney. Recurrent experience replay in distributed reinforcement learning. In International Conference on Learning Representations, 2018.
Zachary Kenton, Tom Everitt, Laura Weidinger, Iason Gabriel, Vladimir Mikulik, and Geoffrey Irving. Alignment of language agents. Preprint arXiv:2103.14659, 2021.
Nitish Shirish Keskar, Bryan McCann, Lav R Varshney, Caiming Xiong, and Richard Socher. CTRL: A conditional transformer language model for controllable generation. Preprint arXiv:1909.05858, 2019.
Diederik P. Kingma and Jimmy Ba. Adam: A method for stochastic optimization. Preprint arXiv:1412.6980, 2014.
Taku Kudo and John Richardson. SentencePiece: A simple and language independent subword tokenizer and detokenizer for neural text processing. In Annual Meeting of the Association for Computational Linguistics, pp. 66–71, 2018.
Vitaly Kurin, Maximilian Igl, Tim Rocktäschel, Wendelin Boehmer, and Shimon Whiteson. My body is a cage: the role of morphology in graph-based incompatible control. Preprint arXiv:2010.01856, 2020.
Alex X Lee, Coline Manon Devin, Yuxiang Zhou, Thomas Lampe, Konstantinos Bousmalis, Jost Tobias Springenberg, Arunkumar Byravan, Abbas Abdolmaleki, Nimrod Gileadi, David Khosid, et al. Beyond pick-and-place: Tackling robotic stacking of diverse shapes. In Conference on Robot Learning, 2021.
Alex X Lee, Coline Manon Devin, Jost Tobias Springenberg, Yuxiang Zhou, Thomas Lampe, Abbas Abdolmaleki, and Konstantinos Bousmalis. How to spend your robot time: Bridging kickstarting and offline reinforcement learning for vision-based robotic manipulation. Preprint arXiv:2205.03353, 2022.
Shuang Li, Xavier Puig, Chris Paxton, Yilun Du, Clinton Wang, Linxi Fan, Tao Chen, De-An Huang, Ekin Akyürek, Anima Anandkumar, Jacob Andreas, Igor Mordatch, Antonio Torralba, and Yuke Zhu. Pre-trained language models for interactive decision-making. Preprint arXiv:2202.01771, 2022a.
Yujia Li, David Choi, Junyoung Chung, Nate Kushman, Julian Schrittwieser, Rémi Leblond, Tom Eccles, James Keeling, Felix Gimeno, Agustin Dal Lago, et al. Competition-level code generation with AlphaCode. Preprint arXiv:2203.07814, 2022b.
Ilya Loshchilov and Frank Hutter. Decoupled weight decay regularization. Preprint arXiv:1711.05101, 2017.
Kenneth Marino, Mohammad Rastegari, Ali Farhadi, and Roozbeh Mottaghi. OK-VQA: A visual question answering benchmark requiring external knowledge. In IEEE Computer Vision and Pattern Recognition, pp. 3195–3204, 2019.
Jacob Menick, Maja Trebacz, Vladimir Mikulik, John Aslanides, Francis Song, Martin Chadwick, Mia Glaese, Susannah Young, Lucy Campbell-Gillingham, Geoffrey Irving, et al. Teaching language models to support answers with verified quotes. Preprint arXiv:2203.11147, 2022.
Margaret Mitchell, Simone Wu, Andrew Zaldivar, Parker Barnes, Lucy Vasserman, Ben Hutchinson, Elena Spitzer, Inioluwa Deborah Raji, and Timnit Gebru. Model cards for model reporting. In Proceedings of the Conference on Fairness, Accountability, and Transparency, pp. 220–229, 2019.
Volodymyr Mnih, Koray Kavukcuoglu, David Silver, Andrei A Rusu, Joel Veness, Marc G Bellemare, Alex Graves, Martin Riedmiller, Andreas K Fidjeland, Georg Ostrovski, et al. Human-level control through deep reinforcement learning. Nature, 518(7540):529–533, 2015.
Vernon Mountcastle. An organizing principle for cerebral function: the unit module and the distributed system. The Mindful Brain, 1978.
Reiichiro Nakano, Jacob Hilton, Suchir Balaji, Jeff Wu, Long Ouyang, Christina Kim, Christopher Hesse, Shantanu Jain, Vineet Kosaraju, William Saunders, et al. WebGPT: Browser-assisted question-answering with human feedback. Preprint arXiv:2112.09332, 2021.
Aaron van den Oord, Sander Dieleman, Heiga Zen, Karen Simonyan, Oriol Vinyals, Alex Graves, Nal Kalchbrenner, Andrew Senior, and Koray Kavukcuoglu. WaveNet: A generative model for raw audio. Preprint arXiv:1609.03499, 2016.
Pedro A Ortega, Markus Kunesch, Grégoire Delétang, Tim Genewein, Jordi Grau-Moya, Joel Veness, Jonas Buchli, Jonas Degrave, Bilal Piot, Julien Perolat, et al. Shaking the foundations: delusions in sequence models for interaction and control. Preprint arXiv:2110.10819, 2021.
Long Ouyang, Jeff Wu, Xu Jiang, Diogo Almeida, Carroll L Wainwright, Pamela Mishkin, Chong Zhang, Sandhini Agarwal, Katarina Slama, Alex Ray, et al. Training language models to follow instructions with human feedback. Preprint arXiv:2203.02155, 2022.
Simone Parisi, Aravind Rajeswaran, Senthil Purushwalkam, and Abhinav Gupta. The unsurprising effectiveness of pre-trained vision models for control. Preprint arXiv:2203.03580, 2022.
Vineel Pratap, Anuroop Sriram, Paden Tomasello, Awni Hannun, Vitaliy Liptchinsky, Gabriel Synnaeve, and Ronan Collobert. Massively multilingual ASR: 50 languages, 1 model, 1 billion parameters. Preprint arXiv:2007.03001, 2020.
Sébastien Racanière, Théophane Weber, David Reichert, Lars Buesing, Arthur Guez, Danilo Jimenez Rezende, Adrià Puigdomènech Badia, Oriol Vinyals, Nicolas Heess, Yujia Li, et al. Imagination-augmented agents for deep reinforcement learning. Advances in Neural Information Processing Systems, 30, 2017.
Jack W Rae, Sebastian Borgeaud, Trevor Cai, Katie Millican, Jordan Hoffmann, Francis Song, John Aslanides, Sarah Henderson, Roman Ring, Susannah Young, et al. Scaling language models: Methods, analysis & insights from training Gopher. Preprint arXiv:2112.11446, 2021.
Scott Reed and Nando De Freitas. Neural programmer-interpreters. In International Conference on Learning Representations, 2016.
Machel Reid, Yutaro Yamada, and Shixiang Shane Gu. Can Wikipedia help offline reinforcement learning? Preprint arXiv:2201.12122, 2022.
Stuart Russell. Human compatible: Artificial intelligence and the problem of control. Penguin, 2019.
Andrei A Rusu, Neil C Rabinowitz, Guillaume Desjardins, Hubert Soyer, James Kirkpatrick, Koray Kavukcuoglu, Razvan Pascanu, and Raia Hadsell. Progressive neural networks. Preprint arXiv:1606.04671, 2016.
Victor Sanh, Albert Webson, Colin Raffel, Stephen Bach, Lintang Sutawika, Zaid Alyafeai, Antoine Chaffin, Arnaud Stiegler, Arun Raja, Manan Dey, M Saiful Bari, Canwen Xu, Urmish Thakker, Shanya Sharma Sharma, Eliza Szczechla, Taewoon Kim, Gunjan Chhablani, Nihal Nayak, Debajyoti Datta, Jonathan Chang, Mike Tian-Jian Jiang, Han Wang, Matteo Manica, Sheng Shen, Zheng Xin Yong, Harshit Pandey, Rachel Bawden, Thomas Wang, Trishala Neeraj, Jos Rozen, Abheesht Sharma, Andrea Santilli, Thibault Fevry, Jason Alan Fries, Ryan Teehan, Teven Le Scao, et al. Multitask prompted training enables zero-shot task generalization. In International Conference on Learning Representations, 2022.
Jürgen Schmidhuber. One big net for everything. Preprint arXiv:1802.08864, 2018.
Julian Schrittwieser, Ioannis Antonoglou, Thomas Hubert, Karen Simonyan, Laurent Sifre, Simon Schmitt, Arthur Guez, Edward Lockhart, Demis Hassabis, Thore Graepel, et al. Mastering Atari, Go, chess and shogi by planning with a learned model. Nature, 588(7839):604–609, 2020.
Piyush Sharma, Nan Ding, Sebastian Goodman, and Radu Soricut. Conceptual captions: A cleaned, hypernymed, image alt-text dataset for automatic image captioning. In Annual Meeting of the Association for Computational Linguistics, pp. 2556–2565, 2018.
Noam Shazeer. GLU variants improve transformer. Preprint arXiv:2002.05202, 2020.
H Francis Song, Abbas Abdolmaleki, Jost Tobias Springenberg, Aidan Clark, Hubert Soyer, Jack W Rae, Seb Noury, Arun Ahuja, Siqi Liu, Dhruva Tirumala, et al. V-MPO: On-policy maximum a posteriori policy optimization for discrete and continuous control. In ICLR, 2020.
Nitish Srivastava, Geoffrey Hinton, Alex Krizhevsky, Ilya Sutskever, and Ruslan Salakhutdinov. Dropout: A simple way to prevent neural networks from overfitting. Journal of Machine Learning Research, 15(56):1929–1958, 2014.
Richard Sutton. The bitter lesson. Incomplete Ideas (blog), 13:12, 2019.
Yuval Tassa, Yotam Doron, Alistair Muldal, Tom Erez, Yazhe Li, Diego de Las Casas, David Budden, Abbas Abdolmaleki, Josh Merel, Andrew Lefrancq, et al. DeepMind control suite. Preprint arXiv:1801.00690, 2018.
Romal Thoppilan, Daniel De Freitas, Jamie Hall, Noam Shazeer, Apoorv Kulshreshtha, Heng-Tze Cheng, Alicia Jin, Taylor Bos, Leslie Baker, Yu Du, et al. LaMDA: Language models for dialog applications. Preprint arXiv:2201.08239, 2022.
Emanuel Todorov, Tom Erez, and Yuval Tassa. MuJoCo: A physics engine for model-based control. In International Conference on Intelligent Robots and Systems, pp. 5026–5033, 2012.
Maria Tsimpoukelli, Jacob L Menick, Serkan Cabi, SM Eslami, Oriol Vinyals, and Felix Hill. Multimodal few-shot learning with frozen language models. Advances in Neural Information Processing Systems, pp. 200–212, 2021.
Saran Tunyasuvunakool, Alistair Muldal, Yotam Doron, Siqi Liu, Steven Bohez, Josh Merel, Tom Erez, Timothy Lillicrap, Nicolas Heess, and Yuval Tassa. dm_control: Software and tasks for continuous control. Software Impacts, 6:100022, 2020.
Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N Gomez, Łukasz Kaiser, and Illia Polosukhin. Attention is all you need. Advances in Neural Information Processing Systems, 30, 2017.
Zirui Wang, Jiahui Yu, Adams Wei Yu, Zihang Dai, Yulia Tsvetkov, and Yuan Cao. SimVLM: Simple visual language model pretraining with weak supervision. Preprint arXiv:2108.10904, 2021.
Ziyu Wang, Alexander Novikov, Konrad Zolna, Josh S Merel, Jost Tobias Springenberg, Scott E Reed, Bobak Shahriari, Noah Siegel, Caglar Gulcehre, Nicolas Heess, et al. Critic regularized regression. Advances in Neural Information Processing Systems, 33:7768–7778, 2020.
Jason Wei, Maarten Bosma, Vincent Y Zhao, Kelvin Guu, Adams Wei Yu, Brian Lester, Nan Du, Andrew M Dai, and Quoc V Le. Finetuned language models are zero-shot learners. Preprint arXiv:2109.01652, 2021.
Laura Weidinger, John Mellor, Maribeth Rauh, Conor Griffin, Jonathan Uesato, Po-Sen Huang, Myra Cheng, Mia Glaese, Borja Balle, Atoosa Kasirzadeh, et al. Ethical and social risks of harm from language models. Preprint arXiv:2112.04359, 2021.
Yuxin Wu and Kaiming He. Group normalization. In European Conference on Computer Vision, pp. 3–19, 2018.
Tianhe Yu, Deirdre Quillen, Zhanpeng He, Ryan Julian, Karol Hausman, Chelsea Finn, and Sergey Levine. Meta-World: A benchmark and evaluation for multi-task and meta reinforcement learning. In Conference on Robot Learning, pp. 1094–1100, 2020.
Qinqing Zheng, Amy Zhang, and Aditya Grover. Online decision transformer. Preprint arXiv:2202.05607, 2022.
Konrad Zolna, Alexander Novikov, Ksenia Konyushkova, Caglar Gulcehre, Ziyu Wang, Yusuf Aytar, Misha Denil, Nando de Freitas, and Scott Reed. Offline learning from demonstrations and unlabeled experience. Preprint arXiv:2011.13885, 2020.
Konrad Zolna, Scott Reed, Alexander Novikov, Sergio Gómez Colmenarejo, David Budden, Serkan Cabi, Misha Denil, Nando de Freitas, and Ziyu Wang. Task-relevant adversarial imitation learning. In Conference on Robot Learning, pp. 247–263, 2021.

Supplementary Material

A Model card

We present a model card for Gato in Table 4.

Table 4: Gato Model Card. We follow the framework proposed in (Mitchell et al., 2019).

B Agent Data Tokenization Details

In this section we provide additional details on our tokenization schemes. Our agent data is sequenced as follows:

• Episodes are presented to the agent in order of time (timesteps).
• Timesteps in turn are presented in the following order:
– Observations ($[y_{1:k}, x_{1:m}, z_{1:n}]$) are ordered lexicographically by key, and each item is sequenced as follows:
∗ Text tokens ($y_{1:k}$) are in the same order as the raw input text.
∗ Image patch tokens ($x_{1:m}$) are in raster order.
∗ Tensors ($z_{1:n}$) (such as discrete and continuous observations) are in row-major order.
– Separator ('|'): a designated separator token is provided after observations.
– Actions ($a_{1:A}$) are tokenized as discrete or continuous values and are in row-major order.

A full sequence of tokens is thus given as the concatenation of data from $T$ timesteps:

$$s_{1:L} = \left[\left[y^1_{1:k}, x^1_{1:m}, z^1_{1:n}, \text{'|'}, a^1_{1:A}\right], \ldots, \left[y^T_{1:k}, x^T_{1:m}, z^T_{1:n}, \text{'|'}, a^T_{1:A}\right]\right],$$

where $L = T(k + m + n + 1 + A)$ is the total number of tokens.

Each floating point element of tensors in the observation sequence is mu-law companded as in WaveNet (Oord et al., 2016):

$$F(x) = \operatorname{sgn}(x)\,\frac{\log(|x|\mu + 1.0)}{\log(M\mu + 1.0)}$$

with parameters $\mu = 100$ and $M = 256$. (If the floating-point tensor is in the action set, we do not need to compand the elements in the sequence because actions are only defined in the range [-1, 1] for all our environments.)
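The tokenization of a single floating point value can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the companding form sgn(x) · log(|x|µ + 1) / log(Mµ + 1) is assumed from the stated parameters µ = 100 and M = 256, the function names `mu_law` and `tokenize_value` are hypothetical, and the exact binning rule (equal-width bins, with the value 1.0 folded into the last bin) is also an assumption. The clipping, the 1024 bins, and the offset of 32000 come from the paragraph that follows.

```python
import math

MU = 100.0          # companding parameter from the text
M = 256.0           # scaling parameter from the text
NUM_BINS = 1024     # number of uniform bins on [-1, 1]
TEXT_VOCAB = 32000  # offset so value tokens do not overlap text tokens


def mu_law(x: float, mu: float = MU, m: float = M) -> float:
    """Compand x as sgn(x) * log(|x| * mu + 1) / log(m * mu + 1) (assumed form)."""
    magnitude = math.log(abs(x) * mu + 1.0) / math.log(m * mu + 1.0)
    return math.copysign(magnitude, x)


def tokenize_value(x: float, is_action: bool = False) -> int:
    """Map one float to an integer token in [32000, 33024)."""
    # Actions are already defined on [-1, 1], so they skip companding.
    y = x if is_action else mu_law(x)
    # Clip to [-1, 1].
    y = max(-1.0, min(1.0, y))
    # Assumed binning: equal-width bins, with y == 1.0 folded into the last bin.
    bin_index = min(int((y + 1.0) / 2.0 * NUM_BINS), NUM_BINS - 1)
    return TEXT_VOCAB + bin_index
```

With these assumptions an observation value of 0.0 lands in the middle bin, and any value whose companded magnitude exceeds 1 is clipped into the first or last bin.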
All the elements are subsequently clipped so that they fall in the set [-1, 1]. Finally, they are discretized using bins of uniform width on the domain [-1, 1]. We use 1024 bins and shift the resulting integers so that they do not overlap with the ones used for text tokens. The tokenized result is therefore a sequence of integers in the range [32000, 33024).

See Figure 14 and Figure 15 for visualizations of tokenizing and sequencing values (both discrete and continuous) and images. See Section C for details about the local position encodings referenced in the figures.

C Model Architecture

C.1 Transformer Hyperparameters

The transformer hyperparameters of Gato are presented in Table 5. We also list the hyperparameters of smaller architecture variants used in Section 5.

C.2 Embedding Function

The ResNet block uses the v2 architecture (He et al., 2016b), contains GroupNorm (Wu & He, 2018) with 32 groups instead of LayerNorm (Ba et al., 2016), and uses GELU (Hendrycks & Gimpel, 2016) activation functions instead of ReLU. The block is diagrammed in Figure 16.

C.3 Position Encodings

After tokens are mapped into token embeddings, two position encodings are added to the token embeddings (when applicable) to provide temporal and spatial information to the model. These are described below.

Patch Position Encodings

These position encodings convey information about a patch's global position within the image from which the patch was extracted. First, the relative row and column intervals of the patch are calculated by normalizing the patch's pixel intervals by the image resolution. The normalized row and column intervals are then quantized into a vocabulary size (we use 128) and are used to index a row and column table of learnable position encodings.
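The normalization and quantization just described can be sketched in a few lines of Python. This is only an illustration: the function name `quantize_interval` is hypothetical, and rounding to the nearest integer is an assumption, since the text does not state the rounding rule or how the upper edge of the last patch (which reaches the vocabulary size itself) is handled.

```python
from typing import Tuple

VOCAB_SIZE = 128  # position-encoding vocabulary size stated in the text


def quantize_interval(start_px: int, end_px: int, resolution_px: int,
                      vocab_size: int = VOCAB_SIZE) -> Tuple[int, int]:
    """Normalize a pixel interval by the image resolution, then quantize it."""
    # Relative interval in [0, 1] along one image axis.
    lo = start_px / resolution_px
    hi = end_px / resolution_px
    # Assumed rounding rule: nearest integer. The upper edge is left unclamped.
    return round(lo * vocab_size), round(hi * vocab_size)
```

For a patch covering pixel rows 16 to 32 of a 64-pixel-tall image, `quantize_interval(16, 32, 64)` gives `(32, 64)`. The same function is applied separately to the column interval with the image width as the resolution.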
The method by which the quantized row and column intervals are converted into indices depends on whether we are training or evaluating the model: during training, a random index is sampled uniformly from the quantized interval.

To demonstrate this process more concretely, we provide an example in Figure 17. We follow the process with the patch highlighted in red on the left of the subfigure. The image has resolution 80 × 64 and each patch is 16 × 16, meaning there are 5 × 4 = 20 patches in total. The highlighted patch starts at pixel row interval [16, 32] and pixel column interval [32, 48]. Normalized, the row interval is therefore [0.25, 0.5] and the column interval is [0.4, 0.6]. We then separately quantize the intervals into 128 uniformly spaced bins.

Local Observation Position Encodings

First, we reiterate that, during tokenization, for each timestep all elements of the observation set are tokenized into sequences and concatenated into an observation sequence. Each token in this observation sequence is given an index that corresponds to the sequence order, i.e. the first token is 0 and the last is the length of the observation sequence minus one. After embedding, for any tokens that were part of an observation set, the corresponding observation token index is used to index an embedding table of learnable position encodings, with one embedding for every possible observation token index (in practice we simply set the table size to a large value such as 512). The position encoding is then added onto the observation token embedding to produce the final token embedding. Note that all action tokens are given the same position encoding regardless of their position in the timestep sequence. This process is illustrated in Figure 18.

D Pretraining Setup

Optimizer: For all models we use the AdamW (Loshchilov & Hutter, 2017) optimizer with a linear warm-up and cosine schedule decay. The linear warm-up lasts for 15,000 steps, starting from a learning rate of 1e-7 and ending at a different maximum learning rate depending on the model (see Table 6). This learning rate is then cosine decayed by a factor of 10x over 1,000,000 steps. The AdamW optimizer has parameters β1 = 0.9, β2 = 0.95 and ϵ = 1e-8. We use a batch size of 512 and a sequence length of 1024 tokens for all models.

Regularization: We train with an AdamW weight decay parameter of 0.1. Additionally, we use stochastic depth (Huang et al., 2016) during pretraining, where each of the transformer sub-layers (i.e. each Multi-Head Attention and Dense Feedforward layer) is skipped with a probability of 0.1.

E Fine-tuning Setup

Optimizer: For all models we use the Adam (Kingma & Ba, 2014) optimizer with a constant learning rate of 1e-5. The Adam optimizer has parameters β1 = 0.9, β2 = 0.95 and ϵ = 1e-8. We use a batch size of 64 and a sequence length of 1024 tokens for all models. We train for 10,000 gradient steps.

Regularization: We use dropout (Srivastava et al., 2014) with a rate of 0.1.

Evaluation: We evaluate the agent every 100 learning steps. Each evaluation reports the average of 10 runs of a given checkpoint. The moving average of 5 such scores is computed (to gather 50 runs together). The final fine-tuning performance is defined as the maximum of these smoothed scores.

Datasets: We generated data for the fine-tuning tasks the same way we did for the other tasks (see Section 3.1 for details). Instead of using all the data for a fine-tuning task, we discarded all but the 2000 best episodes (those leading to the highest returns). The fine-tuning datasets were created in the following way. We randomly took 1000 episodes (out of the 2000 preselected episodes), then a subset of 100 episodes from the selected episodes, then 10, 5, 3, and finally a single episode. We repeated this procedure 3 times to obtain 3 series of cascading subsets for each task.
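The cascading-subset construction can be sketched as follows. The function name `cascading_subsets`, the use of Python's `random.sample`, the seeds, and the integer stand-ins for episodes are all illustrative assumptions rather than the authors' code; only the subset sizes and the three repetitions come from the text.

```python
import random
from typing import Dict, List, Sequence

SUBSET_SIZES = (1000, 100, 10, 5, 3, 1)  # sizes listed in the text


def cascading_subsets(episodes: Sequence, seed: int) -> Dict[int, List]:
    """Draw nested random subsets: each one is sampled from the previous one."""
    rng = random.Random(seed)
    current = list(episodes)
    subsets = {}
    for size in SUBSET_SIZES:
        # Sample without replacement from the previously selected episodes.
        current = rng.sample(current, size)
        subsets[size] = current
    return subsets


# Stand-in for the 2000 highest-return episodes of one task (hypothetical data).
best_episodes = list(range(2000))
# Three independent series per task; the seed values are arbitrary.
series = [cascading_subsets(best_episodes, seed) for seed in range(3)]
```

Because every subset is drawn from the one before it, the single-episode dataset is always contained in the 3-episode dataset, which is contained in the 5-episode dataset, and so on up to the 1000-episode dataset.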
Each subset is used to conduct one fine-tuning experiment, and each is reported on our plots in Section 5.2 as a separate point.

Task settings: We have not altered any of the tasks and used their canonical versions. As 3 out of 4 tasks are open sourced, they do not need further explanation. For the fourth task, DMLab order_of_apples_forage_simple, the goal is to collect apples in the right order, green ones first followed by the gold one.

F Data Collection Details

F.1 Atari

We collect two separate sets of Atari environments. The first (that we refer to as ALE Atari) consists of 51 canonical games from the Arcade Learning Environment (Bellemare et al., 2013). The second (that we refer to as ALE Atari Extended) is a set of alternative games with their game mode and difficulty randomly set at the beginning of each episode.

For each environment in these sets we collect data by training a Muesli (Hessel et al., 2021) agent for 200M total environment steps. We record approximately 20,000 random episodes generated by the agent during training.

F.2 Sokoban

Sokoban is a planning problem (Racanière et al., 2017) in which the agent has to push boxes to target locations. Some of the moves are irreversible and consequently mistakes can render the puzzle unsolvable. Planning ahead of time is therefore necessary to succeed at this puzzle. We use a Muesli (Hessel et al., 2021) agent to collect training data.

F.3 BabyAI

BabyAI is a gridworld environment whose levels consist of instruction-following tasks that are described by a synthetic language. We generate data for these levels with the built-in BabyAI bot. The bot has access to extra information which is used to execute optimal solutions; see Section C in the appendix of (Chevalier-Boisvert et al., 2018) for more details about the bot. We collect 100,000 episodes for each level.

F.4 DeepMind Control Suite

The DeepMind Control Suite (Tunyasuvunakool et al., 2020; Tassa et al., 2018) is a set of physics-based simulation environments. For each task in the control suite we collect two disjoint sets of data, one using only state features and another using only pixels. We use a D4PG (Barth-Maron et al., 2018) agent to collect data from tasks with state features, and an MPO (Abdolmaleki et al., 2018) based agent to collect data using pixels.

We also collect data for randomized versions of the control suite tasks with a D4PG agent. These versions randomize the actuator gear, joint range, stiffness, and damping, as well as geom size and density. There are two difficulty settings for the randomized versions. The small setting scales values by a random number sampled from the union of intervals [0.9, 0.95] ∪ [1.05, 1.1]. The large setting scales values by a random number sampled from the union of intervals [0.6, 0.8] ∪ [1.2, 1.4].

F.5 DeepMind Lab

DeepMind Lab (Beattie et al., 2016) is a first-person 3D environment designed to teach agents 3D vision from raw pixel inputs with an egocentric viewpoint, navigation, and planning.

We trained an IMPALA (Espeholt et al., 2018) agent jointly on a set of 18 parent DM Lab levels, which generate maps procedurally for each new episode. Data was collected by executing the agent on these 18 levels, as well as on an additional set of 237 levels handcrafted to test a diverse set of skills.

The 18 parent levels are characterized by a high diversity of generated maps. The difference between the levels is rooted in the hyperparameters used in the generation process. These hyperparameters control high-level characteristics such as the types of structures spawned, the difficulty of language instructions, or the presence of specific tools.

In contrast to the parent levels, each of the additional 237 handcrafted levels uses almost the same map, and the main differences between instances of the same level map are aesthetic, such as the colors of walls or the lighting conditions.
The maps are not procedurally generated and were designed to test a diverse set of skills such as walking up stairs or using specific tools. They are similar to the levels presented in Figure 3, Figure 7 and Figure 8 in the aforementioned paper by Beattie et al. (2016).

Additional information on the 18 parent levels (and their relation to the other levels) is given in detail in the NeurIPS Workshop talk A Methodology for RL Environment Research by Daniel Tanis.

In total, we collected data for 255 levels from DeepMind Lab (18 parent levels and 237 handcrafted levels), 254 of which were used while training Gato.

F.6 Procgen Benchmark

Procgen (Cobbe et al., 2020) is a suite of 16 procedurally generated Atari-like environments, which was proposed to benchmark sample efficiency and generalization in reinforcement learning. We collected data while training an R2D2 (Kapturowski et al., 2018) agent on each of the environments. We used the hard difficulty setting for all environments except for maze and heist, which we set to easy.

F.7 Modular RL

Modular RL (Huang et al., 2020) is a collection of MuJoCo (Todorov et al., 2012) based continuous control environments, composed of three sets of variants of the OpenAI Gym (Brockman et al., 2016) Walker2d-v2, Humanoid-v2, and Hopper-v2. Each variant is a morphological modification of the original body: the set of morphologies is generated by enumerating all possible subsets of limbs, and keeping only those sets that a) contain the torso, and b) still form a connected graph. This results in a set of variants with different input and output sizes, as well as different dynamics than the original morphologies. We collected data by training a single morphology-specific D4PG agent on each variant for a total of 140M actor steps; this was done for 30 random seeds per variant.

F.8 DeepMind Manipulation Playground

The DeepMind Manipulation Playground (Zolna et al., 2021) is a suite of MuJoCo-based simulated robot tasks. We collect data for 4 of the Jaco tasks (box, stack banana, insertion, and slide) using a Critic-Regularized Regression (CRR) agent (Wang et al., 2020). The collected data includes the MuJoCo physics state, which we use for training and evaluating Gato.

F.9 Meta-World

Meta-World (Yu et al., 2020) is a suite of environments for benchmarking meta-reinforcement learning and multi-task learning. We collect data from all train and test tasks in the MT50 mode by training an MPO agent (Abdolmaleki et al., 2018) with unlimited environment seeds and with access to the state of the MuJoCo physics engine. The collected data also contains the MuJoCo physics engine state.

G Real Robotics Evaluation Details

In the real world, control is asynchronous; physics does not wait for computations to finish. Thus, inference latency is a concern when evaluating a large model on real-world tasks. In robotics, a fast control rate is thought to be critical for reacting to dynamic phenomena. The robot setup for RGB stacking has a 20 Hz control rate (0.05 second timestep) by design. In order to reach an acceptable margin of latency, we modified inference at evaluation time by shortening the context length to 1. We also implemented a parallel sampling scheme in which all the action tokens are zeroed out in the input sequences during training, so that we can sample all tokens corresponding to one robot action in a single model inference step rather than autoregressively.

We use the sparse reward function described in Lee et al. (2021) for data filtering. We only select trajectories with final task success, that is, a sparse reward of 1 on the final timestep.

H Skill Mastery Architecture

The numbers reported for the Skill Mastery benchmark were collected by executing a model zero-shot that used an earlier version of the Gato architecture.
Instead of the ResNet patch embedding, a similar architecture using a local transformer was used to embed image patch tokens. The local position embeddings and patch position embeddings were not used. These changes were implemented and found to improve Gato's performance after the pretraining data was changed (as we decided to focus on Skill Generalization instead of the Skill Mastery challenge), which is why they are presented as the final architecture of our full model.

I Additional robotics ablations

We conducted a series of ablations in simulation to better understand the effect of diverse pretraining data in the robotics domain (see Figure 19). We included the same baselines as in Section 5.2, selecting the 364M parameter size variant, as well as an additional baseline trained with control suite data only. The DM Control-only agent is superior to the base Gato at zero-shot transfer and with a lot of fine-tuning data, suggesting that Gato may not be using the representations learned from the text-based datasets when adapting to robotics tasks. The same-domain-only agent performs best overall, matching the CRR baseline at 1 fine-tuning episode and outperforming it with more data, suggesting that Gato at the current scale can trade its generalization capacity for data-efficient adaptation.

J Attention visualization

To render the transformer attention weights, we retrieved the cross-attention logits, a tensor with dimension $(H, T, T)$, where $H$ is the number of heads and $T$ is the number of tokens in a sequence. The $(h, i, j)$th entry of this matrix can be interpreted as the amount that head $h$ attends to token $j$ from token $i$. Due to Gato's image tokenization scheme, there are multiple tokens per timestep. Therefore, to render the attention for a particular timestep, we took the sub-matrix that corresponds to that timestep. We then applied a softmax over the rows of this matrix to normalize the relevant values. Because we are only interested in attention to the previous tokens, we excluded the diagonal by setting it to negative infinity before the softmax.

To measure the importance of each patch, we averaged the attention weights over the corresponding column. Because Gato uses a causal transformer, the attention matrix is lower triangular, so the mean was only considered over the sub-column below the diagonal of the matrix. This corresponds to the average attention paid to a particular patch over a whole timestep.

Using this method, we found that the attention maps at the first layer of the transformer are the most interpretable, agreeing with the findings of Abnar & Zuidema (2020). Certain heads clearly track task-specific entities and regions of the image. Figure 20 shows the attention maps for manually selected heads at the first layer for several tasks.

K Detailed results for specialist Meta-World agent

The specialist Meta-World agent described in Section 5.5 achieves an average success rate of 96.6% across all 50 Meta-World tasks. Detailed per-task results are given in Table 7. We evaluated the agent 500 times for each task.

L Per-domain results for Gato

We describe the performance of Gato on simulated control tasks in Section 4.1. In Table 8, we present normalized per-domain results. We evaluated the agent 50 times for each task.

This paper is available on arXiv under the CC by 4.0 Deed (Attribution 4.0 International) license.