DeepMind's Gato Shows How a Single AI Can Learn Everything at Once

Authors: Scott Reed, Konrad Żołna, Emilio Parisotto, Sergio Gómez Colmenarejo, Alexander Novikov, Gabriel Barth-Maron, Mai Giménez, Yury Sulsky, Jackie Kay, Jost Tobias Springenberg, Tom Eccles, Jake Bruce, Ali Razavi, Ashley Edwards, Nicolas Heess, Yutian Chen, Raia Hadsell, Oriol Vinyals, Mahyar Bordbar, Nando de Freitas

Abstract

Inspired by progress in large-scale language modeling, we apply a similar approach towards building a single generalist agent beyond the realm of text outputs. The agent, which we refer to as Gato, works as a multi-modal, multi-task, multi-embodiment generalist policy. The same network with the same weights can play Atari, caption images, chat, stack blocks with a real robot arm and much more, deciding based on its context whether to output text, joint torques, button presses, or other tokens. In this report we describe the model and the data, and document the current capabilities of Gato.

1 Introduction

There are significant benefits to using a single neural sequence model across all tasks. It reduces the need for hand-crafting policy models with appropriate inductive biases for each domain. It increases the amount and diversity of training data, since the sequence model can ingest any data that can be serialized into a flat sequence. Furthermore, its performance continues to improve even at the frontier of data, compute and model scale (Kaplan et al., 2020; Hoffmann et al., 2022). Historically, generic models that are better at leveraging computation have also tended to eventually overtake more specialized, domain-specific approaches (Sutton, 2019).

In this paper, we describe the current iteration of a general-purpose agent which we call Gato, instantiated as a single, large, transformer sequence model. With a single set of weights, Gato can engage in dialogue, caption images, stack blocks with a real robot arm, outperform humans at playing Atari games, navigate in simulated 3D environments, follow instructions, and more.

While no agent can be expected to excel in all imaginable control tasks, especially those far outside of its training distribution, we here test the hypothesis that training an agent which is generally capable on a large number of tasks is possible, and that this general agent can be adapted with little extra data to succeed at an even larger number of tasks. We hypothesize that such an agent can be obtained through scaling data, compute and model parameters, continually broadening the training distribution while maintaining performance, towards covering any task, behavior and embodiment of interest. In this setting, natural language can act as a common grounding across otherwise incompatible embodiments, unlocking combinatorial generalization to new behaviors.

We focus our training at the operating point of model scale that allows real-time control of real-world robots, currently around 1.2B parameters in the case of Gato. As hardware and model architectures improve, this operating point will naturally increase the feasible model size, pushing generalist models higher up the scaling law curve. For simplicity, Gato was trained offline in a purely supervised manner; however, in principle, there is no reason it could not also be trained with either offline or online reinforcement learning (RL).
2 Model

The guiding design principle of Gato is to train on the widest variety of relevant data possible, including diverse modalities such as images, text, proprioception, joint torques, button presses, and other discrete and continuous observations and actions. To enable processing this multi-modal data, we serialize all data into a flat sequence of tokens. In this representation, Gato can be trained and sampled from like a standard large-scale language model. During deployment, sampled tokens are assembled into dialogue responses, captions, button presses, or other actions based on the context. In the following subsections, we describe Gato's tokenization, network architecture, loss function, and deployment.

2.1 Tokenization

There are infinitely many possible ways to transform data into tokens, including directly using the raw underlying byte stream. Below we report the tokenization scheme we found to produce the best results for Gato at the current scale using contemporary hardware and model architectures.

• Text is encoded via SentencePiece (Kudo & Richardson, 2018) with 32000 subwords into the integer range [0, 32000).
• Images are first transformed into sequences of non-overlapping 16 × 16 patches in raster order, as done in ViT (Dosovitskiy et al., 2020). Each pixel in the image patches is then normalized to [−1, 1] and divided by the square root of the patch size (i.e. √16 = 4).
• Discrete values, e.g. Atari button presses, are flattened into sequences of integers in row-major order. The tokenized result is a sequence of integers within the range [0, 1024).
• Continuous values, e.g. proprioceptive inputs or joint torques, are first flattened into sequences of floating point values in row-major order. The values are mu-law encoded to the range [−1, 1] if not already there (see Figure 14 for details), then discretized to 1024 uniform bins.
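To make the continuous-value and pixel rules above concrete, here is a minimal sketch in plain Python. It is an illustration under stated assumptions, not the authors' implementation: the companding constants `MU` and `M`, the 8-bit pixel range, and every function name are hypothetical, since this section only says that values are mu-law encoded into [−1, 1] and then split into 1024 uniform bins.

```python
import math
from typing import List

MU = 100.0       # assumed mu-law parameter (not given in this section)
M = 256.0        # assumed input scale (not given in this section)
NUM_BINS = 1024  # number of uniform bins, as stated in the text


def mu_law_encode(x: float, mu: float = MU, m: float = M) -> float:
    """Compress x with a mu-law style curve, then clip the result to [-1, 1]."""
    y = math.copysign(math.log(abs(x) * mu + 1.0) / math.log(m * mu + 1.0), x)
    return max(-1.0, min(1.0, y))


def discretize(y: float, num_bins: int = NUM_BINS) -> int:
    """Map y in [-1, 1] to a uniform bin index in [0, num_bins)."""
    idx = int((y + 1.0) / 2.0 * num_bins)
    return min(idx, num_bins - 1)  # y == 1.0 falls into the last bin


def tokenize_continuous(values: List[float]) -> List[int]:
    """Tokenize an already flattened list of continuous values."""
    return [discretize(mu_law_encode(v)) for v in values]


def normalize_pixel(p: int, patch_size: int = 16) -> float:
    """Scale an 8-bit pixel (assumed range 0..255) to [-1, 1], then divide by sqrt(patch_size)."""
    return (p / 255.0 * 2.0 - 1.0) / math.sqrt(patch_size)
```

With these assumed constants, a value of zero lands in the middle bin and values at or beyond ±`M` saturate at the first and last bins. Decoding an action at deployment time would invert both steps.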
After converting data into tokens, we use the following canonical sequence ordering.

• Text tokens in the same order as the raw input text.
• Image patch tokens in raster order.
• Tensors in row-major order.
• Nested structures in lexicographical order by key.
• Agent timesteps as observation tokens followed by a separator, then action tokens.
• Agent episodes as timesteps in time order.

Further details on tokenizing agent data are presented in the supplementary material (Section B).

2.2 Embedding input tokens and setting output targets

After tokenization and sequencing, we apply a parameterized embedding function $f(\cdot\,; \theta_e)$ to each token (i.e. it is applied to both observations and actions) to produce the final model input. To enable efficient learning from our multi-modal input sequence $s_{1:L}$, the embedding function performs different operations depending on the modality the token stems from:

• Tokens belonging to text, or to discrete- or continuous-valued observations or actions for any time-step, are embedded via a lookup table into a learned vector embedding space.
• Tokens belonging to image patches for any time-step are embedded using a single ResNet (He et al., 2016a) block to obtain a vector per patch. For image patch token embeddings, we also add a learnable within-image position encoding vector.

We refer to appendix Section C.3 for full details on the embedding function.

As we model the data autoregressively, each token is potentially also a target label given the previous tokens. Text tokens, discrete and continuous values, and actions can be directly set as targets after tokenization. Image tokens and agent non-textual observations are not currently predicted in Gato, although that may be an interesting direction for future work.
Targets for these non-predicted tokens are set to an unused value and their contribution to the loss is masked out.

2.3 Training

Given a sequence of tokens $s_{1:L}$ and parameters $\theta$, we model the data using the chain rule of probability:

$$\log p_\theta(s_1, \ldots, s_L) = \sum_{l=1}^{L} \log p_\theta(s_l \mid s_1, \ldots, s_{l-1}). \quad (1)$$

Let $b$ index a training batch of sequences $\mathcal{B}$. We define a masking function $m$ such that $m(b, l) = 1$ if the token at index $l$ is either from text or from the logged action of an agent, and 0 otherwise. The training loss for a batch $\mathcal{B}$ can then be written as

$$\mathcal{L}(\theta, \mathcal{B}) = -\sum_{b=1}^{|\mathcal{B}|} \sum_{l=1}^{L} m(b, l) \log p_\theta\left(s_l^{(b)} \mid s_1^{(b)}, \ldots, s_{l-1}^{(b)}\right). \quad (2)$$

As described above, Gato's network architecture has two main components: the parameterized embedding function, which transforms tokens to token embeddings, and the sequence model, which outputs a distribution over the next discrete token. We chose a transformer (Vaswani et al., 2017) for simplicity and scalability. Gato uses a 1.2B parameter decoder-only transformer with 24 layers, an embedding size of 2048, and a post-attention feedforward hidden size of 8196 (more details in Section C.1).

Because distinct tasks within a domain can share identical embodiments, observation formats and action specifications, the model sometimes needs further context to disambiguate tasks. Following (Sanh et al., 2022; Wei et al., 2021; Brown et al., 2020), we use prompt conditioning. During training, for 25% of the sequences in each batch, a prompt sequence is prepended, coming from an episode generated by the same source agent on the same task. Half of the prompt sequences are from the end of the episode, acting as a form of goal conditioning for many domains; the other half are uniformly sampled from the episode. During evaluation, the agent can be prompted using a successful demonstration of the desired task, which we do by default in all control results that we present here.

Training of the model is performed on a 16x16 TPU v3 slice for 1M steps with batch size 512 and token sequence length $L = 1024$, which takes about 4 days.
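The masked objective of Section 2.3 can be illustrated with a few lines of plain Python. This is a minimal sketch under stated assumptions rather than the training code: the function names and the nested-list layout (`logits[b][l]` holds the model's scores for position `l` of sequence `b`, computed from the tokens before `l`) are hypothetical, and a real implementation would use an accelerator framework.

```python
import math
from typing import List, Sequence


def log_softmax(logits: Sequence[float]) -> List[float]:
    """Numerically stable log-softmax over one vocabulary-sized score vector."""
    mx = max(logits)
    log_z = mx + math.log(sum(math.exp(v - mx) for v in logits))
    return [v - log_z for v in logits]


def masked_nll(
    logits: Sequence[Sequence[Sequence[float]]],
    targets: Sequence[Sequence[int]],
    mask: Sequence[Sequence[int]],
) -> float:
    """Sum of -log p(target) over positions where the mask m(b, l) is 1.

    Positions with mask 0 (for example image tokens or non-textual
    observations) contribute nothing to the loss.
    """
    total = 0.0
    for seq_logits, seq_targets, seq_mask in zip(logits, targets, mask):
        for pos_logits, target, m in zip(seq_logits, seq_targets, seq_mask):
            if m:
                total -= log_softmax(pos_logits)[target]
    return total
```

For example, with a uniform distribution over a four-token vocabulary and a single unmasked position, the loss equals log 4, regardless of what the masked positions predict.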
Architecture details can be found in Section C. Because agent episodes and documents can easily contain many more tokens than fit into context, we randomly sample subsequences of $L$ tokens from the available episodes. Each batch mixes subsequences approximately uniformly over domains (e.g. Atari, MassiveWeb, etc.), with some manual upweighting of larger and higher quality datasets (see Table 1 in Section 3 for details).

2.4 Deployment

Deploying Gato as a policy is illustrated in Figure 3. First a prompt, such as a demonstration, is tokenized, forming the initial sequence. By default, we take the first 1024 tokens of the demonstration. Next, the environment yields the first observation, which is tokenized and appended to the sequence. Gato samples the action vector autoregressively, one token at a time. Once all tokens comprising the action vector have been sampled (determined by the action specification of the environment), the action is decoded by inverting the tokenization procedure described in Section 2.1. This action is sent to the environment, which steps and yields a new observation. The procedure repeats. The model always sees all previous observations and actions in its context window of 1024 tokens. We found it beneficial to use transformer XL memory (Dai et al., 2019) during deployment, although it was not used during training.

3 Datasets

Gato is trained on a large number of datasets comprising agent experience in both simulated and real-world environments, as well as a variety of natural language and image datasets. The approximate number of tokens per control dataset is computed assuming the tokenization mechanism described in Section 2.1.
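The deployment procedure of Section 2.4 amounts to a simple control loop, sketched below in plain Python. Everything here is illustrative and hypothetical: `rollout`, `CounterEnv`, the `SEPARATOR` id and the stub sampler are not from the paper, and keeping only the most recent 1024 tokens as context is an assumed simplification of how the window is managed.

```python
from typing import Callable, List, Sequence, Tuple

CONTEXT = 1024   # context window in tokens, as stated in the text
SEPARATOR = -1   # hypothetical id for the observation/action separator token


def rollout(
    prompt_tokens: Sequence[int],
    env,
    tokenize_obs: Callable[[int], List[int]],
    decode_action: Callable[[List[int]], int],
    sample_next_token: Callable[[List[int]], int],
    action_len: int,
    max_steps: int,
) -> List[int]:
    """Run one episode and return the full token sequence that was built."""
    seq = list(prompt_tokens[:CONTEXT])       # first 1024 tokens of the prompt
    obs = env.reset()
    for _ in range(max_steps):
        seq.extend(tokenize_obs(obs))         # append the tokenized observation
        seq.append(SEPARATOR)
        action_tokens: List[int] = []
        for _ in range(action_len):           # sample the action one token at a time
            context = (seq + action_tokens)[-CONTEXT:]
            action_tokens.append(sample_next_token(context))
        seq.extend(action_tokens)
        obs, done = env.step(decode_action(action_tokens))
        if done:
            break
    return seq


class CounterEnv:
    """Toy environment: the observation is a counter that grows by the action."""

    def __init__(self, goal: int = 3) -> None:
        self.goal = goal
        self.state = 0

    def reset(self) -> int:
        self.state = 0
        return self.state

    def step(self, action: int) -> Tuple[int, bool]:
        self.state += action
        return self.state, self.state >= self.goal


def demo() -> List[int]:
    """Drive the toy environment with a stub model that always emits token 1."""
    return rollout(
        prompt_tokens=[7, 7],
        env=CounterEnv(),
        tokenize_obs=lambda obs: [obs],
        decode_action=lambda tokens: tokens[0],
        sample_next_token=lambda context: 1,
        action_len=1,
        max_steps=10,
    )
```

In the toy run, each timestep contributes an observation token, the separator, and one action token, so the returned sequence interleaves them after the two prompt tokens. A real policy would replace the stub sampler with the sequence model and the identity tokenizers with the scheme of Section 2.1.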
3.1 Simulated control tasks

Our control tasks consist of datasets generated by specialist SoTA or near-SoTA reinforcement learning agents trained on a variety of different environments. For each environment, we record a subset of the experience the agent generates (states, actions, and rewards) while it is training.

The simulated environments include Meta-World (Yu et al., 2020), introduced to benchmark meta-reinforcement learning and multi-task learning; Sokoban (Racanière et al., 2017), proposed as a planning problem; BabyAI (Chevalier-Boisvert et al., 2018), for language instruction following in grid-worlds; the DM Control Suite (Tunyasuvunakool et al., 2020), for continuous control; and DM Lab (Beattie et al., 2016), designed to teach agents navigation and 3D vision from raw pixels with an egocentric viewpoint. We also use the Arcade Learning Environment (Bellemare et al., 2013) with classic Atari games (we use two sets of games that we call ALE Atari and ALE Atari Extended, see Section F.1 for details).

We also include the Procgen Benchmark (Cobbe et al., 2020) and Modular RL (Huang et al., 2020), as well as four tasks using a simulated Kinova Jaco arm from DM Manipulation Playground, as introduced in Zolna et al. (2020). Section F includes a more in-depth description of these control tasks, along with the RL agents used to generate the data.

We found it effective to train on a filtered set of episodes with returns of at least 80% of the expert return for the task. The expert return measures the maximum sustained performance that the expert agent can achieve.
We define it as the maximum over the set of all windowed average returns calculated over all the collected episodes for a task:

$$\max_{j \in [0, 1, \ldots, N-W]} \left( \sum_{i=j}^{j+W-1} \frac{R_i}{W} \right)$$

where $N$ is the total number of collected episodes for the task, $W$ is the window size, and $R_i$ is the total return for episode $i$. To obtain accurate estimates, in practice, we set $W$ to be 10% of the total data amount or a minimum of 1000 episodes (i.e. $W = \min(1000, 0.1 \times N)$).

3.2 Vision and language

Gato is trained on MassiveText (Rae et al., 2021), a collection of large English-language text datasets from multiple sources: web pages, books, news articles, and code.

We also included several vision-language datasets in Gato's training. ALIGN (Jia et al., 2021) consists of 1.8B images and their alternative text (alt-text) annotations. LTIP (Long Text & Image Pairs) consists of 312 million images with captions (Alayrac et al., 2022). Conceptual Captions (Sharma et al., 2018) and COCO Captions (Chen et al., 2015) are captioning datasets with 3.3M and 120k image-text pairs respectively. The MultiModal MassiveWeb (M3W) dataset (Alayrac et al., 2022) includes 43M webpages where both text and images were extracted. We also included visual question-answering datasets, in particular OKVQA (Marino et al., 2019) and VQAv2 (Antol et al., 2015), with 9K and 443K triplets of images, questions, and answers. To form a training episode from these, we sample five (image, text) pairs, tokenize them, concatenate, and then pad or randomly crop to the required training sequence length.
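The expert-return filter from Section 3.1 can be sketched as follows. This is a minimal illustration, not the authors' pipeline: the function names are hypothetical, the window size follows the formula exactly as written, W = min(1000, 0.1 × N), and the lower bound of one episode is an added guard for very small datasets.

```python
from typing import List


def expert_return(returns: List[float]) -> float:
    """Maximum windowed average of episode returns, with W = min(1000, 0.1 * N)."""
    n = len(returns)
    if n == 0:
        raise ValueError("need at least one episode")
    w = max(1, min(1000, n // 10))  # the floor of 1 is an assumption, not from the text
    window_sum = sum(returns[:w])   # window starting at j = 0
    best = window_sum
    for j in range(1, n - w + 1):   # slide the window over j = 1 .. N - W
        window_sum += returns[j + w - 1] - returns[j - 1]
        best = max(best, window_sum)
    return best / w


def filter_episodes(returns: List[float], fraction: float = 0.8) -> List[int]:
    """Indices of episodes whose return is at least `fraction` of the expert return."""
    threshold = fraction * expert_return(returns)
    return [i for i, r in enumerate(returns) if r >= threshold]
```

For twenty episodes with returns 0 through 19, the window size is 2, the expert return is the average of the last two episodes, and only the top five episodes pass the 80% threshold.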
3.3 Robotics - RGB Stacking Benchmark (real and sim)

As a testbed for taking physical actions in the real world, we chose the robotic block stacking environment introduced by Lee et al. (2021). The environment consists of a Sawyer robot arm with 3-DoF cartesian velocity control, an additional DoF for velocity, and a discrete gripper action. The robot's workspace contains three plastic blocks colored red, green and blue with varying shapes. The available observations include two 128 × 128 camera images, robot arm and gripper joint angles, as well as the robot's end-effector pose. Notably, ground truth state information for the three objects in the basket is not observed by the agent. Episodes have a fixed length.

We used several sources of training data for these tasks. In Skill Generalization, for both simulation and real, we use data collected by the best generalist sim2real agent from Lee et al. (2021). We collected data only when interacting with the designated RGB-stacking training objects (this amounts to a total of 387k successful trajectories in simulation and 15k trajectories in real). For Skill Mastery, we used data from the best per-group experts from Lee et al. (2021) in simulation and from the best sim2real policy on the real robot (amounting to a total of 219k trajectories). Note that this data is only included for specific Skill Mastery experiments in Section 5.4.

4 Capabilities of the generalist agent

In this section, we summarize the performance of Gato when trained on the data described above. That is, all results across all tasks are derived from a single pretrained model with a single set of weights. Results with fine-tuning will be presented in Section 5.

4.1 Simulated control tasks

Figure 5 shows the number of distinct control tasks for which Gato performs above a given score threshold, relative to expert performance demonstrated in Gato's training data.
We report performance as a percentage, where 100% corresponds to the per-task expert and 0% to a random policy. For each simulated control task we trained our model on, we roll out the Gato policy on the corresponding environment 50 times and average the defined scores. As shown in Figure 5, Gato performs over 450 out of 604 tasks at over a 50% expert score threshold.

In ALE Atari (Bellemare et al., 2013), Gato achieves the average human score (or better) for 23 Atari games, achieving over twice the human score for 11 games. While the single-task online RL agents which generated the data still outperform Gato, this may be overcome by adding capacity or using offline RL training rather than purely supervised training (see Section 5.5, where we present a specialist single-domain ALE Atari agent achieving better than human scores for 44 games).

On BabyAI (Chevalier-Boisvert et al., 2018), Gato achieves over 80% of expert score for nearly all levels. For the most difficult task, called BossLevel, Gato scores 75%. The two other published baselines we could find, BabyAI 1.0 and BabyAI 1.1 (Hui et al., 2020), scored 77% and 90%, respectively, having trained on this single task alone using a million demonstrations.

On Meta-World (Yu et al., 2020), Gato achieves more than 50% on 44 of the 45 tasks that we trained on, over 80% on 35 tasks, and over 90% on 3 tasks. On the canonical DM Control Suite (Tassa et al., 2018), Gato achieves better than 50% of the expert score on 21 out of 30 tasks from state, and more than 80% on 18 tasks.

4.2 Robotics

First-person teleoperation enables the collection of expert demonstrations. However, such demonstrations are slow and costly to collect. Data-efficient behavior cloning methods are therefore desirable for training a generalist robot manipulator, and offline pretraining is thus a well-motivated area of research. To that end, we evaluated Gato on the established RGB Stacking benchmark for robotics.
Skill Generalization Performance

The Skill Generalization challenge from the RGB Stacking robotics benchmark tests the agent's ability to stack objects of previously unseen shapes. The agent is trained on a dataset consisting of episodes of the robot stacking objects with a variety of different shapes. Five triplets of object shapes are, however, not included in the training data and serve as test triplets. We evaluated the trained generalist for 200 episodes per test triplet on the real robot. Table 2 shows that our generalist agent's success rate on each test triplet is comparable to the single-task BC-IMP (filtered BC) baseline in Lee et al. (2021).

4.3 Text samples

The model demonstrates rudimentary dialogue and image captioning capabilities. Figure 6 contains a representative sample of Gato's image captioning performance. Figure 7 shows some hand-picked examples of plain text dialogue exchange.

5 Analysis

5.1 Scaling laws analysis

In Figure 8, we analyze the aggregate in-distribution performance of the pretrained model as a function of the number of parameters, in order to get insight into how performance could improve with increased model capacity. We evaluated 3 different model sizes (measured in parameter count): a 79M model, a 364M model, and a 1.18B model (Gato). We refer to Section C for details on the three model architectures.

Here, for all three model sizes, we plot the normalized return as training progresses. To get this single value, for each task we calculate the performance of the model as a percentage of expert score (the same as done in Section 4.1). Then, for each domain listed in Table 1, we average the percentage scores across all tasks for that domain. Finally, we mean-aggregate the percentage scores across all domains. We can see that for an equivalent token count, there is a significant performance improvement with increased scale.
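The two-level averaging described in Section 5.1 can be sketched in a few lines of Python. The function names and the input layout are hypothetical, and the linear rescaling between the random-policy score and the expert score is an assumption: the text only states that 0% corresponds to a random policy and 100% to the per-task expert.

```python
from typing import Dict


def percent_of_expert(score: float, random_score: float, expert_score: float) -> float:
    """Rescale a raw score so that a random policy maps to 0 and the expert to 100."""
    return 100.0 * (score - random_score) / (expert_score - random_score)


def aggregate(scores_by_domain: Dict[str, Dict[str, float]]) -> float:
    """Average percentage scores per domain first, then average across domains."""
    domain_means = [
        sum(tasks.values()) / len(tasks) for tasks in scores_by_domain.values()
    ]
    return sum(domain_means) / len(domain_means)
```

Averaging per domain first means a domain with many tasks does not dominate the final number. With two tasks at 100% and 50% in one domain and a single task at 25% in another, the aggregate is 50%, whereas a flat mean over the three tasks would give about 58%.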
5.2 Out of distribution tasks

In this section we want to answer the following question: *Can our agent be used to solve a completely new task efficiently?* For this reason, we held out all data for four tasks from our pre-training set: cartpole.swingup (DM Control Suite domain), assembly-v2 (Meta-World domain), order_of_apples_forage_simple (DM Lab domain), and boxing (ALE Atari domain). These four tasks will serve as testbeds for evaluating the out-of-distribution capabilities of Gato.

Ideally, the agent could learn to adapt to a new task via conditioning on a prompt including demonstrations of desired behaviour. However, due to accelerator memory constraints and the extremely long sequence lengths of tokenized demonstrations, the maximum context length possible does not allow the agent to attend over an informative-enough context. Therefore, to adapt the agent to new tasks or behaviours, we choose to fine-tune the agent's parameters on a limited number of demonstrations of a single task, and then evaluate the fine-tuned model's performance in the environment. Fine-tuning is very similar to pretraining with minor changes, such as a different learning rate schedule; see Section E for details.

We want to measure how the choice of data used during pretraining influences post-fine-tuning performance. To this end, we compare Gato (trained on *all data*) to variants trained on ablated datasets:

1. A model pretrained only on data from the same domain as the task to be fine-tuned on, *same domain only data*.
2. A model pretrained only on non-control data, *no control data*.
3. A model fine-tuned from scratch, i.e. with no pretraining at all, *scratch*.

Considering that all these experiments require training a new model from scratch and then also fine-tuning, we present results using the less compute-intensive 364M parameter architecture described in Section 5.1. Results are shown in Figure 9.
Fine-tuning performance on the cartpole.swingup and assembly-v2 tasks, neither of which requires image processing, shows similar trends. Pretraining on all the datasets yields the best results, followed by pretraining on the same domain only. This difference is smaller for assembly-v2 but consistent for all few-shot datasets. For these non-image-based environments, we see either no benefit (cartpole.swingup) or even negative transfer (assembly-v2) when pretraining on *no control* datasets, which contain only images and text data.

Results for DM Lab order_of_apples_forage_simple are slightly different. Pretraining on DM Lab data only is already enough to approach the maximum reward of 19, and hence there is no observable benefit of adding data from different environments. What is different when compared to the previously analysed no-vision environments is that pretraining on *no control* data helps, which can possibly be explained by the fact that agents in the DM Lab environment are fed images which, despite being simulated, are natural looking. Therefore, transfer from image captioning or visually grounded question answering tasks is possible.

We were not able to observe any benefit from pretraining on boxing. The randomly initialized model seems to work better than any of the pretrained variants considered. We hypothesise that this is caused by the game's input images being visually very distinct from the other data, suggesting that transfer is difficult.

5.3 Fine-tuning on Robotic Stacking Tasks

Section 4.2 demonstrates that the base Gato, capable of a diverse array of tasks, can perform competitively on the RGB Stacking Skill Generalization benchmark.
In this section, we would like to answer the following question: *How does our agent improve on robotics tasks when allowed to fine-tune similarly to how we fine-tune on new tasks in Section 5.2?* We consider different model sizes and analyse the impact of pretraining datasets on the Skill Generalization benchmark, as well as on a novel out-of-distribution task.

Skill Generalization

First, we would like to show that fine-tuning on object-specific data, similarly to what was done by Lee et al. (2022), is beneficial. Therefore, we fine-tuned Gato separately on five subsets of demonstrations from the *test* dataset. Each subset was obtained by random partitioning of a test dataset consisting of demonstrations gathered by a generalist sim-to-real agent stacking real test objects. We consider this setting, which is comparable to the fine-tuning baselines on RGB stacking tasks from Lee et al. (2022), and use the 5k dataset that their behavior cloning 5k results are obtained with. To best match their experiments, we change our return filtering scheme during training: instead of using only successful stacks, we condition on the normalized return of the episode.

Figure 10 compares the success rate of Gato across different fine-tuning data regimes to the sim-to-real expert and a Critic-Regularized Regression (CRR) (Wang et al., 2020) agent trained on 35k episodes of all test triplets. Gato, in both reality and simulation (red curves on the left and right figure, respectively), recovers the expert's performance with only 10 episodes, and peaks at 100 or 1000 episodes of fine-tuning data, where it exceeds the expert. After this point (at 5000), performance degrades slightly but does not drop far below the expert's performance.

Fine-tuning and Model Size

To better understand the benefit of large models for few-shot adaptation in robotics domains, we conducted an ablation on model parameter size. This section focuses on in-simulation evaluation.
Figure 10 compares the full 1.18B parameter Gato with the smaller 364M and 79M parameter variants for varying amounts of fine-tuning data. Although the 364M model overfits on one episode, causing performance to drop, there is a clear trend towards better adaptation with fewer episodes as the number of parameters is scaled up. The 79M model performs clearly worse than its bigger counterparts. The results suggest that the model's greater capacity allows it to use representations learned from the diverse training data at test time.

Adaptation to Perceptual Variations

While the Skill Generalization task is an effective benchmark for motor skill generalization to shape variations, it does not test the agent's ability to adapt to perceptual variations and permutations in the objective specification. To further evaluate Gato's generalization capabilities, we devised a new task in the RGB stacking benchmark where the goal is to stack the blue object on the green object, for test triplet 1 (see Figure 11). First, we used a 3D mouse to collect 500 demonstrations of this task on the real robot, for a total of 2 hours and 45 minutes of demonstration data, and fine-tuned Gato on these episodes. Notably, all of the simulated and real robotics data in the pretraining set shows the robot successfully stacking the red object on the blue object, and the data does not include the object shapes in the test set. We found that adding simulated demonstrations of the stack-blue-on-green task to the fine-tuning dataset improved performance, and that 10% was an ideal sampling ratio for this data.

We achieved a final 60% success rate after evaluating fine-tuned Gato on the real robot, while a BC baseline trained from scratch on the blue-on-green data achieved only 0.5% success (1/200 episodes).
Qualitatively, the BC baseline would consistently move towards the blue object and occasionally pick it up and place it on top of the green object, but a full, stable stack was almost never achieved.

5.4 Robotics: Skill Mastery

Similarly to the Skill Generalization challenge discussed in Section 4.2, the Skill Mastery challenge consists in training a robotic arm to stack blocks of different shapes. However, Skill Mastery allows the agent to train on data involving the object shapes used for evaluation, i.e. the *test* set in Skill Generalization becomes a part of the Skill Mastery *training* set. Thus, this challenge serves to measure Gato's performance on in-distribution tasks (possibly with initial conditions not seen in the training demonstrations). Our Skill Mastery results use an earlier version of the Gato architecture, described in Appendix H, with no fine-tuning.

Table 3 compares the group-wise success percentage and the average success across object groups for Gato and the established BC-IMP baseline. Gato exceeds or closely matches BC-IMP's performance on all but one training triplet.

5.5 Specialist single-domain multi-task agents

In this section we show results obtained with two specialist (rather than generalist) agents. Both of them were trained on data from a single domain only and rolled out 500 times for each training task without any per-task fine-tuning.

Meta-World

The first agent uses the smallest architecture introduced in Section 5.1, i.e. 79M parameters, and is trained on all 50 Meta-World tasks. While Gato has access to the state of the MuJoCo physics engine and unlimited task seeds, the agent presented here has no access to any extra features or tasks and uses the canonical API as in (Yu et al., 2020). This experiment is to show that the architecture proposed in our paper can be used to obtain state-of-the-art agents even at small scale. The training procedure was to train single-task experts (Abdolmaleki et al., 2018) on each of the MT-50 tasks individually, recording the trajectories produced while training. This experience is then combined, or distilled, into a single agent, which achieves a 96.6% success rate averaged over all 50 tasks. To the best of our knowledge, this agent is the first one to accomplish nearly 100% average success rate simultaneously (multi-task) for this benchmark. See Table 7 in the supplementary material (Section K) for the full list of tasks and the corresponding success rates of our agent.

ALE Atari

We also trained a specialist agent on all 51 ALE Atari tasks. As the Atari domain is much more challenging than Meta-World, we used the Gato architecture with 1.18B parameters. The resulting agent performs better than the average human for 44 games (see Section 4.1 for details on our evaluation and scoring). We want to note that the performance of the online experts used to generate training data for the other 7 games was also below the average human. Hence, the specialist Atari agent achieved better than human performance for all games where the data contained super-human episodes.

The specialist Atari agent outperforms our generalist agent Gato, which achieved super-human performance on 23 games. This suggests that scaling Gato may result in even better performance. We, however, purposely restricted Gato's size such that it can be run in real time on the real robot.

5.6 Attention Analysis

We rendered the transformer attention weights over the image observations for various tasks, to gain a qualitative sense of how Gato attends to different regions of the image across tasks (see Figure 12). Further details and visualizations for more tasks can be found in Appendix J. These visualizations clearly show that attention tracks the task-relevant objects and regions.

5.7 Embedding Visualization

To understand how Gato encodes information differently per task, we visualized per-task embeddings.
We analysed 11 tasks. For each task, we randomly sample 100 episodes and tokenize each of them. Then, from each episode we take a subsequence of 128 tokens, compute their embeddings (at layer 12, which is half the total depth of the transformer layers) and average them over the sequence. The averaged embeddings for all tasks are used as input to PCA, which reduces their dimensionality to 50. Then, t-SNE is used to get the final 2D embeddings.

Figure 13 shows the final t-SNE embeddings plotted in 2D, colorized by task. Embeddings from the same task are clearly clustered together, and task clusters from the same domain and modality are also located close to each other. Even the held-out task (cartpole.swingup) is clustered correctly and lies next to another task from DM Control Suite Pixels.

6 Related Work

The most closely related architectures to that of Gato are Decision Transformers (Chen et al., 2021b; Reid et al., 2022; Zheng et al., 2022; Furuta et al., 2021) and Trajectory Transformer (Janner et al., 2021), which showed the usefulness of highly generic LM-like architectures for a variety of control problems. Gato also uses an LM-like architecture for control, but with design differences chosen to support multi-modality, multi-embodiment, large scale and general purpose deployment. Pix2Seq (Chen et al., 2022) also uses an LM-based architecture for object detection. Perceiver IO (Jaegle et al., 2021) uses a transformer-derived architecture specialized for very long sequences, to model any modality as a sequence of bytes. This and similar architectures could be used to expand the range of modalities supported by future generalist models.

Gato was inspired by works such as GPT-3 (Brown et al., 2020) and Gopher (Rae et al., 2021), pushing the limits of generalist language models; and more recently the Flamingo (Alayrac et al., 2022) generalist visual language model. Chowdhery et al. (2022) developed the 540B parameter Pathways Language Model (PaLM) explicitly as a generalist learner for hundreds of text tasks.
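Returning briefly to the embedding analysis of Section 5.7, the per-episode pooling step (take a 128-token window, embed each token, average over the sequence) can be illustrated with a minimal sketch that uses only the Python standard library. The names `take_window` and `mean_pool` and the toy vectors are hypothetical and are not taken from the Gato codebase; a real pipeline would read layer-12 activations from the model and then hand the pooled vectors to PCA and t-SNE implementations from a numerical library.

```python
from statistics import fmean
from typing import List, Sequence


def take_window(tokens: Sequence[int], length: int = 128, start: int = 0) -> List[int]:
    """Return a contiguous subsequence of at most `length` tokens."""
    if length <= 0 or start < 0:
        raise ValueError("length must be positive and start non-negative")
    return list(tokens[start:start + length])


def mean_pool(token_embeddings: Sequence[Sequence[float]]) -> List[float]:
    """Average per-token embedding vectors over the sequence axis."""
    if not token_embeddings:
        raise ValueError("need at least one token embedding")
    dim = len(token_embeddings[0])
    if any(len(vec) != dim for vec in token_embeddings):
        raise ValueError("all token embeddings must have the same dimensionality")
    # One mean per embedding dimension, taken across all tokens.
    return [fmean(vec[i] for vec in token_embeddings) for i in range(dim)]


# Toy example: one "episode" of three tokens with 2-dimensional embeddings.
toy_embeddings = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]
print(mean_pool(toy_embeddings))  # [3.0, 4.0]
```

Under these assumptions, each episode yields one pooled vector, so 11 tasks with 100 episodes each would give 1,100 vectors as input to the dimensionality-reduction steps.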
Future work should consider how to unify these text capabilities into one fully generalist agent that can also act in real time in the real world, in diverse environments and embodiments.

Gato also takes inspiration from recent works on multi-embodiment continuous control. Huang et al. (2020) used message passing graph networks to build a single locomotor controller for many simulated 2D walker variants. Kurin et al. (2020) showed that transformers can outperform graph based approaches for incompatible (i.e. varying embodiment) control, despite not encoding any morphological inductive biases. Devin et al. (2017) learn a modular policy for multi-task and multi-robot transfer in simulated 2D manipulation environments. Chen et al. (2018) train a universal policy conditioned on a vector representation of the robot hardware, showing successful transfer both to simulated robot arms and to a real-world Sawyer robot arm.

A variety of earlier generalist models have been developed that, like Gato, operate across highly distinct domains and modalities. NPI (Reed & De Freitas, 2016) trained a single LSTM (Hochreiter & Schmidhuber, 1997) to execute diverse programs such as sorting an array and adding two numbers, such that the network is able to generalize to larger problem instances than those seen during training. Kaiser et al. (2017) developed the MultiModel, which trains jointly on 8 distinct speech, image and text processing tasks including classification, image captioning and translation. Modality-specific encoders were used to process text, images, audio and categorical data, while the rest of the network parameters are shared across tasks. Schmidhuber (2018) proposed "one big net for everything", describing a method for the incremental training of an increasingly general problem solver.
Keskar et al. (2019) proposed controllable multi-task language models that can be directed according to language domain, subdomain, entities, relationships between entities, dates, and task-specific behavior.

In this discussion, it is important to distinguish between one single multi-task network architecture and one single neural network with the same weights for all tasks. Several popular RL agents achieve good multi-task RL results within single domains such as Atari57 and DMLab (Espeholt et al., 2018; Song et al., 2020; Hessel et al., 2019). However, it is much more common to use the same policy architecture and hyper-parameters across tasks while the policy parameters differ in each task (Mnih et al., 2015; Tassa et al., 2018). This is also true of state-of-the-art RL methods applied to board games (Schrittwieser et al., 2020). Moreover, this choice has been adopted by off-line RL benchmarks (Gulcehre et al., 2020; Fu et al., 2020) and recent works on large sequence neural networks for control, including decision transformers (Chen et al., 2021b; Reid et al., 2022; Zheng et al., 2022) and the Trajectory Transformer of Janner et al. (2021). In contrast, in this work we learn a single network with the same weights across a diverse set of tasks.

Recent position papers advocate for highly generalist models, notably Schmidhuber (2018), proposing one big net for everything, and Bommasani et al. (2021) on foundation models. However, to our knowledge there has not yet been reported a single generalist trained on hundreds of vision, language and control tasks using modern transformer networks at scale.

"Single-brain"-style models have interesting connections to neuroscience.
Mountcastle (1978) famously stated that "the processing function of neocortical modules is qualitatively similar in all neocortical regions. Put shortly, there is nothing intrinsically motor about the motor cortex, nor sensory about the sensory cortex". Mountcastle found that columns of neurons in the cortex behave similarly whether associated with vision, hearing or motor control. This has motivated arguments that we may only need one algorithm or model to build intelligence (Hawkins & Blakeslee, 2004).

Sensory substitution provides another argument for a single model (Bach-y Rita & Kercel, 2003). For example, it is possible to build tactile visual aids for blind people as follows. The signal captured by a camera can be sent via an electrode array on the tongue to the brain. The visual cortex learns to process and interpret these tactile signals, endowing the person with some form of "vision". This suggests that, no matter the type of input signal, the same network can process it to useful effect.

Our work is based on deep autoregressive models, which have a long history and can be found in generative models of text, images, video and audio. Combining autoregressive generation with transformers (Vaswani et al., 2017; Devlin et al., 2018) has had an enormous impact in language modelling (Brown et al., 2020; Rae et al., 2021), protein folding (Jumper et al., 2021), vision-language models (Tsimpoukelli et al., 2021; Wang et al., 2021; Alayrac et al., 2022), code generation (Chen et al., 2021c; Li et al., 2022b), dialogue systems with retrieval capabilities (Nakano et al., 2021; Thoppilan et al., 2022), speech recognition (Pratap et al., 2020), neural machine translation (Johnson et al., 2019) and more (Bommasani et al., 2021). Recently, researchers have explored task decomposition and grounding with language models (Huang et al., 2022; Ahn et al., 2022).
Li et al. (2022a) construct a control architecture consisting of a sequence tokenizer, a pretrained language model and a task-specific feed-forward network. They apply it to VirtualHome and BabyAI tasks, and find that the inclusion of the pretrained language model improves generalisation to novel tasks. Similarly, Parisi et al. (2022) demonstrate that vision models pretrained with self-supervised learning, especially crop segmentations and momentum contrast (He et al., 2020), can be effectively incorporated into control policies.

As mentioned earlier, transfer in Atari is challenging. Rusu et al. (2016) researched transfer between randomly selected Atari games. They found that Atari is a difficult domain for transfer because of pronounced differences in the visuals, controls and strategy among the different games. Further difficulties that arise when applying behaviour cloning to video games like Atari are discussed by Kanervisto et al. (2020).

There has been great recent interest in data-driven robotics (Cabi et al., 2019; Chen et al., 2021a). However, Bommasani et al. (2021) note that in robotics "the key stumbling block is collecting the right data. Unlike language and vision data, robotics data is neither plentiful nor representative of a sufficiently diverse array of embodiments, tasks, and environments". Moreover, every time we update the hardware in a robotics lab, we need to collect new data and retrain. We argue that this is precisely why we need a generalist agent that can adapt to new embodiments and learn new tasks with little data.

Generating actions using an autoregressive model can lead to causal "self-delusion" biases when there are confounding variables (Ortega et al., 2021). For example, sampling actions can condition the model to solve the wrong task when multiple tasks share similar observation and action specifications. As explained in Section 2, we use prompt engineering in ambiguous tasks, conditioning our model on a successful demonstration.
This screens off confounding variables, reducing self-delusions. Another solution, which we did not explore in this work, is to use counterfactual teaching, where we train a model online using instantaneous expert feedback.

7 Broader Impact

Although generalist agents are still only an emerging area of research, their potential impact on society calls for a thorough interdisciplinary analysis of their risks and benefits. For the sake of transparency, we document the intended use cases of Gato in the model card in Appendix A. However, the tools for mitigating harms of generalist agents are relatively underdeveloped, and require further research before these agents are deployed.

Since our generalist agent can act as a vision-language model, it inherits similar concerns to those discussed in (Weidinger et al., 2021; Bommasani et al., 2021; Rae et al., 2021; Alayrac et al., 2022). In addition, generalist agents can take actions in the physical world, posing new challenges that may require novel mitigation strategies. For example, physical embodiment could lead to users anthropomorphizing the agent, leading to misplaced trust in the case of a malfunctioning system, or be exploitable by bad actors. Additionally, while cross-domain knowledge transfer is often a goal in ML research, it could create unexpected and undesired outcomes if certain behaviors (e.g. arcade game fighting) are transferred to the wrong context. The ethics and safety considerations of knowledge transfer may require substantial new research as generalist systems advance.

Technical AGI safety (Bostrom, 2017) may also become more challenging when considering generalist agents that operate in many embodiments. For this reason, preference learning, uncertainty modeling and value alignment (Russell, 2019) are especially important for the design of human-compatible generalist agents.
It may be possible to extend some of the value alignment approaches for language (Ouyang et al., 2022; Kenton et al., 2021) to generalist agents. However, even as technical solutions are developed for value alignment, generalist systems could still have negative societal impacts, even with the intervention of well-intentioned designers, due to unforeseen circumstances or limited oversight (Amodei et al., 2016). This limitation underscores the need for a careful design and a deployment process that incorporates multiple disciplines and viewpoints.

Understanding how the models process information, and any emergent capabilities, requires significant experimentation. External retrieval (Borgeaud et al., 2021; Menick et al., 2022; Nakano et al., 2021; Thoppilan et al., 2022) has been shown to improve both interpretability and performance, and hence should be considered in future designs of generalist agents.

Although still at the proof-of-concept stage, the recent progress in generalist models suggests that safety researchers, ethicists, and most importantly, the general public, should consider their risks and benefits. We are not currently deploying Gato to any users, and so anticipate no immediate societal impact. However, given their potential impact, generalist models should be developed thoughtfully and deployed in a way that promotes the health and vitality of humanity.

8 Limitations and Future Work

8.1 RL data collection

Gato is a data-driven approach, as it is derived from imitation learning. While natural language or image datasets are relatively easy to obtain from the web, a web-scale dataset for control tasks is not currently available. That being said, there has already been extensive investigation into this issue.

Offline RL aims at leveraging existing control datasets, and its increasing popularity has already resulted in the availability of more diverse and larger datasets.
Richer environments and simulations are being built (e.g. Metaverse), and increasing numbers of users already interact with them among thousands of already deployed online games (e.g. there exists a large dataset of StarCraft 2 games). Real-life data has also already been stored for ML research purposes; for example, data for training self-driving cars is acquired from recording human driver data. Finally, while Gato uses data consisting of both observations and corresponding actions, the possibility of using large-scale observation-only data to enhance agents has already been studied (Baker et al., 2022). Thanks to online video sharing and streaming platforms such as YouTube and Twitch, observation-only datasets are not significantly more difficult to collect than natural language datasets, motivating a future research direction to extend Gato to learn from web data.

While the previous paragraph focuses on alleviating drawbacks of data collection from RL agents, it is important to note that this approach presents a different set of tradeoffs compared to scraping web data and can actually be more practical in some situations. Once the simulation is set up and a near-SOTA agent is trained, it can be used to generate massive amounts of high-quality data. That is in contrast to web data, which is notorious for its low quality.

In short, we believe that acquiring suitable data is a research question in its own right, and this is an active area of research with growing momentum and importance.

8.2 Prompt and short context

Gato is prompted with an expert demonstration, which helps the agent output actions corresponding to the given task. This is particularly useful since there is otherwise no task identifier available to the agent (in contrast to many multi-task RL setups). Gato infers the relevant task from the observations and actions in the prompt.
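How a demonstration prompt competes with the current episode for a fixed-size context window (1024 tokens for the agent described here) can be sketched as follows. This is an illustration under stated assumptions, not Gato's actual implementation: the function names, the `prompt_budget` parameter and the policy of keeping the tail of the demonstration plus the most recent episode tokens are all hypothetical.

```python
from typing import List, Sequence


def timesteps_that_fit(context_len: int, tokens_per_timestep: int) -> int:
    """Number of whole environment timesteps that fit in the context window."""
    if context_len <= 0 or tokens_per_timestep <= 0:
        raise ValueError("both arguments must be positive")
    return context_len // tokens_per_timestep


def attention_score_entries(context_len: int) -> int:
    """Entries in a full (unmasked) self-attention score matrix: quadratic in length."""
    return context_len * context_len


def build_context(
    prompt_tokens: Sequence[int],
    episode_tokens: Sequence[int],
    context_len: int = 1024,
    prompt_budget: int = 512,  # hypothetical split between prompt and episode
) -> List[int]:
    """Keep the tail of the demonstration, then the most recent episode tokens."""
    if not 0 <= prompt_budget <= context_len:
        raise ValueError("prompt_budget must lie between 0 and context_len")
    prompt = list(prompt_tokens)
    episode = list(episode_tokens)
    prompt_part = prompt[max(0, len(prompt) - prompt_budget):]
    remaining = context_len - len(prompt_part)
    episode_part = episode[max(0, len(episode) - remaining):]
    return prompt_part + episode_part


# With a hypothetical 128 tokens per timestep, only 8 timesteps fit in 1024 tokens.
print(timesteps_that_fit(1024, 128))  # 8
# Doubling the context length quadruples the attention score matrix.
print(attention_score_entries(2048) // attention_score_entries(1024))  # 4
```

In this sketch, every token spent on the demonstration is a token no longer available for the agent's own recent history, which is one way to picture why a short window leaves little room for informative prompts.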
However, the context length of our agent is limited to 1024 tokens, which means the agent sometimes attends to only a few environment timesteps in total. This is especially the case for environments with image observations, where, depending on the resolution, each observation can result in more than one hundred tokens. Due to this limited prompt context, preliminary experiments with different prompt structures resulted in very similar performance. Similarly, early evaluations of the model using prompt-based in-context learning on new environments did not show a significant performance improvement compared to prompt-less evaluation in the same setting.

Context length is therefore a current limitation of our architecture, mainly due to the quadratic scaling of self-attention. Many recently proposed architectures enable a longer context at greater efficiency, and these innovations could potentially improve our agent's performance.

9 Conclusions

Transformer sequence models are effective as multi-task multi-embodiment policies, including for real-world text, vision and robotics tasks. They also show promise in few-shot out-of-distribution task learning. In the future, such models could be used as a default starting point, via prompting or fine-tuning, to learn new behaviors rather than training from scratch.

Given scaling law trends, the performance across all tasks including dialogue will increase with scale in parameters, data and compute. Better hardware and network architectures will allow training bigger models while maintaining real-time robot control capability. By scaling up and iterating on this same basic approach, we can build a useful general-purpose agent.
Acknowledgments

We would like to thank Dan Horgan, Manuel Kroiss, Mantas Pajarskas, and Thibault Sottiaux for their help with data storage infrastructure; Jean-Baptiste Lespiau and Fan Yang for help on concurrent evaluation; Joel Veness for advising on the model design; Koray Kavukcuoglu for helping inspire the project and facilitating feedback; Tom Erez for advising on the agent design and task selection for continuous control; Igor Babuschkin for helping code the initial prototype; Jack Rae for advising on the transformer language model codebase; Thomas Lampe for building robot infrastructure and advising on real robotics experiments; Boxi Wu for input on ethics and safety considerations; Pedro A. Ortega for advice in regard to causality and self-delusion biases.

Author Contributions

Scott Reed developed the project concept, wrote the initial prototype, and led the project overall.
Konrad Żołna led architecture development for vision and text, built infrastructure for tokenization and prompting, and contributed heavily to overall agent development and evaluation.
Emilio Parisotto led work on optimizing the transformer architecture, ran the largest number of experiments, and analyzed scaling law properties and in-distribution agent performance.
Sergio Gómez Colmenarejo was the technical lead, responsible for creating a scalable data loader and evaluator supporting hundreds of tasks at once, and for the initial robot integration with Gato.
Alexander Novikov developed the model including the sampler for the initial prototype, carried out experiments focusing on robotics, and created visualizations.
Gabriel Barth-Maron built scalable storage infrastructure to provide Gato with SoTA-level agent experience in Atari and other domains.
Mai Giménez conducted large scale agent data collection, built substantial data loading infrastructure, and integrated large scale visual-language datasets into the training of Gato.
Yury Sulsky contributed broadly to the Gato codebase, including a bespoke distributed training sequence loader, and led the development of benchmarks for out-of-distribution generalization and the training of competitive baseline agents.
Jackie Kay supported physical robotics infrastructure, conducted numerous evaluations and experiments to analyze the generalization properties of Gato, and contemplated broader ethical impact.
Jost Tobias Springenberg guided Gato's deployment to the physical robot, provided strong existing baselines for block stacking, and advised on model development and experimental design.
Tom Eccles developed the Gato dialogue and image captioning demonstrations, allowing users to easily probe the vision and language capacities of agents in development.
Jake Bruce contributed to agent design as well as control datasets and environments with randomized physics and morphology variations.
Ali Razavi helped in exploring vision architectures.
Ashley Edwards contributed to the first prototype of Gato that worked on Atari, in addition to exploring alternative network architectures and training objectives.
Nicolas Heess advised on agent design, experiment design and task selection, especially for continuous control applications.
Yutian Chen advised on model design and experiments, and provided feedback in regular meetings.
Raia Hadsell advised on the design and planning of robotics efforts.
Oriol Vinyals advised on all aspects of the project, especially model architecture, training strategies and benchmark design.
Mahyar Bordbar was the primary project manager; eliciting key goals, tracking progress, facilitating presentations and feedback, and coordinating resource planning.
Nando de Freitas oversaw the project from its inception.

References

Abbas Abdolmaleki, Jost Tobias Springenberg, Yuval Tassa, Remi Munos, Nicolas Heess, and Martin Riedmiller. Maximum a posteriori policy optimisation. 2018.
Preprint arXiv:1806.06920.
Samira Abnar and Willem Zuidema. Quantifying attention flow in transformers. Preprint arXiv:2005.00928, 2020.
Michael Ahn, Anthony Brohan, Noah Brown, Yevgen Chebotar, Omar Cortes, Byron David, Chelsea Finn, Keerthana Gopalakrishnan, Karol Hausman, Alex Herzog, et al. Do as I can, not as I say: Grounding language in robotic affordances. Preprint arXiv:2204.01691, 2022.
Jean-Baptiste Alayrac, Jeff Donahue, Pauline Luc, Antoine Miech, Iain Barr, Yana Hasson, Karel Lenc, Arthur Mensch, Katie Millican, Malcolm Reynolds, Roman Ring, Eliza Rutherford, Serkan Cabi, Tengda Han, Zhitao Gong, Sina Samangooei, Marianne Monteiro, Jacob Menick, Sebastian Borgeaud, Andy Brock, Aida Nematzadeh, Sahand Sharifzadeh, Mikolaj Binkowski, Ricardo Barreira, Oriol Vinyals, Andrew Zisserman, and Karen Simonyan. Flamingo: a visual language model for few-shot learning. Preprint arXiv:2204.14198, 2022.
Dario Amodei, Chris Olah, Jacob Steinhardt, Paul F. Christiano, John Schulman, and Dan Mané. Concrete problems in AI safety. Preprint arXiv:1606.06565, 2016.
Stanislaw Antol, Aishwarya Agrawal, Jiasen Lu, Margaret Mitchell, Dhruv Batra, C Lawrence Zitnick, and Devi Parikh. VQA: Visual question answering. In International Conference on Computer Vision, pp. 2425–2433, 2015.
Jimmy Lei Ba, Jamie Ryan Kiros, and Geoffrey E Hinton. Layer normalization. Preprint arXiv:1607.06450, 2016.
Paul Bach-y Rita and Stephen W Kercel. Sensory substitution and the human-machine interface. Trends in cognitive sciences, 7(12):541–546, 2003.
Bowen Baker, Ilge Akkaya, Peter Zhokhov, Joost Huizinga, Jie Tang, Adrien Ecoffet, Brandon Houghton, Raul Sampedro, and Jeff Clune. Video pretraining (VPT): Learning to act by watching unlabeled online videos. Preprint arXiv:2206.11795, 2022.
Gabriel Barth-Maron, Matthew W Hoffman, David Budden, Will Dabney, Dan Horgan, Dhruva Tb, Alistair Muldal, Nicolas Heess, and Timothy Lillicrap.
Distributed distributional deterministic policy gradients. Preprint arXiv:1804.08617, 2018.
Charles Beattie, Joel Z Leibo, Denis Teplyashin, Tom Ward, Marcus Wainwright, Heinrich Küttler, Andrew Lefrancq, Simon Green, Víctor Valdés, Amir Sadik, et al. DeepMind lab. Preprint arXiv:1612.03801, 2016.
Marc G Bellemare, Yavar Naddaf, Joel Veness, and Michael Bowling. The arcade learning environment: An evaluation platform for general agents. Journal of Artificial Intelligence Research, 47:253–279, 2013.
Rishi Bommasani, Drew A Hudson, Ehsan Adeli, Russ Altman, Simran Arora, Sydney von Arx, Michael S Bernstein, Jeannette Bohg, Antoine Bosselut, Emma Brunskill, et al. On the opportunities and risks of foundation models. Preprint arXiv:2108.07258, 2021.
Sebastian Borgeaud, Arthur Mensch, Jordan Hoffmann, Trevor Cai, Eliza Rutherford, Katie Millican, George van den Driessche, Jean-Baptiste Lespiau, Bogdan Damoc, Aidan Clark, et al. Improving language models by retrieving from trillions of tokens. Preprint arXiv:2112.04426, 2021.
Nick Bostrom. Superintelligence. Dunod, 2017.
Greg Brockman, Vicki Cheung, Ludwig Pettersson, Jonas Schneider, John Schulman, Jie Tang, and Wojciech Zaremba. OpenAI gym. Preprint arXiv:1606.01540, 2016.
TB Brown, B Mann, N Ryder, M Subbiah, J Kaplan, P Dhariwal, A Neelakantan, P Shyam, G Sastry, A Askell, et al. Language models are few-shot learners. Advances in Neural Information Processing Systems, pp. 1877–1901, 2020.
Serkan Cabi, Sergio Gómez Colmenarejo, Alexander Novikov, Ksenia Konyushkova, Scott Reed, Rae Jeong, Konrad Zolna, Yusuf Aytar, David Budden, Mel Vecerik, et al. Scaling data-driven robotics with reward sketching and batch reinforcement learning. Preprint arXiv:1909.12200, 2019.
Annie S Chen, Suraj Nair, and Chelsea Finn. Learning generalizable robotic reward functions from "in-the-wild" human videos. 2021a.
Preprint arXiv:2103.16817.
Lili Chen, Kevin Lu, Aravind Rajeswaran, Kimin Lee, Aditya Grover, Misha Laskin, Pieter Abbeel, Aravind Srinivas, and Igor Mordatch. Decision transformer: Reinforcement learning via sequence modeling. Advances in Neural Information Processing Systems, 34, 2021b.
Mark Chen, Jerry Tworek, Heewoo Jun, Qiming Yuan, Henrique Ponde de Oliveira Pinto, Jared Kaplan, Harri Edwards, Yuri Burda, Nicholas Joseph, Greg Brockman, et al. Evaluating large language models trained on code. Preprint arXiv:2107.03374, 2021c.
Tao Chen, Adithyavairavan Murali, and Abhinav Gupta. Hardware conditioned policies for multi-robot transfer learning. Advances in Neural Information Processing Systems, 31, 2018.
Ting Chen, Saurabh Saxena, Lala Li, David J Fleet, and Geoffrey Hinton. Pix2seq: A language modeling framework for object detection. In ICLR, 2022.
Xinlei Chen, Hao Fang, Tsung-Yi Lin, Ramakrishna Vedantam, Saurabh Gupta, Piotr Dollár, and C Lawrence Zitnick. Microsoft COCO captions: Data collection and evaluation server. Preprint arXiv:1504.00325, 2015.
Maxime Chevalier-Boisvert, Dzmitry Bahdanau, Salem Lahlou, Lucas Willems, Chitwan Saharia, Thien Huu Nguyen, and Yoshua Bengio. BabyAI: A platform to study the sample efficiency of grounded language learning. Preprint arXiv:1810.08272, 2018.
Aakanksha Chowdhery, Sharan Narang, Jacob Devlin, Maarten Bosma, Gaurav Mishra, Adam Roberts, Paul Barham, Hyung Won Chung, Charles Sutton, Sebastian Gehrmann, et al. PaLM: Scaling language modeling with pathways. Preprint arXiv:2204.02311, 2022.
Karl Cobbe, Chris Hesse, Jacob Hilton, and John Schulman. Leveraging procedural generation to benchmark reinforcement learning. In International Conference on Machine Learning, pp. 2048–2056, 2020.
Zihang Dai, Zhilin Yang, Yiming Yang, Jaime G Carbonell, Quoc Le, and Ruslan Salakhutdinov. Transformer-XL: Attentive language models beyond a fixed-length context. pp. 2978–2988, 2019.
Annual Meeting of the Association for Computational Linguistics.
Coline Devin, Abhishek Gupta, Trevor Darrell, Pieter Abbeel, and Sergey Levine. Learning modular neural network policies for multi-task and multi-robot transfer. In IEEE International Conference on Robotics & Automation, pp. 2169–2176, 2017.
Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. BERT: Pre-training of deep bidirectional transformers for language understanding. Preprint arXiv:1810.04805, 2018.
Alexey Dosovitskiy, Lucas Beyer, Alexander Kolesnikov, Dirk Weissenborn, Xiaohua Zhai, Thomas Unterthiner, Mostafa Dehghani, Matthias Minderer, Georg Heigold, Sylvain Gelly, et al. An image is worth 16x16 words: Transformers for image recognition at scale. Preprint arXiv:2010.11929, 2020.
Lasse Espeholt, Hubert Soyer, Remi Munos, Karen Simonyan, Vlad Mnih, Tom Ward, Yotam Doron, Vlad Firoiu, Tim Harley, Iain Dunning, et al. Impala: Scalable distributed deep-RL with importance weighted actor-learner architectures. In International Conference on Machine Learning, pp. 1407–1416, 2018.
Justin Fu, Aviral Kumar, Ofir Nachum, George Tucker, and Sergey Levine. D4RL: Datasets for deep data-driven reinforcement learning. Preprint arXiv:2004.07219, 2020.
Hiroki Furuta, Yutaka Matsuo, and Shixiang Shane Gu. Generalized decision transformer for offline hindsight information matching. Preprint arXiv:2111.10364, 2021.
Caglar Gulcehre, Ziyu Wang, Alexander Novikov, Thomas Paine, Sergio Gómez, Konrad Zolna, Rishabh Agarwal, Josh S Merel, Daniel J Mankowitz, Cosmin Paduraru, et al. RL unplugged: A suite of benchmarks for offline reinforcement learning. Advances in Neural Information Processing Systems, 33:7248–7259, 2020.
Jeff Hawkins and Sandra Blakeslee. On intelligence. Macmillan, 2004.
Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image recognition. In IEEE Computer Vision and Pattern Recognition, pp. 770–778, 2016a.
Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun.
Identity mappings in deep residual networks. In European Conference on Computer Vision, pp. 630–645, 2016b.
Kaiming He, Haoqi Fan, Yuxin Wu, Saining Xie, and Ross Girshick. Momentum contrast for unsupervised visual representation learning. In IEEE Computer Vision and Pattern Recognition, pp. 9729–9738, 2020.
Dan Hendrycks and Kevin Gimpel. Gaussian error linear units (GELUs). Preprint arXiv:1606.08415, 2016.
Matteo Hessel, Hubert Soyer, Lasse Espeholt, Wojciech Czarnecki, Simon Schmitt, and Hado van Hasselt. Multi-task deep reinforcement learning with PopArt. In AAAI, 2019.
Matteo Hessel, Ivo Danihelka, Fabio Viola, Arthur Guez, Simon Schmitt, Laurent Sifre, Theophane Weber, David Silver, and Hado van Hasselt. Muesli: Combining improvements in policy optimization. Preprint arXiv:2104.06159, 2021.
Sepp Hochreiter and Jürgen Schmidhuber. Long short-term memory. Neural computation, 9(8):1735–1780, 1997.
Jordan Hoffmann, Sebastian Borgeaud, Arthur Mensch, Elena Buchatskaya, Trevor Cai, Eliza Rutherford, Diego de Las Casas, Lisa Anne Hendricks, Johannes Welbl, Aidan Clark, et al. Training compute-optimal large language models. Preprint arXiv:2203.15556, 2022.
Gao Huang, Yu Sun, Zhuang Liu, Daniel Sedra, and Kilian Weinberger. Deep networks with stochastic depth. Preprint arXiv:1603.09382, 2016.
Wenlong Huang, Igor Mordatch, and Deepak Pathak. One policy to control them all: Shared modular policies for agent-agnostic control. In International Conference on Machine Learning, pp. 4455–4464, 2020.
Wenlong Huang, Pieter Abbeel, Deepak Pathak, and Igor Mordatch. Language models as zero-shot planners: Extracting actionable knowledge for embodied agents. Preprint arXiv:2201.07207, 2022.
David Yu-Tung Hui, Maxime Chevalier-Boisvert, Dzmitry Bahdanau, and Yoshua Bengio. BabyAI 1.1. 2020.
Preprint arXiv:2007.12770.
Andrew Jaegle, Sebastian Borgeaud, Jean-Baptiste Alayrac, Carl Doersch, Catalin Ionescu, David Ding, Skanda Koppula, Daniel Zoran, Andrew Brock, Evan Shelhamer, et al. Perceiver IO: A general architecture for structured inputs & outputs. Preprint arXiv:2107.14795, 2021.
Michael Janner, Qiyang Li, and Sergey Levine. Offline reinforcement learning as one big sequence modeling problem. Advances in Neural Information Processing Systems, 34, 2021.
Chao Jia, Yinfei Yang, Ye Xia, Yi-Ting Chen, Zarana Parekh, Hieu Pham, Quoc Le, Yun-Hsuan Sung, Zhen Li, and Tom Duerig. Scaling up visual and vision-language representation learning with noisy text supervision. In International Conference on Machine Learning, pp. 4904–4916, 2021.
Melvin Johnson, Orhan Firat, and Roee Aharoni. Massively multilingual neural machine translation. In Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pp. 3874–3884, 2019.
John Jumper, Richard Evans, Alexander Pritzel, Tim Green, Michael Figurnov, Olaf Ronneberger, Kathryn Tunyasuvunakool, Russ Bates, Augustin Žídek, Anna Potapenko, et al. Highly accurate protein structure prediction with AlphaFold. Nature, 596(7873):583–589, 2021.
Lukasz Kaiser, Aidan N Gomez, Noam Shazeer, Ashish Vaswani, Niki Parmar, Llion Jones, and Jakob Uszkoreit. One model to learn them all. Preprint arXiv:1706.05137, 2017.
Anssi Kanervisto, Joonas Pussinen, and Ville Hautamäki. Benchmarking end-to-end behavioural cloning on video games. In IEEE conference on games (CoG), pp. 558–565, 2020.
Jared Kaplan, Sam McCandlish, Tom Henighan, Tom B Brown, Benjamin Chess, Rewon Child, Scott Gray, Alec Radford, Jeffrey Wu, and Dario Amodei. Scaling laws for neural language models. Preprint arXiv:2001.08361, 2020.
Steven Kapturowski, Georg Ostrovski, John Quan, Remi Munos, and Will Dabney. Recurrent experience replay in distributed reinforcement learning. 2018.
International Conference on Learning Representations.
Zachary Kenton, Tom Everitt, Laura Weidinger, Iason Gabriel, Vladimir Mikulik, and Geoffrey Irving. Alignment of language agents. Preprint arXiv:2103.14659, 2021.
Nitish Shirish Keskar, Bryan McCann, Lav R Varshney, Caiming Xiong, and Richard Socher. CTRL: A conditional transformer language model for controllable generation. Preprint arXiv:1909.05858, 2019.
Diederik P. Kingma and Jimmy Ba. Adam: A method for stochastic optimization. Preprint arXiv:1412.6980, 2014.
Taku Kudo and John Richardson. SentencePiece: A simple and language independent subword tokenizer and detokenizer for neural text processing. In Annual Meeting of the Association for Computational Linguistics, pp. 66–71, 2018.
Vitaly Kurin, Maximilian Igl, Tim Rocktäschel, Wendelin Boehmer, and Shimon Whiteson. My body is a cage: the role of morphology in graph-based incompatible control. Preprint arXiv:2010.01856, 2020.
Alex X Lee, Coline Manon Devin, Yuxiang Zhou, Thomas Lampe, Konstantinos Bousmalis, Jost Tobias Springenberg, Arunkumar Byravan, Abbas Abdolmaleki, Nimrod Gileadi, David Khosid, et al. Beyond pick-and-place: Tackling robotic stacking of diverse shapes. In Conference on Robot Learning, 2021.
Alex X Lee, Coline Manon Devin, Jost Tobias Springenberg, Yuxiang Zhou, Thomas Lampe, Abbas Abdolmaleki, and Konstantinos Bousmalis. How to spend your robot time: Bridging kickstarting and offline reinforcement learning for vision-based robotic manipulation. Preprint arXiv:2205.03353, 2022.
Shuang Li, Xavier Puig, Chris Paxton, Yilun Du, Clinton Wang, Linxi Fan, Tao Chen, De-An Huang, Ekin Akyürek, Anima Anandkumar, Jacob Andreas, Igor Mordatch, Antonio Torralba, and Yuke Zhu. Pre-trained language models for interactive decision-making. Preprint arXiv:2202.01771, 2022a.
Yujia Li, David Choi, Junyoung Chung, Nate Kushman, Julian Schrittwieser, Rémi Leblond, Tom Eccles, James Keeling, Felix Gimeno, Agustin Dal Lago, et al.
Competition-level code generation with AlphaCode. Preprint arXiv:2203.07814, 2022b.
Ilya Loshchilov and Frank Hutter. Decoupled weight decay regularization. Preprint arXiv:1711.05101, 2017.
Kenneth Marino, Mohammad Rastegari, Ali Farhadi, and Roozbeh Mottaghi. Ok-VQA: A visual question answering benchmark requiring external knowledge. In IEEE Computer Vision and Pattern Recognition, pp. 3195–3204, 2019.
Jacob Menick, Maja Trebacz, Vladimir Mikulik, John Aslanides, Francis Song, Martin Chadwick, Mia Glaese, Susannah Young, Lucy Campbell-Gillingham, Geoffrey Irving, et al. Teaching language models to support answers with verified quotes. Preprint arXiv:2203.11147, 2022.
Margaret Mitchell, Simone Wu, Andrew Zaldivar, Parker Barnes, Lucy Vasserman, Ben Hutchinson, Elena Spitzer, Inioluwa Deborah Raji, and Timnit Gebru. Model cards for model reporting. In Proceedings of the conference on fairness, accountability, and transparency, pp. 220–229, 2019.
Volodymyr Mnih, Koray Kavukcuoglu, David Silver, Andrei A Rusu, Joel Veness, Marc G Bellemare, Alex Graves, Martin Riedmiller, Andreas K Fidjeland, Georg Ostrovski, et al. Human-level control through deep reinforcement learning. Nature, 518(7540):529–533, 2015.
Vernon Mountcastle. An organizing principle for cerebral function: the unit module and the distributed system. The mindful brain, 1978.
Reiichiro Nakano, Jacob Hilton, Suchir Balaji, Jeff Wu, Long Ouyang, Christina Kim, Christopher Hesse, Shantanu Jain, Vineet Kosaraju, William Saunders, et al. WebGPT: Browser-assisted question-answering with human feedback. Preprint arXiv:2112.09332, 2021.
Aaron van den Oord, Sander Dieleman, Heiga Zen, Karen Simonyan, Oriol Vinyals, Alex Graves, Nal Kalchbrenner, Andrew Senior, and Koray Kavukcuoglu. WaveNet: A generative model for raw audio. Preprint arXiv:1609.03499, 2016.
Pedro A Ortega, Markus Kunesch, Grégoire Delétang, Tim Genewein, Jordi Grau-Moya, Joel Veness, Jonas Buchli, Jonas Degrave, Bilal Piot, Julien Perolat, et al.
Shaking the foundations: delusions in sequence models for interaction and control. Preprint arXiv:2110.10819, 2021.
Long Ouyang, Jeff Wu, Xu Jiang, Diogo Almeida, Carroll L Wainwright, Pamela Mishkin, Chong Zhang, Sandhini Agarwal, Katarina Slama, Alex Ray, et al. Training language models to follow instructions with human feedback. Preprint arXiv:2203.02155, 2022.
Simone Parisi, Aravind Rajeswaran, Senthil Purushwalkam, and Abhinav Gupta. The unsurprising effectiveness of pre-trained vision models for control. Preprint arXiv:2203.03580, 2022.
Vineel Pratap, Anuroop Sriram, Paden Tomasello, Awni Hannun, Vitaliy Liptchinsky, Gabriel Synnaeve, and Ronan Collobert. Massively multilingual ASR: 50 languages, 1 model, 1 billion parameters. Preprint arXiv:2007.03001, 2020.
Sébastien Racanière, Théophane Weber, David Reichert, Lars Buesing, Arthur Guez, Danilo Jimenez Rezende, Adrià Puigdomènech Badia, Oriol Vinyals, Nicolas Heess, Yujia Li, et al. Imagination-augmented agents for deep reinforcement learning. Advances in Neural Information Processing Systems, 30, 2017.
Jack W Rae, Sebastian Borgeaud, Trevor Cai, Katie Millican, Jordan Hoffmann, Francis Song, John Aslanides, Sarah Henderson, Roman Ring, Susannah Young, et al. Scaling language models: Methods, analysis & insights from training gopher. Preprint arXiv:2112.11446, 2021.
Scott Reed and Nando De Freitas. Neural programmer-interpreters. In International Conference on Learning Representations, 2016.
Machel Reid, Yutaro Yamada, and Shixiang Shane Gu. Can Wikipedia help offline reinforcement learning? Preprint arXiv:2201.12122, 2022.
Stuart Russell. Human compatible: Artificial intelligence and the problem of control. Penguin, 2019.
Andrei A Rusu, Neil C Rabinowitz, Guillaume Desjardins, Hubert Soyer, James Kirkpatrick, Koray Kavukcuoglu, Razvan Pascanu, and Raia Hadsell. Progressive neural networks, 2016.
Preprint arXiv:1606.04671.
Victor Sanh, Albert Webson, Colin Raffel, Stephen Bach, Lintang Sutawika, Zaid Alyafeai, Antoine Chaffin, Arnaud Stiegler, Arun Raja, Manan Dey, M Saiful Bari, Canwen Xu, Urmish Thakker, Shanya Sharma Sharma, Eliza Szczechla, Taewoon Kim, Gunjan Chhablani, Nihal Nayak, Debajyoti Datta, Jonathan Chang, Mike Tian-Jian Jiang, Han Wang, Matteo Manica, Sheng Shen, Zheng Xin Yong, Harshit Pandey, Rachel Bawden, Thomas Wang, Trishala Neeraj, Jos Rozen, Abheesht Sharma, Andrea Santilli, Thibault Fevry, Jason Alan Fries, Ryan Teehan, Teven Le Scao, Stella Biderman, Leo Gao, Thomas Wolf, and Alexander M Rush. Multitask prompted training enables zero-shot task generalization. In International Conference on Learning Representations, 2022.
Jürgen Schmidhuber. One big net for everything. Preprint arXiv:1802.08864, 2018.
Julian Schrittwieser, Ioannis Antonoglou, Thomas Hubert, Karen Simonyan, Laurent Sifre, Simon Schmitt, Arthur Guez, Edward Lockhart, Demis Hassabis, Thore Graepel, et al. Mastering atari, go, chess and shogi by planning with a learned model. Nature, 588(7839):604–609, 2020.
Piyush Sharma, Nan Ding, Sebastian Goodman, and Radu Soricut. Conceptual captions: A cleaned, hypernymed, image alt-text dataset for automatic image captioning. In Annual Meeting of the Association for Computational Linguistics, pp. 2556–2565, 2018.
Noam Shazeer. GLU variants improve transformer. Preprint arXiv:2002.05202, 2020.
H Francis Song, Abbas Abdolmaleki, Jost Tobias Springenberg, Aidan Clark, Hubert Soyer, Jack W Rae, Seb Noury, Arun Ahuja, Siqi Liu, Dhruva Tirumala, et al. V-mpo: On-policy maximum a posteriori policy optimization for discrete and continuous control. In ICLR, 2020.
Nitish Srivastava, Geoffrey Hinton, Alex Krizhevsky, Ilya Sutskever, and Ruslan Salakhutdinov. Dropout: A simple way to prevent neural networks from overfitting. Journal of Machine Learning Research, 15(56):1929–1958, 2014.
Richard Sutton. The bitter lesson, 13:12, 2019.
Incomplete Ideas (blog).
Yuval Tassa, Yotam Doron, Alistair Muldal, Tom Erez, Yazhe Li, Diego de Las Casas, David Budden, Abbas Abdolmaleki, Josh Merel, Andrew Lefrancq, et al. DeepMind control suite. Preprint arXiv:1801.00690, 2018.
Romal Thoppilan, Daniel De Freitas, Jamie Hall, Noam Shazeer, Apoorv Kulshreshtha, Heng-Tze Cheng, Alicia Jin, Taylor Bos, Leslie Baker, Yu Du, et al. LaMDA: Language models for dialog applications. Preprint arXiv:2201.08239, 2022.
Emanuel Todorov, Tom Erez, and Yuval Tassa. MuJoCo: A physics engine for model-based control. In International Conference on Intelligent Robots and Systems, pp. 5026–5033, 2012.
Maria Tsimpoukelli, Jacob L Menick, Serkan Cabi, SM Eslami, Oriol Vinyals, and Felix Hill. Multimodal few-shot learning with frozen language models. Advances in Neural Information Processing Systems, pp. 200–212, 2021.
Saran Tunyasuvunakool, Alistair Muldal, Yotam Doron, Siqi Liu, Steven Bohez, Josh Merel, Tom Erez, Timothy Lillicrap, Nicolas Heess, and Yuval Tassa. dm_control: Software and tasks for continuous control. Software Impacts, 6:100022, 2020.
Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N Gomez, Łukasz Kaiser, and Illia Polosukhin. Attention is all you need. Advances in Neural Information Processing Systems, 30, 2017.
Zirui Wang, Jiahui Yu, Adams Wei Yu, Zihang Dai, Yulia Tsvetkov, and Yuan Cao. Simvlm: Simple visual language model pretraining with weak supervision. Preprint arXiv:2108.10904, 2021.
Ziyu Wang, Alexander Novikov, Konrad Zolna, Josh S Merel, Jost Tobias Springenberg, Scott E Reed, Bobak Shahriari, Noah Siegel, Caglar Gulcehre, Nicolas Heess, et al. Critic regularized regression. Advances in Neural Information Processing Systems, 33:7768–7778, 2020.
Jason Wei, Maarten Bosma, Vincent Y Zhao, Kelvin Guu, Adams Wei Yu, Brian Lester, Nan Du, Andrew M Dai, and Quoc V Le. Finetuned language models are zero-shot learners, 2021.
Preprint arXiv:2109.01652.
Laura Weidinger, John Mellor, Maribeth Rauh, Conor Griffin, Jonathan Uesato, Po-Sen Huang, Myra Cheng, Mia Glaese, Borja Balle, Atoosa Kasirzadeh, et al. Ethical and social risks of harm from language models. Preprint arXiv:2112.04359, 2021.
Yuxin Wu and Kaiming He. Group normalization. In European Conference on Computer Vision, pp. 3–19, 2018.
Tianhe Yu, Deirdre Quillen, Zhanpeng He, Ryan Julian, Karol Hausman, Chelsea Finn, and Sergey Levine. Meta-World: A benchmark and evaluation for multi-task and meta reinforcement learning. In Conference on Robot Learning, pp. 1094–1100, 2020.
Qinqing Zheng, Amy Zhang, and Aditya Grover. Online decision transformer. Preprint arXiv:2202.05607, 2022.
Konrad Zolna, Alexander Novikov, Ksenia Konyushkova, Caglar Gulcehre, Ziyu Wang, Yusuf Aytar, Misha Denil, Nando de Freitas, and Scott Reed. Offline learning from demonstrations and unlabeled experience. Preprint arXiv:2011.13885, 2020.
Konrad Zolna, Scott Reed, Alexander Novikov, Sergio Gómez Colmenarejo, David Budden, Serkan Cabi, Misha Denil, Nando de Freitas, and Ziyu Wang. Task-relevant adversarial imitation learning. In Conference on Robot Learning, pp. 247–263, 2021.

Supplementary Material

A Model card

We present a model card for Gato in Table 4.

Table 4: Gato Model Card. We follow the framework proposed in Mitchell et al. (2019).

B Agent Data Tokenization Details

In this section we provide additional details on our tokenization schemes. Our agent data is sequenced as follows:

• Episodes are presented to the agent in order of time (timesteps).
• Timesteps in turn are presented in the following order:
  – Observations ($[y_{1:k}, x_{1:m}, z_{1:n}]$) are ordered lexicographically by key, and each item is sequenced as follows:
    ∗ Text tokens ($y_{1:k}$) are in the same order as the raw input text.
    ∗ Image patch tokens ($x_{1:m}$) are in raster order.
    ∗ Tensors ($z_{1:n}$), such as discrete and continuous observations, are in row-major order.
  – Separator ('|'): a designated separator token is provided after the observations.
  – Actions ($a_{1:A}$) are tokenized as discrete or continuous values and in row-major order.

A full sequence of tokens is thus given as the concatenation of data from T timesteps:

$$s_{1:L} = \left[\left[y^1_{1:k}, x^1_{1:m}, z^1_{1:n}, \text{'|'}, a^1_{1:A}\right], \ldots, \left[y^T_{1:k}, x^T_{1:m}, z^T_{1:n}, \text{'|'}, a^T_{1:A}\right]\right],$$

where L = T(k + m + n + 1 + A) is the total number of tokens.

Each floating-point element of the tensors in the observation sequence is mu-law companded as in WaveNet (Oord et al., 2016):

$$F(x) = \operatorname{sgn}(x)\,\frac{\log(|x|\mu + 1.0)}{\log(M\mu + 1.0)}$$

with parameters µ = 100 and M = 256. (If the floating-point tensor is in the action set, we do not need to compand the elements in the sequence because actions are only defined in the range [−1, 1] for all our environments.) All the elements are subsequently clipped so that they fall in the set [−1, 1]. Finally, they are discretized using bins of uniform width on the domain [−1, 1]. We use 1024 bins and shift the resulting integers so they do not overlap with the ones used for text tokens. The tokenized result is therefore a sequence of integers within the range [32000, 33024).

See Figure 14 and Figure 15 for visualizations of tokenizing and sequencing values (both discrete and continuous) and images. See Section C for details about the local position encodings referenced in the figures.

C Model Architecture

C.1 Transformer Hyperparameters

The transformer hyperparameters of Gato are presented in Table 5. We also list the hyperparameters of the smaller architecture variants used in Section 5.

C.2 Embedding Function

The ResNet block uses the v2 architecture (He et al., 2016b), contains GroupNorm (Wu & He, 2018) with 32 groups instead of LayerNorm (Ba et al., 2016), and uses GELU (Hendrycks & Gimpel, 2016) activation functions instead of ReLU. The block is diagrammed in Figure 16.
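The continuous-value tokenization described in Section B can be sketched in a few lines of Python. This is a minimal illustration, not the authors' implementation: it assumes the companding function sgn(x) · log(|x|µ + 1) / log(Mµ + 1) with µ = 100 and M = 256, and it assumes a floor-based binning rule in which the value +1.0 is folded into the last bin. The names `mu_law` and `tokenize_continuous` are hypothetical.

```python
import math

MU = 100.0            # companding parameter µ from the text
M = 256.0             # companding parameter M from the text
NUM_BINS = 1024       # number of uniform bins on [-1, 1]
TOKEN_OFFSET = 32000  # shift past the text vocabulary, giving [32000, 33024)


def mu_law(x: float) -> float:
    """Compand x as sgn(x) * log(|x| * mu + 1) / log(M * mu + 1)."""
    magnitude = math.log(abs(x) * MU + 1.0) / math.log(M * MU + 1.0)
    return math.copysign(magnitude, x)


def tokenize_continuous(x: float, is_action: bool = False) -> int:
    """Map one float to an integer token in [32000, 33024)."""
    # Actions are already defined on [-1, 1], so they skip companding.
    y = x if is_action else mu_law(x)
    # Clip to [-1, 1].
    y = max(-1.0, min(1.0, y))
    # Assumed binning rule: uniform-width bins, +1.0 goes into the last bin.
    bin_index = min(int((y + 1.0) / 2.0 * NUM_BINS), NUM_BINS - 1)
    return TOKEN_OFFSET + bin_index
```

Under these assumptions an observation value of 0.0 lands in the middle bin (token 32512), and any value whose magnitude is at least M saturates at the first or last bin.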
C.3 Position Encodings

After tokens are mapped into token embeddings, two position encodings are added to the token embeddings (when applicable) to provide temporal and spatial information to the model. These are described below.

Patch Position Encodings

These position encodings convey information about a patch's global position within the image from which the patch was extracted. First, the relative row and column intervals of the patch are calculated by normalizing the patch's pixel intervals by the image resolution. The normalized row and column intervals are then quantized into a vocabulary size (we use 128) and are used to index a row and a column table of learnable position encodings. The way the quantized row and column intervals are converted into indices depends on whether we are training or evaluating the model: during training a random index is uniformly sampled from the quantized interval, while during evaluation we deterministically take the (rounded) mean of the interval. Once the row and column position encodings are retrieved from the embedding tables, they are added onto the token embedding produced by the ResNet embedding function, as described previously.

To demonstrate this process more concretely, we provide an example in Figure 17. We will follow the process with the patch highlighted in red on the left of the subfigure. The image is of resolution 80 × 64 and each patch is 16 × 16, meaning there are 5 × 4 = 20 patches in total. The highlighted patch starts at pixel row interval [16, 32] and pixel column interval [32, 48]. Normalized, the row interval is therefore [0.25, 0.5] and the column interval is [0.4, 0.6]. We then separately quantize the intervals into 128 uniformly spaced bins, with the resulting quantized row interval being [32, 64] and the quantized column interval being [51, 77].
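The index computation for patch position encodings can be reproduced with a short Python sketch. It is illustrative only and rests on a few assumptions: the highlighted patch is taken to span the 16-pixel-wide column interval [32, 48] (which is what the normalized interval [0.4, 0.6] implies on an 80-pixel axis), interval endpoints are rounded to the nearest bin, training-time sampling includes both endpoints, and indices are clamped to the table size. The function names are hypothetical.

```python
import random
from typing import Optional, Tuple

VOCAB_SIZE = 128  # number of quantization bins per axis, from the text


def quantize_interval(start_px: int, end_px: int, size_px: int) -> Tuple[int, int]:
    """Normalize a pixel interval by the image size, then quantize to bin indices."""
    lo = round(start_px / size_px * VOCAB_SIZE)
    hi = round(end_px / size_px * VOCAB_SIZE)
    return lo, hi


def position_index(interval: Tuple[int, int], training: bool,
                   rng: Optional[random.Random] = None) -> int:
    """Pick the table index: random within the interval in training, mean in evaluation."""
    lo, hi = interval
    if training:
        rng = rng or random.Random()
        index = rng.randint(lo, hi)  # assumption: both endpoints are included
    else:
        index = round((lo + hi) / 2)
    # Assumption: clamp so the index always fits a table with VOCAB_SIZE rows.
    return min(index, VOCAB_SIZE - 1)


rows = quantize_interval(16, 32, 64)  # rows are normalized by the image height 64
cols = quantize_interval(32, 48, 80)  # columns are normalized by the image width 80
eval_row = position_index(rows, training=False)
eval_col = position_index(cols, training=False)
```

With these inputs, `rows` is (32, 64) and `cols` is (51, 77), and the evaluation-time indices are 48 and 64.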
During training, we uniformly sample integers within the quantized row and column intervals, whereas during testing we would use the means, which are index 48 for the row position and index 64 for the column position. The row and column positions are finally used to index separate row and column position encoding tables to produce learnable embeddings, which are added onto the corresponding patch token embedding.

Local Observation Position Encodings

The local observation position encoding adds positional information about where observation tokens are positioned within the local time-step they were an element of. First, we reiterate that, during tokenization, for each time-step all elements of the observation set are tokenized into sequences and concatenated into an observation sequence. Each token in this observation sequence is given an index which corresponds to the sequence order, i.e. the first token is 0 and the last is the length of the observation sequence minus one. After embedding, for any tokens that were a part of an observation set, the corresponding observation token index is used to index an embedding table of learnable position encodings, with one embedding for every possible observation token index (in practice we simply set the table size to a large value like 512). The position encoding is then added onto the observation token embedding to produce the final token embedding. Note that all action tokens are given the same position encoding regardless of their position in the time-step sequence. We illustrate an example of this process in Figure 18.

D Pretraining Setup

Optimizer: For all models we use the AdamW (Loshchilov & Hutter, 2017) optimizer with a linear warm-up and cosine schedule decay. The linear warm-up lasts for 15,000 steps, starting from a learning rate of 1e-7 and ending at a different maximum learning rate depending on the model (see Table 6). This learning rate is then cosine decayed by a factor of 10x over 1,000,000 steps.
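As a rough illustration of the warm-up and cosine decay just described, the schedule might be written as follows. This is a sketch under stated assumptions, not the training code: the decay is assumed to start immediately after warm-up, the rate is assumed to stay at one tenth of the maximum once the decay completes, and `max_lr` is a hypothetical argument standing in for the per-model values listed in Table 6.

```python
import math

WARMUP_STEPS = 15_000    # linear warm-up length from the text
DECAY_STEPS = 1_000_000  # cosine decay length from the text
INIT_LR = 1e-7           # learning rate at step 0


def learning_rate(step: int, max_lr: float) -> float:
    """Linear warm-up to max_lr, then cosine decay down to max_lr / 10."""
    if step < WARMUP_STEPS:
        frac = step / WARMUP_STEPS
        return INIT_LR + frac * (max_lr - INIT_LR)
    # Assumption: decay begins right after warm-up and is held flat afterwards.
    progress = min((step - WARMUP_STEPS) / DECAY_STEPS, 1.0)
    min_lr = max_lr / 10.0
    return min_lr + 0.5 * (max_lr - min_lr) * (1.0 + math.cos(math.pi * progress))
```

For example, with a hypothetical `max_lr` of 1e-4 the rate peaks at step 15,000 and reaches 1e-5 at step 1,015,000.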
The AdamW optimizer has parameters β1 = 0.9, β2 = 0.95 and ϵ = 1e-8. We use a batch size of 512 and a sequence length of 1024 tokens for all models.

Regularization: We train with an AdamW weight decay parameter of 0.1. Additionally, we use stochastic depth (Huang et al., 2016) during pretraining, where each of the transformer sub-layers (i.e. each Multi-Head Attention and Dense Feedforward layer) is skipped with a probability of 0.1.

E Fine-tuning Setup

Optimizer: For all models we use the Adam (Kingma & Ba, 2014) optimizer with a constant learning rate of 1e-5. The Adam optimizer has parameters β1 = 0.9, β2 = 0.95 and ϵ = 1e-8. We use a batch size of 64 and a sequence length of 1024 tokens for all models.

Regularization: We use dropout (Srivastava et al., 2014) with a rate of 0.1.

Evaluation: We evaluate the agent every 100 learning steps. Each evaluation reports the average of 10 runs of a given checkpoint. The moving average of 5 such scores is computed (to gather 50 runs together). The final fine-tuning performance is defined as the maximum of these smoothed scores.

Datasets: We generated data for the fine-tuning tasks the same way we did for the other tasks (see Section 3.1 for details). Instead of using all the data for a fine-tuning task, we discarded all but the 2000 best episodes (those leading to the highest returns). The fine-tuning datasets were created in the following way. We randomly took 1000 episodes (out of the 2000 preselected episodes), then a subset of 100 episodes from the selected episodes, then 10, 5, 3, and finally a single episode. We repeated this procedure 3 times to obtain 3 series of cascading subsets for each task. Each subset is used to conduct one fine-tuning experiment, and each is reported on our plots in Section 5.2 as a separate point.

Task settings: We have not altered any of the tasks and used their canonical versions. As 3 out of the 4 tasks are open-sourced, they do not need further explanation.
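The dataset construction and the evaluation smoothing from Section E can be sketched as below. This is a hedged illustration rather than the authors' pipeline: episodes are represented by arbitrary Python objects, the subset sizes follow the text (1000, 100, 10, 5, 3, 1), and the helper names `cascading_subsets` and `final_score` are hypothetical. The smoothing assumes at least five evaluation scores are available.

```python
import random
from typing import Dict, List, Sequence

SUBSET_SIZES = (1000, 100, 10, 5, 3, 1)  # nested subset sizes from the text


def cascading_subsets(episodes: Sequence, rng: random.Random) -> Dict[int, List]:
    """Draw nested subsets: each smaller subset is sampled from the previous one."""
    subsets = {}
    pool = list(episodes)
    for size in SUBSET_SIZES:
        pool = rng.sample(pool, size)
        subsets[size] = pool
    return subsets


def final_score(eval_scores: Sequence[float], window: int = 5) -> float:
    """Maximum over the moving average of consecutive evaluation scores.

    Each entry of eval_scores is assumed to be the mean of 10 runs already,
    so a window of 5 gathers 50 runs together.
    """
    smoothed = [
        sum(eval_scores[i:i + window]) / window
        for i in range(len(eval_scores) - window + 1)
    ]
    return max(smoothed)


# Three independent series of cascading subsets for one task, as in the text.
series = [cascading_subsets(range(2000), random.Random(seed)) for seed in range(3)]
```

Sampling each subset from the previous one is what makes the series cascade: every episode in the 10-episode set also appears in the 100-episode and 1000-episode sets of the same series.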
For the fourth task, DMLab order_of_apples_forage_simple, the goal is to collect apples in the right order: green ones first, followed by the gold one.

F Data Collection Details

F.1 Atari

We collect two separate sets of Atari environments. The first (that we refer to as ALE Atari) consists of 51 canonical games from the Arcade Learning Environment (Bellemare et al., 2013). The second (that we refer to as ALE Atari Extended) is a set of alternative games with their game mode and difficulty randomly set at the beginning of each episode.

For each environment in these sets we collect data by training a Muesli (Hessel et al., 2021) agent for 200M total environment steps. We record approximately 20,000 random episodes generated by the agent during training.

F.2 Sokoban

Sokoban is a planning problem (Racanière et al., 2017) in which the agent has to push boxes to target locations. Some of the moves are irreversible, and consequently mistakes can render the puzzle unsolvable. Planning ahead of time is therefore necessary to succeed at this puzzle. We use a Muesli (Hessel et al., 2021) agent to collect training data.

F.3 BabyAI

BabyAI is a gridworld environment whose levels consist of instruction-following tasks that are described by a synthetic language. We generate data for these levels with the built-in BabyAI bot. The bot has access to extra information which is used to execute optimal solutions; see Section C in the appendix of Chevalier-Boisvert et al. (2018) for more details about the bot. We collect 100,000 episodes for each level.

F.4 DeepMind Control Suite

The DeepMind Control Suite (Tunyasuvunakool et al., 2020; Tassa et al., 2018) is a set of physics-based simulation environments. For each task in the control suite we collect two disjoint sets of data, one using only state features and another using only pixels.
We use a D4PG (Barth-Maron et al., 2018) agent to collect data from tasks with state features, and an MPO (Abdolmaleki et al., 2018) based agent to collect data using pixels.

We also collect data for randomized versions of the control suite tasks with a D4PG agent. These versions randomize the actuator gear, joint range, stiffness, and damping, and geom size and density. There are two difficulty settings for the randomized versions. The small setting scales values by a random number sampled from the union of intervals [0.9, 0.95] ∪ [1.05, 1.1]. The large setting scales values by a random number sampled from the union of intervals [0.6, 0.8] ∪ [1.2, 1.4].

F.5 DeepMind Lab

DeepMind Lab (Beattie et al., 2016) is a first-person 3D environment designed to teach agents 3D vision from raw pixel inputs with an egocentric viewpoint, navigation, and planning.

We trained an IMPALA (Espeholt et al., 2018) agent jointly on a set of 18 parent DM Lab levels which generate maps procedurally for each new episode. Data was collected by executing the agent on these 18 levels, as well as on an additional set of 237 levels handcrafted to test a diverse set of skills.

The 18 parent levels are characterized by a high diversity of generated maps. The difference between the levels is rooted in the hyperparameters used in the generation process. These hyperparameters control high-level characteristics such as the types of structures spawned, the difficulty of language instructions, or the presence of specific tools. The parent levels were developed to improve the performance of RL agents trained online on them.
In contrast to the parent levels, each of the additional 237 handcrafted levels uses almost the same map, and the main differences between instances of the same level map are aesthetic, such as the colors of walls or the lighting conditions. The maps are not procedurally generated and were designed to test a diverse set of skills such as walking up stairs or using specific tools. They are similar to the levels presented in Figure 3, Figure 7 and Figure 8 of the aforementioned paper by Beattie et al. (2016).

Additional information on the 18 parent levels (and their relation to the other levels) is presented in detail in the NeurIPS Workshop talk A Methodology for RL Environment Research by Daniel Tanis.

In total, we collected data for 255 levels from DeepMind Lab (18 parent levels and 237 handcrafted levels), 254 of which were used while training Gato.

F.6 Procgen Benchmark

Procgen (Cobbe et al., 2020) is a suite of 16 procedurally generated Atari-like environments, which was proposed to benchmark sample efficiency and generalization in reinforcement learning. Data collection was done while training a R2D2 (Kapturowski et al., 2018) agent on each of the environments. We used the hard difficulty setting for all environments except for maze and heist, which we set to easy.

F.7 Modular RL

Modular RL (Huang et al., 2020) is a collection of MuJoCo (Todorov et al., 2012) based continuous control environments, composed of three sets of variants of the OpenAI Gym (Brockman et al., 2016) Walker2d-v2, Humanoid-v2, and Hopper-v2. Each variant is a morphological modification of the original body: the set of morphologies is generated by enumerating all possible subsets of limbs, and keeping only those sets that a) contain the torso, and b) still form a connected graph. This results in a set of variants with different input and output sizes, as well as different dynamics than the original morphologies.
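The morphology enumeration for Modular RL can be illustrated with a small Python sketch. It is not taken from the Modular RL code base: the body is represented as a hypothetical list of limb-to-limb edges, and the functions `is_connected` and `enumerate_morphologies` are made-up names. Whether the original procedure also keeps the torso-only set or the unmodified full body is not stated in the text, so this sketch keeps both.

```python
from itertools import combinations
from typing import Dict, FrozenSet, List, Sequence, Set, Tuple


def is_connected(kept: Set[str], neighbors: Dict[str, Set[str]], start: str) -> bool:
    """Depth-first search restricted to the kept limbs."""
    seen = {start}
    stack = [start]
    while stack:
        node = stack.pop()
        for nxt in neighbors[node] & kept:
            if nxt not in seen:
                seen.add(nxt)
                stack.append(nxt)
    return seen == kept


def enumerate_morphologies(edges: Sequence[Tuple[str, str]],
                           torso: str) -> List[FrozenSet[str]]:
    """Enumerate limb subsets that contain the torso and form a connected graph."""
    limbs = sorted({limb for edge in edges for limb in edge})
    neighbors: Dict[str, Set[str]] = {limb: set() for limb in limbs}
    for a, b in edges:
        neighbors[a].add(b)
        neighbors[b].add(a)
    variants = []
    for size in range(1, len(limbs) + 1):
        for subset in combinations(limbs, size):
            kept = set(subset)
            if torso in kept and is_connected(kept, neighbors, torso):
                variants.append(frozenset(kept))
    return variants


# Hypothetical toy body: torso - thigh - leg, plus torso - arm.
toy_edges = [("torso", "thigh"), ("thigh", "leg"), ("torso", "arm")]
toy_variants = enumerate_morphologies(toy_edges, torso="torso")
```

On this toy body the subset {torso, leg} is rejected because removing the thigh disconnects the leg, which is the role of condition b) in the text.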
We collect data by training a single morphology-specific D4PG agent on each variant for a total of 140M actor steps; this was done for 30 random seeds per variant.

F.8 DeepMind Manipulation Playground

The DeepMind Manipulation Playground (Zolna et al., 2021) is a suite of MuJoCo-based simulated robot tasks. We collect data for 4 of the Jaco tasks (box, stack banana, insertion, and slide) using a Critic-Regularized Regression (CRR) agent (Wang et al., 2020). The collected data includes the MuJoCo physics state, which we use for training and evaluating Gato.

F.9 Meta-World

Meta-World (Yu et al., 2020) is a suite of environments for benchmarking meta-reinforcement learning and multi-task learning. We collect data from all train and test tasks in the MT50 mode by training an MPO agent (Abdolmaleki et al., 2018) with unlimited environment seeds and with access to the state of the MuJoCo physics engine. The collected data also contains the MuJoCo physics engine state.

G Real robotics evaluation details

In the real world, control is asynchronous; physics does not wait for computations to finish. Thus, inference latency is a concern when evaluating a large model on real-world robotics tasks. In robotics, a fast control rate is thought to be critical for reacting to dynamic phenomena. The robot setup for RGB stacking has a 20Hz control rate (0.05 second timestep) by design. In order to reach an acceptable margin of latency, we modified inference at evaluation time by shortening the context length to 1.
We also implemented a parallel sampling scheme in which all the action tokens are zeroed out in the input sequences during training, so that we can sample all tokens corresponding to a robot action in a single inference step.

We use the sparse reward function described in Lee et al. (2021) for data filtering. We only select trajectories with task success; that is, a sparse reward of 1 on the final timestep.

H Skill Mastery architecture

The numbers reported for the Skill Mastery benchmark were collected by executing a model zero-shot that used an earlier version of the Gato architecture. Instead of the ResNet patch embedding, a similar architecture using a local transformer was used to embed image patch tokens. The local position embeddings and patch position embeddings were not used. These changes were implemented and found to improve Gato's performance after the pretraining data was changed (as we decided to focus on Skill Generalization instead of the Skill Mastery challenge), which is why they are presented as the final architecture of our full model.

I Additional robotics ablations

We conducted a series of ablations in simulation to better understand the effect of diverse pretraining data in the robotics domain (see Figure 19). We included the same baselines as in Section 5.2, selecting the 364M parameter size variant, as well as an additional baseline trained with control suite data only. The DM Control-only agent is superior to the base Gato at zero-shot transfer and with a lot of fine-tuning data, suggesting that Gato may not be using the representations learned from the text-based datasets when adapting to robotics tasks.
The same-domain-only agent performs the best overall, matching the CRR baseline at 1 fine-tuning episode and outperforming it with more data, suggesting that Gato at its current scale can trade its generalization capacity for data-efficient and effective few-shot adaptation.

J Attention visualization

To render the transformer attention weights, we retrieved the cross-attention logits, a tensor with dimension (H, T, T), where H is the number of heads and T is the number of tokens in a sequence. The (h, i, j)th entry of this tensor can be interpreted as the amount that head h attends to token j from token i. Due to Gato's image tokenization scheme, there are multiple tokens per timestep. Therefore, to render the attention for a particular timestep, we took the sub-matrix that corresponds to that timestep. We then applied a softmax over the rows of this matrix to normalize the relevant values. Because we are only interested in attention to previous tokens, we excluded the diagonal by setting it to negative infinity before the softmax.

To measure the importance of each patch, we averaged the attention weights over the corresponding column. Because Gato uses a causal transformer, the attention matrix is lower triangular, so the mean was only considered over the sub-column below the diagonal of the matrix. This corresponds to the average attention paid to a particular patch over a whole timestep.

Using this method, we found that the attention maps at the first layer of the transformer are the most interpretable, agreeing with the findings of Abnar & Zuidema (2020). Certain heads clearly track task-specific entities and regions of the image. Figure 20 shows the attention maps for manually selected heads at the first layer for several tasks.

K Detailed results for specialist Meta-World agent

The specialist Meta-World agent described in Section 5.5 achieves a 96.6% success rate averaged over all 50 Meta-World tasks.
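The attention-rendering procedure from Section J can be sketched for a single head in plain Python. This is an illustration under assumptions rather than the authors' code: the input is the square sub-matrix of logits for one timestep, given as a list of lists with row i attending to column j; entries on and above the diagonal are masked here, which assumes the causal mask has not already been applied; and the first row, which has no earlier token, is assigned all-zero weights. The function names are hypothetical.

```python
import math
from typing import List


def attention_weights(logits: List[List[float]]) -> List[List[float]]:
    """Row-wise softmax over strictly earlier tokens (the diagonal is excluded)."""
    n = len(logits)
    weights = []
    for i in range(n):
        # Keep only columns j < i: causal mask plus the excluded diagonal.
        row = [logits[i][j] if j < i else -math.inf for j in range(n)]
        row_max = max(row)
        if row_max == -math.inf:
            # Assumption: the first token has nothing earlier to attend to.
            weights.append([0.0] * n)
            continue
        exps = [math.exp(value - row_max) for value in row]
        total = sum(exps)
        weights.append([e / total for e in exps])
    return weights


def column_importance(weights: List[List[float]]) -> List[float]:
    """Average each column over the rows strictly below the diagonal."""
    n = len(weights)
    scores = []
    for j in range(n):
        below = [weights[i][j] for i in range(j + 1, n)]
        scores.append(sum(below) / len(below) if below else 0.0)
    return scores


uniform = attention_weights([[0.0] * 3 for _ in range(3)])
importance = column_importance(uniform)
```

With all-zero logits on three tokens, the second row puts all of its weight on the first token and the third row splits its weight evenly over the first two, so the first column receives the highest importance.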
The detailed success rates are presented in Table 7. We evaluated the agent 500 times for each task.

L Per-domain results for Gato

We describe the performance of Gato on simulated control tasks in Section 4.1. Per-domain results are given in Table 8. We evaluated the agent 50 times for each task.

This paper is available on arXiv under the CC BY 4.0 Deed (Attribution 4.0 International) license.