DeepMind's Gato Shows How AI Can Learn Everything at Once

Authors: Scott Reed, Konrad Żołna, Emilio Parisotto, Sergio Gómez Colmenarejo, Alexander Novikov, Gabriel Barth-Maron, Mai Giménez, Yury Sulsky, Jackie Kay, Jost Tobias Springenberg, Tom Eccles, Jake Bruce, Ali Razavi, Ashley Edwards, Nicolas Heess, Yutian Chen, Raia Hadsell, Oriol Vinyals, Mahyar Bordbar, Nando de Freitas

Abstract

Inspired by progress in large-scale language modeling, we apply a similar approach towards building a single generalist agent beyond the realm of text outputs. The agent, which we refer to as Gato, works as a multi-modal, multi-task, multi-embodiment generalist policy. The same network with the same weights can play Atari, caption images, chat, stack blocks with a real robot arm and much more, deciding based on its context whether to output text, joint torques, button presses, or other tokens. In this report we describe the model and the data, and document the current capabilities of Gato.

1 Introduction

There are significant benefits to using a single neural sequence model across all tasks. It reduces the need for hand-crafting policy models with appropriate inductive biases for each domain. It increases the amount and diversity of training data, since the sequence model can ingest any data that can be serialized into a flat sequence. Furthermore, its performance continues to improve even at the frontier of data, compute and model scale (Kaplan et al., 2020; Hoffmann et al., 2022). Historically, generic models that are better at leveraging computation have also tended to eventually overtake more specialized domain-specific approaches (Sutton, 2019).
In this paper, we describe the current iteration of a general-purpose agent which we call Gato, instantiated as a single, large, transformer sequence model. With a single set of weights, Gato can engage in dialogue, caption images, stack blocks with a real robot arm, outperform humans at playing Atari games, navigate in simulated 3D environments, follow instructions, and more.

While no agent can be expected to excel in all imaginable control tasks, especially those far outside of its training distribution, we here test the hypothesis that training an agent which is generally capable on a large number of tasks is possible, and that this general agent can be adapted with little extra data to succeed at an even larger number of tasks. We hypothesize that such an agent can be obtained through scaling data, compute and model parameters, continually broadening the training distribution while maintaining performance, towards covering any task, behavior and embodiment of interest. In this setting, natural language can act as a common grounding across otherwise incompatible embodiments, unlocking combinatorial generalization to new behaviors.

We focus our training at the operating point of model scale that allows real-time control of real-world robots, currently around 1.2B parameters in the case of Gato. As hardware and model architectures improve, this operating point will naturally increase the feasible model size, pushing generalist models higher up the scaling law curve. For simplicity Gato was trained offline in a purely supervised manner; in principle, however, there is no reason it could not also be trained with either offline or online reinforcement learning (RL).

2 Model

The guiding design principle of Gato is to train on the widest variety of relevant data possible, including diverse modalities such as images, text, proprioception, joint torques, button presses, and other discrete and continuous observations and actions. To enable processing this multi-modal data, we serialize all data into a flat sequence of tokens. In this representation, Gato can be trained and sampled from akin to a standard large-scale language model.
During deployment, sampled tokens are assembled into dialogue responses, captions, button presses, or other actions based on the context. In the following subsections, we describe Gato's tokenization, network architecture, loss function, and deployment.

2.1 Tokenization

There are infinite possible ways to transform data into tokens, including directly using the raw underlying byte stream. Below we report the tokenization scheme we found to produce the best results for Gato at the current scale using contemporary hardware and model architectures.

• Text is encoded via SentencePiece (Kudo & Richardson, 2018) with 32000 subwords into the integer range [0, 32000).
• Images are first transformed into sequences of non-overlapping 16 × 16 patches in raster order, as done in ViT (Dosovitskiy et al., 2020). Each pixel in the image patches is then normalized between [−1, 1] and divided by the square root of the patch size (i.e. √16 = 4).
• Discrete values, e.g. Atari button presses, are flattened into sequences of integers in row-major order. The tokenized result is a sequence of integers within the range [0, 1024).
• Continuous values, e.g. proprioceptive inputs or joint torques, are first flattened into sequences of floating point values in row-major order. The values are mu-law encoded to the range [−1, 1] if not already there (see Figure 14 for details), then discretized to 1024 uniform bins. The discrete integers are then shifted to the range [32000, 33024).

After converting data into tokens, we use the following canonical sequence ordering.

• Text tokens in the same order as the raw input text.
• Image patch tokens in raster order.
• Tensors in row-major order.
• Nested structures in lexicographical order by key.
• Agent timesteps as observation tokens followed by a separator, then action tokens.
• Agent episodes as timesteps in time order.

Further details on tokenizing agent data are presented in the supplementary material (Section B).

2.2 Embedding input tokens and setting output targets

After tokenization and sequencing, we apply a parameterized embedding function *f*(· ; *θ_e*) to each token (i.e. it is applied to both observations and actions) to produce the final model input.
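To make the continuous-value scheme from Section 2.1 concrete, here is a minimal sketch in plain Python. It assumes the companding step is a mu-law transform of the form sgn(x) · log(|x|μ + 1) / log(Mμ + 1) with μ = 100 and M = 256; those constants, the clipping, the bin-center decoding and all function names are illustrative assumptions, not values stated in this section.

```python
import math

TEXT_VOCAB_SIZE = 32000  # text subwords occupy [0, 32000)
NUM_BINS = 1024          # continuous values use 1024 uniform bins


def mu_law_encode(x, mu=100.0, m=256.0):
    """Compand x towards [-1, 1]; mu and m are assumed constants."""
    return math.copysign(math.log(abs(x) * mu + 1.0) / math.log(m * mu + 1.0), x)


def mu_law_decode(y, mu=100.0, m=256.0):
    """Inverse of mu_law_encode."""
    return math.copysign((math.exp(abs(y) * math.log(m * mu + 1.0)) - 1.0) / mu, y)


def tokenize_continuous(values):
    """Map floats to token ids in [32000, 33024)."""
    tokens = []
    for x in values:
        y = max(-1.0, min(1.0, mu_law_encode(x)))           # clip to [-1, 1]
        bin_index = min(NUM_BINS - 1, int((y + 1.0) / 2.0 * NUM_BINS))
        tokens.append(TEXT_VOCAB_SIZE + bin_index)          # shift past the text range
    return tokens


def detokenize_continuous(tokens):
    """Approximate inverse: take each bin's center, then undo the companding."""
    values = []
    for token in tokens:
        bin_index = token - TEXT_VOCAB_SIZE
        y = (bin_index + 0.5) / NUM_BINS * 2.0 - 1.0
        values.append(mu_law_decode(y))
    return values
```

Under these assumptions, `tokenize_continuous([0.0])` lands in the middle bin, and decoding recovers the input only up to the bin width, which is finest near zero because of the logarithmic companding.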
To enable efficient learning from our multi-modal input sequence *s*_{1:*L*} the embedding function performs different operations depending on the modality the token stems from:

• Tokens belonging to text, discrete- or continuous-valued observations or actions for any time-step are embedded via a lookup table into a learned vector embedding space. Learnable position encodings are added for all tokens based on their local token position within their corresponding time-step.
• Tokens belonging to image patches for any time-step are embedded using a single ResNet (He et al., 2016a) block to obtain a vector per patch. For image patch token embeddings, we also add a learnable within-image position encoding vector.

We refer to appendix Section C.3 for full details on the embedding function.

As we model the data autoregressively, each token is potentially also a target label given the previous tokens. Text tokens, discrete and continuous values, and actions can be directly set as targets after tokenization. Image tokens and agent non-textual observations are not currently predicted in Gato, although that may be an interesting direction for future work. Targets for these non-predicted tokens are set to an unused value and their contribution to the loss is masked out.

2.3 Training

Given a sequence of tokens *s*_{1:*L*} and parameters *θ*, we model the data using the chain rule of probability:

$$\log p_\theta(s_1, \ldots, s_L) = \sum_{l=1}^{L} \log p_\theta(s_l \mid s_1, \ldots, s_{l-1}) \qquad (1)$$

Let *b* index a training batch of sequences *B*. We define a masking function *m* such that *m*(*b, l*) = 1 if the token at index *l* is either from text or from the logged action of an agent, and 0 otherwise. The training loss for a batch *B* can then be written as

$$\mathcal{L}(\theta, B) = -\sum_{b=1}^{|B|} \sum_{l=1}^{L} m(b, l) \, \log p_\theta\left(s_l^{(b)} \mid s_1^{(b)}, \ldots, s_{l-1}^{(b)}\right) \qquad (2)$$

As described above, Gato's network architecture has two main components: the parameterized embedding function which transforms tokens to token embeddings, and the sequence model which outputs a distribution over the next discrete token. While any general sequence model can work for next token prediction, we chose a transformer (Vaswani et al., 2017) for simplicity and scalability. Gato uses a 1.2B parameter decoder-only transformer with 24 layers, an embedding size of 2048, and a
post-attention feedforward hidden size of 8196 (more details in Section C.1).

Because distinct tasks within a domain can share identical embodiments, observation formats and action specifications, the model sometimes needs further context to disambiguate tasks. Rather than providing e.g. one-hot task identifiers, we instead take inspiration from (Sanh et al., 2022; Wei et al., 2021; Brown et al., 2020) and use prompt conditioning. During training, for 25% of the sequences in each batch, a prompt sequence is prepended, coming from an episode generated by the same source agent on the same task. Half of the prompt sequences are from the end of the episode, acting as a form of goal conditioning for many domains; the other half are uniformly sampled from the episode. During evaluation, the agent can be prompted using a successful demonstration of the desired task, which we do by default in all control results that we present here.

Training of the model is performed on a 16x16 TPU v3 slice for 1M steps with batch size 512 and token sequence length *L* = 1024, which takes about 4 days. Architecture details can be found in Section C. Since agent episodes and documents can easily contain many more tokens than fit into context, we randomly sample subsequences of *L* tokens from the available episodes. Each batch mixes subsequences approximately uniformly over domains (e.g. Atari, MassiveWeb, etc.), with some manual upweighting of larger and higher quality datasets (see Table 1 in Section 3 for details).

2.4 Deployment

Deploying Gato as a policy is illustrated in Figure 3. First a prompt, such as a demonstration, is tokenized, forming the initial sequence. By default, we take the first 1024 tokens of the demonstration. Next the environment yields the first observation, which is tokenized and appended to the sequence. Gato samples the action vector autoregressively one token at a time. Once all tokens comprising the action vector have been sampled (determined by the action specification of the environment), the action is decoded by inverting the tokenization procedure described in Section 2.1. This action is sent to the environment, which steps and yields a new observation.
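The deployment steps above can be sketched as a short control loop. Everything concrete here is a stand-in chosen for illustration: `CountingEnv`, `dummy_sampler`, the separator id and the toy action decoding are hypothetical, and cropping the context to the most recent 1024 tokens is an assumed simplification rather than the documented behaviour.

```python
CONTEXT_LENGTH = 1024   # assumed context size, matching the training sequence length
ACTION_OFFSET = 32000   # offset of shifted value tokens (Section 2.1)
SEPARATOR = 33024       # hypothetical id for the observation/action separator


class CountingEnv:
    """Toy stand-in environment: each observation is the number of steps taken."""

    action_length = 2  # number of tokens that make up one action

    def reset(self):
        self.steps = 0
        return self.steps

    def tokenize_observation(self, observation):
        return [observation % 1024]  # discrete values live in [0, 1024)

    def decode_action(self, action_tokens):
        # Toy decoding: only undo the token offset.
        return [token - ACTION_OFFSET for token in action_tokens]

    def step(self, action):
        self.steps += 1
        return self.steps


def run_policy(env, sample_next_token, prompt_tokens, num_steps):
    """Observe, sample the action token by token, decode it, act, repeat."""
    sequence = list(prompt_tokens[:CONTEXT_LENGTH])  # first 1024 prompt tokens
    observation = env.reset()
    actions = []
    for _ in range(num_steps):
        sequence += env.tokenize_observation(observation) + [SEPARATOR]
        action_tokens = []
        for _ in range(env.action_length):  # autoregressive, one token at a time
            context = (sequence + action_tokens)[-CONTEXT_LENGTH:]
            action_tokens.append(sample_next_token(context))
        sequence += action_tokens
        action = env.decode_action(action_tokens)
        actions.append(action)
        observation = env.step(action)
    return actions


def dummy_sampler(context):
    """Placeholder for the trained model: depends only on the context length."""
    return ACTION_OFFSET + len(context) % 1024


actions = run_policy(CountingEnv(), dummy_sampler, prompt_tokens=[], num_steps=3)
```

In a real system `sample_next_token` would run the sequence model and sample from its output distribution; the sketch only fixes the order of operations (tokenize observation, append separator, sample a fixed number of action tokens, decode, step).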
This procedure is repeated. The model always sees all previous observations and actions in its context window of 1024 tokens. We found it beneficial to use transformer XL memory (Dai et al., 2019) during deployment, although it was not used during training.

3 Datasets

Gato is trained on a large number of datasets comprising agent experience in both simulated and real world environments, as well as a variety of natural language and image datasets. The datasets we use and their attributes are listed in Table 1. The approximate number of tokens per control dataset is computed assuming the tokenization mechanism described in Section 2.1.

3.1 Simulated control tasks

Our control tasks consist of datasets generated by specialist SoTA or near-SoTA reinforcement learning agents trained on a variety of different environments. For each environment we record a subset of the experience the agent generates (states, actions, and rewards) while it is training.

The simulated environments include Meta-World (Yu et al., 2020), introduced to benchmark meta-reinforcement learning and multi-task learning; Sokoban (Racanière et al., 2017), proposed as a planning problem; BabyAI (Chevalier-Boisvert et al., 2018), for language instruction following in grid-worlds; the DM Control Suite (Tunyasuvunakool et al., 2020), for continuous control; and DM Lab (Beattie et al., 2016), designed to teach agents navigation and 3D vision from raw pixels with an egocentric viewpoint. We also use the Arcade Learning Environment (Bellemare et al., 2013) with classic Atari games (we use two sets of games that we call ALE Atari and ALE Atari Extended, see Section F.1 for details).

We also include the Procgen Benchmark (Cobbe et al., 2020) and Modular RL (Huang et al., 2020), as well as four tasks using a simulated Kinova Jaco arm from DM Manipulation Playground, as introduced in Zolna et al. (2020). Section F includes a more in-depth description of these control tasks, along with what RL agent was used to generate the data.
We found it effective to train on a filtered set of episodes with returns at least 80% of the expert return for the task. The expert return measures the maximum sustained performance that the expert agent can achieve. We define it as the maximum over the set of all windowed average returns calculated over all the collected episodes for a task:

$$\max_{j \in [0, 1, \ldots, N - W]} \left( \sum_{i=j}^{j+W-1} \frac{R_i}{W} \right)$$

where *N* is the total number of collected episodes for the task, *W* is the window size, and *R_i* is the total return for episode *i*. To obtain accurate estimates, in practice, we set *W* to be 10% of the total data amount or a minimum of 1000 episodes (i.e. *W* = min(1000, 0.1 × *N*)).

3.2 Vision and language

Gato is trained on MassiveText (Rae et al., 2021), a collection of large English-language text datasets from multiple sources: web pages, books, news articles, and code.

We also included several vision-language datasets in Gato's training. ALIGN (Jia et al., 2021) consists of 1.8B images and their alternative text (alt-text) annotations. LTIP (Long Text & Image Pairs) consists of 312 million images with captions (Alayrac et al., 2022). Conceptual Captions (Sharma et al., 2018) and COCO Captions (Chen et al., 2015) are captioning datasets with 3.3M and 120k image-text pairs respectively. The MultiModal MassiveWeb (M3W) dataset (Alayrac et al., 2022) includes 43M webpages where both text and images were extracted. We also included visual question-answering datasets, in particular OKVQA (Marino et al., 2019) and VQAv2 (Antol et al., 2015), with 9K and 443K triplets of images, questions, and answers. To form a training episode from these, we sample five (image, text) pairs, tokenize them, concatenate, and then pad or randomly crop to the required training sequence length.
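Returning to the episode filter of Section 3.1, the expert return (the best windowed mean of episode returns, with window W = min(1000, 0.1 × N)) and the 80% cut-off can be sketched as follows. The function names, the integer rounding of W, the guard that keeps W at least 1 for very small N, and the assumption that returns are non-negative are all illustrative choices, not details given in the text.

```python
def expert_return(returns):
    """Maximum mean return over any window of W consecutive episodes."""
    n = len(returns)
    w = max(1, min(1000, n // 10))        # W = min(1000, 0.1 * N), at least 1
    window_sum = sum(returns[:w])
    best = window_sum
    for j in range(1, n - w + 1):         # slide the window one episode at a time
        window_sum += returns[j + w - 1] - returns[j - 1]
        best = max(best, window_sum)
    return best / w


def filter_episodes(returns, fraction=0.8):
    """Indices of episodes whose return is at least `fraction` of the expert return."""
    threshold = fraction * expert_return(returns)
    return [i for i, r in enumerate(returns) if r >= threshold]
```

For example, with 20 episodes whose returns are 0, 1, ..., 19 the window is 2 episodes wide, the best window is the last one, and only episodes with return of at least 80% of that window's mean are kept.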
3.3 Robotics - RGB Stacking Benchmark (real and sim)

As a testbed for taking physical actions in the real world, we chose the robotic block stacking environment introduced by Lee et al. (2021). The environment consists of a Sawyer robot arm with 3-DoF cartesian velocity control, an additional DoF for velocity, and a discrete gripper action. The robot's workspace contains three plastic blocks colored red, green and blue with varying shapes. The available observations include two 128 × 128 camera images, robot arm and gripper joint angles as well as the robot's end-effector pose. Notably, ground truth state information for the three objects in the basket is not observed by the agent. Episodes have a fixed length of 400 timesteps at 20 Hz for a total of 20 seconds, and at the end of an episode block positions are randomly re-positioned within the workspace. The robot in action is shown in Figure 4. There are two challenges in this benchmark: *Skill Mastery* (where the agent is provided data from the 5 test object triplets it is later tested on) and *Skill Generalization* (where data can only be obtained from a set of training objects that excludes the 5 test sets).

We used several sources of training data for these tasks. In Skill Generalization, for both simulation and real, we use data collected by the best sim2real agent from Lee et al. (2021). We collected data only when interacting with the designated RGB-stacking *training objects* (this amounts to a total of 387k successful trajectories in simulation and 15k successful trajectories in real). For Skill Mastery we used data from the best per-group experts from Lee et al. (2021) in simulation and from the best sim2real policy on the real robot (amounting to 219k trajectories in total). Note that this data is only included for the specific Skill Mastery experiments in Section 5.4.
4 Capabilities of the generalist agent

In this section, we summarize the performance of Gato when trained on the above described data. That is, all results across all tasks are derived from a single pretrained model with a single set of weights. Results with fine-tuning will be presented in Section 5.

4.1 Simulated control tasks

Figure 5 shows the number of distinct control tasks for which Gato performs above a given score threshold, relative to expert performance demonstrated in Gato's training data.

We report performance as a percentage, where 100% corresponds to the per-task expert and 0% to a random policy. For each simulated control task we trained our model on, we roll out the Gato policy on the corresponding environment 50 times and average the defined scores. As shown in Figure 5, Gato performs over 450 out of 604 tasks at over a 50% expert score threshold.

In ALE Atari (Bellemare et al., 2013) Gato achieves the average human (or better) scores for 23 Atari games, achieving over twice the human score for 11 games. While the single-task online RL agents which generated the data still outperform Gato, this may be overcome by adding capacity or using offline RL training rather than purely supervised training (see Section 5.5, where we present a specialist single-domain ALE Atari agent achieving better than human scores for 44 games).

On BabyAI (Chevalier-Boisvert et al., 2018) Gato achieves over 80% of expert score for nearly all levels. For the most difficult task, called BossLevel, Gato scores 75%. The two other published baselines we could find, BabyAI 1.0 and BabyAI 1.1 (Hui et al., 2020), scored 77% and 90%, respectively, having trained on this single task alone using a million demonstrations.

On Meta-World (Yu et al., 2020) Gato achieves more than 50% on 44 out of the 45 tasks that we trained on, over 80% for 35 tasks, and over 90% for 3 tasks.
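The percentages quoted in this section follow the normalization stated at the start of Section 4.1 (100% for the per-task expert, 0% for a random policy). A minimal sketch of that bookkeeping is given below; linear interpolation between the two anchors, the inclusive threshold comparison, and the function names are assumptions made for illustration.

```python
def normalized_score(score, random_score, expert_score):
    """Percentage where the random policy maps to 0 and the expert to 100."""
    return 100.0 * (score - random_score) / (expert_score - random_score)


def task_score(rollout_returns, random_score, expert_score):
    """Average the rollouts of one task, then normalize the mean."""
    mean_return = sum(rollout_returns) / len(rollout_returns)
    return normalized_score(mean_return, random_score, expert_score)


def count_tasks_above(task_scores, threshold):
    """Number of tasks whose normalized score reaches the threshold."""
    return sum(1 for score in task_scores.values() if score >= threshold)
```

A score above 100 is possible under this convention and simply means the evaluated policy outperformed the expert reference on that task.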
On canonical DM Control Suite (Tassa et al., 2018), Gato achieves better than 50% of the expert score on 21 out of 30 tasks from state, and more than 80% for 18 tasks.

4.2 Robotics

First person teleoperation enables the collection of expert demonstrations. However, such demonstrations are slow and costly to collect. Data-efficient behavior cloning methods are therefore desirable for training a generalist robot manipulator and offline pretraining is thus a well-motivated area of research. To that end, we evaluated Gato on the established RGB Stacking benchmark for robotics.

Skill Generalization Performance

The Skill Generalization challenge from the RGB Stacking robotics benchmark tests the agent's ability to stack objects of previously unseen shapes. The agent is trained on a dataset consisting of episodes of the robot stacking objects with a variety of different shapes. Five triplets of object shapes are, however, not included in the training data and serve as test triplets. We evaluated the trained generalist for 200 episodes per test triplet on the real robot. Table 2 shows that our generalist agent's success rate on each test triplet is comparable to the single task BC-IMP (filtered BC) baseline in Lee et al. (2021).

4.3 Text samples

The model demonstrates rudimentary dialogue and image captioning capabilities. Figure 6 contains a representative sample of Gato's image captioning performance. Figure 7 shows some hand-picked examples of plain text dialogue exchange.

5 Analysis

5.1 Scaling Laws Analysis

In Figure 8, we analyze the aggregate in-distribution performance of the pretrained model as a function of the number of parameters in order to get insight into how performance could improve with increased model capacity. We evaluated 3 different model sizes (measured in parameter count): a 79M model, a 364M model, and a 1.18B model (Gato). We refer to Section C for details on the three model architectures.
Here, for all three model sizes we plot the normalized return as training progresses. To get this single value, for each task we calculate the performance of the model as a percentage of expert score (the same as done in Section 4.1). Then for each domain listed in Table 1 we average the percentage scores across all tasks for that domain. Finally, we mean-aggregate the percentage scores across all domains. We can see that for an equivalent token count, there is a significant performance improvement with increased scale.

5.2 Out of distribution tasks

In this section we want to answer the following question: *Can our agent be used to solve a completely new task efficiently?* For this reason, we held out all data for four tasks from our pretraining set: cartpole.swingup (DM Control Suite domain), assembly-v2 (Meta-World domain), order_of_apples_forage_simple (DM Lab domain), and boxing (ALE Atari domain). These four tasks serve as testbeds for evaluating the out-of-distribution capabilities of Gato.

Ideally, the agent could learn to adapt to a new task via conditioning on a prompt including demonstrations of desired behaviour. However, due to accelerator memory constraints and the extremely long sequence lengths of tokenized demonstrations, the maximum context length possible does not allow the agent to attend over an informative-enough context. Therefore, to adapt the agent to new tasks or behaviours, we choose to fine-tune the agent's parameters on a limited number of demonstrations of a single task, and then evaluate the fine-tuned model's performance in the environment. Fine-tuning is very similar to pretraining with minor changes, such as a different learning rate schedule; see Section E for details.

We want to measure how the choice of data used during pretraining influences post-fine-tuning performance. To this end, we compare Gato (trained on *all data*) to variants trained on ablated datasets:

1. A model pretrained only on data from the same domain as the task to be fine-tuned on, *same domain only data*.
2. A model pretrained only on non-control data, *no control data*.
3. A model fine-tuned from scratch, i.e. no pretraining at all, *scratch*.

Since all these experiments require training a new model from scratch and then also fine-tuning, we present results using the less compute-intensive 364M parameter architecture described in Section 5.1. Results are shown in Figure 9.

Fine-tuning performance on both cartpole.swingup and assembly-v2 tasks, neither of which requires image processing, presents similar trends. Pretraining on all the datasets yields the best results, followed by pretraining on the same domain only. This difference is smaller for assembly-v2 but consistent for all few-shot datasets. For these non-image-based environments, we see either no benefit (cartpole.swingup) or even negative transfer (assembly-v2) when pretraining on *no control* datasets, which only contain images and text data.

Results for DM Lab order_of_apples_forage_simple are slightly different. Pretraining on DM Lab data only is already enough to approach the maximum reward of 19 and hence there is no observable benefit of adding data from different environments. What is different when compared to previously analysed no-vision environments is that pretraining on *no control* data helps, which can possibly be explained by the fact that agents in the DM Lab environment are fed images which, despite being simulated, are natural looking. Therefore, transfer from image captioning or visual grounded question answering tasks is possible.

We were not able to observe any benefit from pretraining on boxing. The randomly initialized model seems to work better than any of the pretrained variants considered. We hypothesise that this is caused by the game's input images being visually very distinct from the other data, suggesting transfer is difficult.
We discuss this Atari challenge further in our related work section.

5.3 Fine-tuning on Robotic Stacking Tasks

Section 4.2 demonstrates that the base Gato, capable of a diverse array of tasks, can perform competitively on the RGB Stacking Skill Generalization benchmark. In this section, we would like to answer the following question: *How does our agent improve on robotics tasks when allowed to fine-tune similarly to how we fine-tune on new tasks in Section 5.2?* We consider different model sizes and analyse the impact of pretraining datasets on the Skill Generalization benchmark, as well as a novel out of distribution task. Further analysis of fine-tuning with dataset ablations is in Appendix I.

Skill Generalization

First, we would like to show that fine-tuning on object-specific data, similarly to what was done by Lee et al. (2022), is beneficial. Therefore, we fine-tuned Gato separately on five subsets of demonstrations from the *test* dataset. Each subset was obtained by random partitioning of a test dataset consisting of demonstrations gathered by a generalist sim-to-real agent stacking real test objects. We consider this setting, which is comparable to the fine-tuning baselines on RGB stacking tasks from (Lee et al., 2022), and use the 5k dataset with which their behavior cloning 5k results were obtained. To best match their experiments, we change our return filtering scheme during training: instead of using only successful stacks, we use the normalized return of the episode.

Figure 10 compares the success rate of Gato across different fine-tuning data regimes to the sim-to-real expert and a Critic-Regularized Regression (CRR) (Wang et al., 2020) agent trained on 35k episodes of all test triplets. Gato, in both reality and simulation (red curves on the left and right figure, respectively), recovers the expert's performance with only 10 episodes, and peaks at 100 or 1000 episodes of fine-tuning data, where it exceeds the expert.
After this point (at 5000), performance degrades slightly but does not drop far below the expert's performance.

Fine-tuning and Model Size

To better understand the benefit of large models for adaptation in robotics domains, we conducted an ablation on model parameter size. This section focuses on in-simulation evaluation. Figure 10 compares the full 1.18B parameter Gato with the smaller 364M and 79M parameter variants for varying amounts of fine-tuning data. Although the 364M model overfits on one episode, causing performance to drop, there is a clear trend towards better adaptation with fewer episodes as the number of parameters is scaled up. The 79M model performs clearly worse than its bigger counterparts. The results suggest that the model's greater capacity allows the model to use representations learned from the diverse training data at test time.

Adaptation to Perceptual Variations

While the Skill Generalization task is an effective benchmark for motor skill generalization to shape variations, it does not test the agent's ability to adapt to perceptual variations and permutations in the objective specification. To further evaluate Gato's generalization capabilities, we devised a new task in the RGB stacking benchmark where the goal is to stack the blue object on the green object, for test triplet 1 (see Figure 11). First, we used a 3D mouse to collect 500 demonstrations of this task on the real robot, for a total of 2 hours and 45 minutes of demonstration data, and fine-tuned Gato on these episodes. Notably, all of the simulated and real robotics data in the pretraining set shows the robot successfully stacking the red object on the blue object, and the data does not include the object shapes in the test set. We found that additionally adding simulated demonstrations of the stack blue on green task to the fine-tuning dataset improved performance, and 10% was an ideal sampling ratio for this data.
We achieved a final 60% success rate after evaluating fine-tuned Gato on the real robot, while a BC baseline trained from scratch on the blue-on-green data achieved only 0.5% success (1/200 episodes). Qualitatively, the BC baseline would consistently move towards the blue object and occasionally pick it up and place it on top of the green object, but a full, stable stack was almost never achieved.

5.4 Robotics: Skill Mastery

Similarly to the Skill Generalization challenge discussed in Section 4.2, the Skill Mastery challenge consists of training a robotic arm to stack blocks of different shapes. However, Skill Mastery allows the agent to train on data involving the object shapes used for evaluation, i.e. the *test* set in Skill Generalization becomes a part of the Skill Mastery *training* set. Thus, this challenge serves to measure Gato's performance on in-distribution tasks (possibly with initial conditions not seen in the training demonstrations). Our Skill Mastery results use an earlier version of the Gato architecture described in Appendix H, with no fine-tuning.

Table 3 compares the group-wise success percentage and the average success across object groups for Gato and the established BC-IMP baseline. Gato exceeds or closely matches BC-IMP's performance on all but one training triplet.

5.5 Specialist single-domain multi-task agents

In this section we show results obtained with two specialist (rather than generalist) agents. Both of them were trained on data from a single domain only and rolled out 500 times for each training task without any per-task fine-tuning.

Meta-World

The first agent uses the smallest architecture introduced in Section 5.1, i.e. 79M parameters, and is trained on all 50 Meta-World tasks.
While Gato has access to the state of the MuJoCo physics engine and unlimited task seeds, the agent presented here has no access to any extra features or tasks and uses the canonical API as in (Yu et al., 2020). This experiment is to show that the architecture proposed in our paper can be used to obtain state-of-the-art agents also at small scale. The training procedure was to train single-task MPO (Abdolmaleki et al., 2018) experts on each of the MT-50 tasks individually, recording the trajectories produced while training. This experience is then combined, or distilled, into a single agent, which achieves a 96.6% success rate averaged over all 50 tasks. To the best of our knowledge this agent is the first one to accomplish nearly 100% average success rate simultaneously (multi-task) for this benchmark. See Table 7 in the supplementary material (Section K) for the full list of tasks and corresponding success rates of our agent.

ALE Atari

We also trained a specialist agent on all 51 ALE Atari tasks. As the Atari domain is much more challenging than Meta-World, we used the Gato architecture with 1.18B parameters.

The resulting agent performs better than the average human for 44 games (see Section 4.1 for details on our evaluation and scoring). We want to note that the performance of the online experts used to generate training data for the other 7 games was also below the average human. Hence, the specialist Atari agent achieved better than human performance for all games where data contained super-human episodes.
The specialist Atari agent outperforms our generalist agent Gato, which reached human-level or better scores in 23 games. This suggests that scaling Gato may result in even better performance. We have, however, purposely restricted Gato's size so that it can run in real time on the real robot.

5.6 Attention Analysis

We rendered the transformer attention weights over the image observations for various tasks, to gain a qualitative sense of how Gato attends to different regions of the image across tasks (see Figure 12). Further details and visualizations for more tasks can be found in Appendix J. These visualizations clearly show that attention tracks the task-relevant objects and regions.

5.7 Embedding Visualization

To understand how Gato encodes information differently per task, we visualized per-task embeddings.

We analysed 11 tasks. For each task, we randomly sample 100 episodes and tokenize each of them. Then, from each episode we take a subsequence of 128 tokens, compute their embeddings (at layer 12, which is half the total depth of the transformer layers) and average them over the sequence. The averaged embeddings for all tasks are used as input to PCA, which reduces their dimensionality to 50. Then, T-SNE is used to get the final 2D embeddings.

Figure 13 shows the final T-SNE embeddings plotted in 2D, colorized by task. Embeddings from the same tasks are clearly clustered together, and task clusters from the same domain and modality are also located close to each other. Even the held-out task (cartpole.swingup) is clustered correctly and lies next to another task from DM Control Suite Pixels.

6 Related Work

The most closely related architectures to that of Gato are Decision Transformers (Chen et al., 2021b; Reid et al., 2022; Zheng et al., 2022; Furuta et al., 2021) and Trajectory Transformer (Janner et al., 2021), which showed the usefulness of highly generic LM-like architectures for a variety of control problems. Gato also uses an LM-like architecture for control, but with design differences chosen to support multi-modality, multi-embodiment, large scale and general purpose deployment.
Pix2Seq (Chen et al., 2022) also uses an LM-based architecture for object detection. Perceiver IO (Jaegle et al., 2021) uses a transformer-derived architecture specialized for very long sequences, to model any modality as a sequence of bytes. This and similar architectures could be used to expand the range of modalities supported by future generalist models.

Gato was inspired by works such as GPT-3 (Brown et al., 2020) and Gopher (Rae et al., 2021), pushing the limits of generalist language models, and more recently the Flamingo (Alayrac et al., 2022) generalist visual language model. Chowdhery et al. (2022) developed the 540B parameter Pathways Language Model (PaLM) explicitly as a generalist few-shot learner for hundreds of text tasks.

Future work should consider how to unify these text capabilities into one fully generalist agent that can also act in real time in the real world, in diverse environments and embodiments.

Gato also takes inspiration from recent works on multi-embodiment continuous control. Huang et al. (2020) used message passing graph networks to build a single locomotor controller for many simulated 2D walker variants. Kurin et al. (2020) showed that transformers can outperform graph based approaches for incompatible (i.e. varying embodiment) control, despite not encoding any morphological inductive biases. Devin et al. (2017) learn a modular policy for multi-task and multi-robot transfer in simulated 2D manipulation environments. Chen et al. (2018) train a universal policy conditioned on a vector representation of robot hardware, showing successful transfer both to simulated held out robot arms, and to a real world Sawyer robot arm.

A variety of earlier generalist models have been developed that, like Gato, operate across highly distinct domains and modalities.
NPI (Reed & De Freitas, 2016) trained a single LSTM (Hochreiter & Schmidhuber, 1997) to execute diverse programs such as sorting an array and adding two numbers, such that the network is able to generalize to larger problem instances than those seen during training. Kaiser et al. (2017) developed the MultiModel that trains jointly on 8 distinct speech, image and text processing tasks including classification, image captioning and translation. Modality-specific encoders were used to process text, images, audio and categorical data, while the rest of the network parameters are shared across tasks. Schmidhuber (2018) proposed “one big net for everything”, describing a method for the incremental training of an increasingly general problem solver. Keskar et al. (2019) proposed a controllable multi-task language model that can be directed according to language domain, subdomain, entities, relationships between entities, dates, and task-specific behavior.

In this discussion, it is important to distinguish between one single multi-task network architecture versus one single neural network with the same weights for all tasks. Several popular RL agents achieve good multi-task RL results within single domains such as Atari57 and DMLab (Espeholt et al., 2018; Song et al., 2020; Hessel et al., 2019). However, it is much more common to use the same policy architecture and hyper-parameters across tasks, but the policy parameters are different in each task (Mnih et al., 2015; Tassa et al., 2018). This is also true of state-of-the-art RL methods applied to board games (Schrittwieser et al., 2020). Moreover, this choice has been adopted by off-line RL benchmarks (Gulcehre et al., 2020; Fu et al., 2020) and recent works on large sequence neural networks for control, including decision transformers (Chen et al., 2021b; Reid et al., 2022; Zheng et al., 2022) and the Trajectory Transformer of Janner et al. (2021). In contrast, in this work we learn a single network with the same weights across a diverse set of tasks.

Recent position papers advocate for highly generalist models, notably Schmidhuber (2018) proposing one big net for everything, and Bommasani et al. (2021) on foundation models. However, to our knowledge there has not yet been reported a single generalist trained on hundreds of vision, language and control tasks using modern transformer networks at scale.

“Single-brain”-style models have interesting connections to neuroscience. Mountcastle (1978) famously stated that “the processing function of neocortical modules is qualitatively similar in all neocortical regions. Put shortly, there is nothing intrinsically motor about the motor cortex, nor sensory about the sensory cortex”. Mountcastle found that columns of neurons in the cortex behave similarly whether associated with vision, hearing or motor control. This has motivated arguments that we may only need one algorithm or model to build intelligence (Hawkins & Blakeslee, 2004).

Sensory substitution provides another argument for a single model (Bach-y Rita & Kercel, 2003). For example, it is possible to build tactile visual aids for blind people as follows. The signal captured by a camera can be sent via an electrode array on the tongue to the brain. The visual cortex learns to process and interpret these tactile signals, endowing the person with some form of “vision”. This suggests that, no matter the type of input signal, the same network can process it to useful effect.

Our work is based on deep autoregressive models, which have a long history and can be found in generative models of text, images, video and audio.
Combining autoregressive generation with transformers (Vaswani et al., 2017; Devlin et al., 2018) has been of enormous impact in language modelling (Brown et al., 2020; Rae et al., 2021), protein folding (Jumper et al., 2021), vision-language models (Tsimpoukelli et al., 2021; Wang et al., 2021; Alayrac et al., 2022), code generation (Chen et al., 2021c; Li et al., 2022b), dialogue systems with retrieval capabilities (Nakano et al., 2021; Thoppilan et al., 2022), speech recognition (Pratap et al., 2020), neural machine translation (Johnson et al., 2019) and more (Bommasani et al., 2021). Recently researchers have explored task decomposition and grounding with language models (Huang et al., 2022; Ahn et al., 2022).

Li et al. (2022a) construct a control architecture, consisting of a sequence tokenizer, a pretrained language model and a task-specific feed-forward network. They apply it to VirtualHome and BabyAI tasks, and find that the inclusion of the pretrained language model improves generalisation to novel tasks. Similarly, Parisi et al. (2022) demonstrate that vision models pretrained with self-supervised learning, especially crop segmentations and momentum contrast (He et al., 2020), can be effectively incorporated into control policies.

As mentioned earlier, transfer in Atari is challenging. Rusu et al. (2016) found that Atari is a difficult domain for transfer because of pronounced differences in the visuals, controls and strategy among the different games. Further difficulties that arise when applying behaviour cloning to video games like Atari are discussed by Kanervisto et al. (2020).

There has been great recent interest in data-driven robotics (Cabi et al., 2019; Chen et al., 2021a). However, Bommasani et al. (2021) note that in robotics “the key stumbling block is collecting the right data. Unlike language and vision data, robotics data is neither plentiful nor representative of a sufficiently diverse array of embodiments, tasks, and environments”. Moreover, every time we update the hardware in a robotics lab, we need to collect new data and retrain. We argue that this is precisely why we need a generalist agent that can adapt to new embodiments and learn new tasks with few data.

Generating actions using an autoregressive model can lead to causal “self-delusion” biases when there are confounding variables (Ortega et al., 2021). For example, sampling actions can condition the model to solve the wrong task when multiple tasks share similar observation and action specifications. As explained in Section 2, we use prompt engineering in ambiguous tasks, conditioning our model on a successful demonstration. This screens off confounding variables, reducing self-delusions. Another solution which we did not explore in this work is to use counterfactual teaching, where we train a model online using instantaneous expert feedback. We leave this for future investigation.

7 Broader Impact

Although generalist agents are still only an emerging area of research, their potential impact on society calls for a thorough interdisciplinary analysis of their risks and benefits. For the sake of transparency, we document the intended use cases of Gato in the model card in Appendix A. However, the tools for mitigating harms of generalist agents are relatively underdeveloped, and require further research before these agents are deployed.

Since our generalist agent can act as a vision-language model, it inherits similar concerns as discussed in (Weidinger et al., 2021; Bommasani et al., 2021; Rae et al., 2021; Alayrac et al., 2022). In addition, generalist agents can take actions in the physical world, posing new challenges that may require novel mitigation strategies. For example, physical embodiment could lead to users anthropomorphizing the agent, leading to misplaced trust in the case of a malfunctioning system, or be exploitable by bad actors. Additionally, while cross-domain knowledge transfer is often a goal in ML research, it could create unexpected and undesired outcomes if certain behaviors (e.g. arcade game fighting) are transferred to the wrong context. The ethics and safety considerations of knowledge transfer may require substantial new research as generalist systems advance.

Technical AGI safety (Bostrom, 2017) may also become more challenging when considering generalist agents that operate in many embodiments. For this reason, preference learning, uncertainty modeling and value alignment (Russell, 2019) are especially important for the design of human-compatible generalist agents. It may be possible to extend some of the value alignment approaches for language (Ouyang et al., 2022; Kenton et al., 2021) to generalist agents. However, even as technical solutions are developed for value alignment, generalist systems could still have negative societal impacts even with the intervention of well-intentioned designers, due to unforeseen circumstances or limited oversight (Amodei et al., 2016). This limitation underscores the need for a careful design and a deployment process that incorporates multiple disciplines and viewpoints.

Understanding how the models process information, and any emergent capabilities, requires significant experimentation. External retrieval (Borgeaud et al., 2021; Menick et al., 2022; Nakano et al., 2021; Thoppilan et al., 2022) has been shown to improve both interpretability and performance, and hence should be considered in future designs of generalist agents.

Although still at the proof-of-concept stage, the recent progress in generalist models suggests that safety researchers, ethicists, and most importantly, the general public, should consider their risks and benefits. We are not currently deploying Gato to any users, and so anticipate no immediate societal impact.
However, given their potential impact, generalist models should be developed thoughtfully and deployed in a way that promotes the health and vitality of humanity.

8 Limitations and Future Work

8.1 RL data collection

Gato is a data-driven approach, as it is derived from imitation learning. While natural language or image datasets are relatively easy to obtain from the web, a web-scale dataset for control tasks is not currently available. This may seem problematic at first, especially when scaling Gato to a higher number of parameters.

That being said, there has already been extensive investigation into this issue. Offline RL aims at leveraging existing control datasets, and its increasing popularity has already resulted in the availability of more diverse and larger datasets. Richer environments and simulations are being built (e.g. Metaverse), and increasing numbers of users already interact with them among thousands of already deployed online games (e.g. there exists a large dataset of StarCraft 2 games). Real-life data has also already been stored for ML research purposes; for example, data for training self-driving cars is acquired from recording human driver data. Finally, while Gato uses data consisting of both observations and corresponding actions, the possibility of using large scale observation-only data to enhance agents has already been studied (Baker et al., 2022). Thanks to online video sharing and streaming platforms such as YouTube and Twitch, observation-only datasets are not significantly more difficult to collect than natural language datasets, motivating a future research direction to extend Gato to learn from web data.

While the previous paragraph focuses on alleviating drawbacks of data collection from RL agents, it is important to note that this approach presents a different set of tradeoffs compared to scraping web data and can actually be more practical in some situations.
Once the simulation is set up and a near-SOTA agent is trained, it can be used to generate massive amounts of high quality data. That is in contrast to web data, which is notorious for its low quality.

In short, we believe that acquiring suitable data is another research question in its own right, and this is an active area of research with increasing momentum and importance.

8.2 Prompt and short context

Gato is prompted with an expert demonstration, which helps the agent output actions corresponding to the given task. This is particularly useful since there is otherwise no task identifier available to the agent (in contrast to many multi-task RL settings). Gato infers the relevant task from the observations and actions in the prompt.

However, the context length of our agent is limited to 1024 tokens, which means the agent sometimes attends to only a few environment timesteps in total. This is especially the case for environments with image observations, where depending on the resolution each observation can result in more than one hundred tokens. Hence for certain environments only a short chunk of a demonstration episode fits in the transformer memory.

Due to this limited prompt context, preliminary experiments with different prompt structures resulted in very similar performance. Similarly, early evaluations of the model using prompt-based in-context learning on new environments did not show a significant performance improvement compared to prompt-less evaluation in the same setting.

Context length is therefore a current limitation of our architecture, mainly due to the quadratic scaling of self-attention. Many recently proposed architectures enable a longer context at greater efficiency and these innovations could potentially improve our agent performance. We hope to explore these architectures in future work.
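The context arithmetic in Section 8.2 can be made concrete with a rough Python sketch. This is an illustration only, not Gato's implementation: the 1024-token context length comes from the text, while the per-observation and per-action token counts, the one-token separator, and all function names are hypothetical values chosen for the example.

```python
def tokens_per_timestep(obs_tokens: int, action_tokens: int,
                        separator_tokens: int = 1) -> int:
    """Tokens consumed by one timestep: observation, separator, action.

    The one-token separator default is an assumption for illustration.
    """
    return obs_tokens + separator_tokens + action_tokens


def timesteps_in_context(context_len: int, per_step: int) -> int:
    """Number of whole timesteps that fit in a fixed-length context window."""
    if per_step <= 0:
        raise ValueError("per_step must be positive")
    return context_len // per_step


def attention_score_count(seq_len: int) -> int:
    """Pairwise attention scores for unmasked self-attention, per head and layer.

    Causal masking roughly halves this count, but growth stays quadratic.
    """
    return seq_len * seq_len


if __name__ == "__main__":
    CONTEXT_LEN = 1024  # context length stated in the text

    # Hypothetical environments: (label, observation tokens, action tokens).
    examples = [
        ("image observations (assumed 144 tokens)", 144, 1),
        ("low-dimensional observations (assumed 20 tokens)", 20, 6),
    ]
    for label, obs, act in examples:
        per_step = tokens_per_timestep(obs, act)
        steps = timesteps_in_context(CONTEXT_LEN, per_step)
        print(f"{label}: {per_step} tokens/step, {steps} timesteps in context")

    # Doubling the context length quadruples the number of attention scores.
    ratio = attention_score_count(2 * CONTEXT_LEN) / attention_score_count(CONTEXT_LEN)
    print(f"attention cost ratio when doubling context: {ratio:.0f}x")
```

With these assumed token counts, an image-based environment fits only 7 timesteps in the window while a low-dimensional one fits 37, which is consistent with the observation above that only a short chunk of a demonstration fits for some environments. The last line shows why simply lengthening the context is expensive under standard self-attention.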
9 Conclusions

Transformer sequence models are effective as multi-task multi-embodiment policies, including for real-world text, vision and robotics tasks. They also show promise in few-shot task learning. In the future, such models could be used as a default starting point, via prompting or fine-tuning, to learn new behaviors rather than training from scratch.

Given scaling law trends, the performance across all tasks including dialogue will increase with scale in parameters, data and compute. Better hardware and network architectures will allow training bigger models while maintaining real-time robot control capability. By scaling up and iterating on this same basic approach, we can build a useful general-purpose agent.

Acknowledgments

We would like to thank Dan Horgan, Manuel Kroiss, Mantas Pajarskas, and Thibault Sottiaux for their help with data storage infrastructure; Jean-Baptiste Lespiau and Fan Yang for help on concurrent evaluation; Joel Veness for advising on the model design; Koray Kavukcuoglu for helping inspire the project and facilitating feedback; Tom Erez for advising on the agent design and task selection for continuous control; Igor Babuschkin for helping code the initial prototype; Jack Rae for advising on the transformer language model codebase; Thomas Lampe for building robot infrastructure and advising on real robotics experiments; Boxi Wu for input on ethics and safety considerations; Pedro A. Ortega for advice in regard to causality and self-delusion biases.

Author Contributions

Scott Reed developed the project concept, wrote the initial prototype, and led the project overall.
Konrad Żołna led architecture development for vision and text, built infrastructure for tokenization and prompting, and contributed heavily to overall agent development and evaluation.
Emilio Parisotto led work on optimizing the transformer architecture, ran the largest number of experiments, and analyzed scaling law properties and in-distribution agent performance.
Sergio Gómez Colmenarejo was the technical lead, responsible for creating a scalable data loader and evaluator supporting hundreds of tasks at once, and for the initial robot integration with Gato.
Alexander Novikov developed the model including the sampler for the initial prototype, carried out experiments focusing on robotics, and created visualizations.
Gabriel Barth-Maron built scalable storage infrastructure to provide Gato with SoTA-level agent experience in Atari and other domains.
Mai Giménez conducted large scale agent data collection, built substantial data loading infrastructure, and integrated large scale visual-language datasets into the training of Gato.
Yury Sulsky contributed broadly to the Gato codebase including a bespoke distributed training sequence loader, and led the development of benchmarks for out-of-distribution generalization, and the training of competitive baseline agents.
Jackie Kay supported physical robotics infrastructure, conducted numerous evaluations and experiments to analyze the generalization properties of Gato, and contemplated broader ethical impact.
Jost Tobias Springenberg guided Gato’s deployment to the physical robot, provided strong existing baselines for block stacking, and advised on model development and experimental design.
Tom Eccles developed the Gato dialogue and image captioning demonstrations, allowing users to easily probe the vision and language capacities of agents in development.
Jake Bruce contributed to agent design as well as control datasets and environments with randomized physics and morphology variations.
Ali Razavi helped in exploring vision architectures.
Ashley Edwards contributed to the first prototype of Gato that worked on Atari, in addition to exploring alternative network architectures and training objectives.
Nicolas Heess advised on agent design, experiment design and task selection, especially for continuous control applications.
Yutian Chen advised on model design and experiments, and provided feedback in regular meetings.
Raia Hadsell advised on the design and planning of robotics efforts.
Oriol Vinyals advised on all aspects of the project, especially model architecture, training strategies and benchmark design.
Mahyar Bordbar was the primary project manager; eliciting key goals, tracking progress, facilitating presentations and feedback, and coordinating resource planning.
Nando de Freitas oversaw the project from its inception.

References

Abbas Abdolmaleki, Jost Tobias Springenberg, Yuval Tassa, Remi Munos, Nicolas Heess, and Martin Riedmiller. Maximum a posteriori policy optimisation. Preprint arXiv:1806.06920, 2018.
Samira Abnar and Willem Zuidema. Quantifying attention flow in transformers. Preprint arXiv:2005.00928, 2020.
Michael Ahn, Anthony Brohan, Noah Brown, Yevgen Chebotar, Omar Cortes, Byron David, Chelsea Finn, Keerthana Gopalakrishnan, Karol Hausman, Alex Herzog, et al. Do as I can, not as I say: Grounding language in robotic affordances. Preprint arXiv:2204.01691, 2022.
Jean-Baptiste Alayrac, Jeff Donahue, Pauline Luc, Antoine Miech, Iain Barr, Yana Hasson, Karel Lenc, Arthur Mensch, Katie Millican, Malcolm Reynolds, Roman Ring, Eliza Rutherford, Serkan Cabi, Tengda Han, Zhitao Gong, Sina Samangooei, Marianne Monteiro, Jacob Menick, Sebastian Borgeaud, Andy Brock, Aida Nematzadeh, Sahand Sharifzadeh, Mikolaj Binkowski, Ricardo Barreira, Oriol Vinyals, Andrew Zisserman, and Karen Simonyan. Flamingo: a visual language model for few-shot learning. Preprint arXiv:2204.14198, 2022.
Dario Amodei, Chris Olah, Jacob Steinhardt, Paul F. Christiano, John Schulman, and Dan Mané. Concrete problems in AI safety. Preprint arXiv:1606.06565, 2016.
Stanislaw Antol, Aishwarya Agrawal, Jiasen Lu, Margaret Mitchell, Dhruv Batra, C Lawrence Zitnick, and Devi Parikh. VQA: Visual question answering. In International Conference on Computer Vision, pp. 2425–2433, 2015.
Jimmy Lei Ba, Jamie Ryan Kiros, and Geoffrey E Hinton. Layer normalization. Preprint arXiv:1607.06450, 2016.
Paul Bach-y Rita and Stephen W Kercel. Sensory substitution and the human-machine interface. Trends in cognitive sciences, 7(12):541–546, 2003.
Bowen Baker, Ilge Akkaya, Peter Zhokhov, Joost Huizinga, Jie Tang, Adrien Ecoffet, Brandon Houghton, Raul Sampedro, and Jeff Clune. Video pretraining (VPT): Learning to act by watching unlabeled online videos. Preprint arXiv:2206.11795, 2022.
Gabriel Barth-Maron, Matthew W Hoffman, David Budden, Will Dabney, Dan Horgan, Dhruva Tb, Alistair Muldal, Nicolas Heess, and Timothy Lillicrap. Distributed distributional deterministic policy gradients. Preprint arXiv:1804.08617, 2018.
Charles Beattie, Joel Z Leibo, Denis Teplyashin, Tom Ward, Marcus Wainwright, Heinrich Küttler, Andrew Lefrancq, Simon Green, Víctor Valdés, Amir Sadik, et al. DeepMind Lab. Preprint arXiv:1612.03801, 2016.
Marc G Bellemare, Yavar Naddaf, Joel Veness, and Michael Bowling. The arcade learning environment: An evaluation platform for general agents. Journal of Artificial Intelligence Research, 47:253–279, 2013.
Rishi Bommasani, Drew A Hudson, Ehsan Adeli, Russ Altman, Simran Arora, Sydney von Arx, Michael S Bernstein, Jeannette Bohg, Antoine Bosselut, Emma Brunskill, et al. On the opportunities and risks of foundation models. Preprint arXiv:2108.07258, 2021.
Sebastian Borgeaud, Arthur Mensch, Jordan Hoffmann, Trevor Cai, Eliza Rutherford, Katie Millican, George van den Driessche, Jean-Baptiste Lespiau, Bogdan Damoc, Aidan Clark, et al. Improving language models by retrieving from trillions of tokens. Preprint arXiv:2112.04426, 2021.
Nick Bostrom. Superintelligence. Dunod, 2017.
Greg Brockman, Vicki Cheung, Ludwig Pettersson, Jonas Schneider, John Schulman, Jie Tang, and Wojciech Zaremba. OpenAI Gym. Preprint arXiv:1606.01540, 2016.
TB Brown, B Mann, N Ryder, M Subbiah, J Kaplan, P Dhariwal, A Neelakantan, P Shyam, G Sastry, A Askell, et al. Language models are few-shot learners. In Advances in Neural Information Processing Systems, pp. 1877–1901, 2020.
Serkan Cabi, Sergio Gómez Colmenarejo, Alexander Novikov, Ksenia Konyushkova, Scott Reed, Rae Jeong, Konrad Zolna, Yusuf Aytar, David Budden, Mel Vecerik, et al. Scaling data-driven robotics with reward sketching and batch reinforcement learning. Preprint arXiv:1909.12200, 2019.
Annie S Chen, Suraj Nair, and Chelsea Finn. Learning generalizable robotic reward functions from “in-the-wild” human videos. Preprint arXiv:2103.16817, 2021a.
Lili Chen, Kevin Lu, Aravind Rajeswaran, Kimin Lee, Aditya Grover, Misha Laskin, Pieter Abbeel, Aravind Srinivas, and Igor Mordatch. Decision transformer: Reinforcement learning via sequence modeling. Advances in Neural Information Processing Systems, 34, 2021b.
Mark Chen, Jerry Tworek, Heewoo Jun, Qiming Yuan, Henrique Ponde de Oliveira Pinto, Jared Kaplan, Harri Edwards, Yuri Burda, Nicholas Joseph, Greg Brockman, et al. Evaluating large language models trained on code. Preprint arXiv:2107.03374, 2021c.
Tao Chen, Adithyavairavan Murali, and Abhinav Gupta. Hardware conditioned policies for multi-robot transfer learning. Advances in Neural Information Processing Systems, 31, 2018.
Ting Chen, Saurabh Saxena, Lala Li, David J Fleet, and Geoffrey Hinton. Pix2seq: A language modeling framework for object detection. In ICLR, 2022.
Xinlei Chen, Hao Fang, Tsung-Yi Lin, Ramakrishna Vedantam, Saurabh Gupta, Piotr Dollár, and C Lawrence Zitnick. Microsoft COCO captions: Data collection and evaluation server. Preprint arXiv:1504.00325, 2015.
Maxime Chevalier-Boisvert, Dzmitry Bahdanau, Salem Lahlou, Lucas Willems, Chitwan Saharia, Thien Huu Nguyen, and Yoshua Bengio. BabyAI: A platform to study the sample efficiency of grounded language learning. Preprint arXiv:1810.08272, 2018.
Aakanksha Chowdhery, Sharan Narang, Jacob Devlin, Maarten Bosma, Gaurav Mishra, Adam Roberts, Paul Barham, Hyung Won Chung, Charles Sutton, Sebastian Gehrmann, et al. PaLM: Scaling language modeling with pathways. Preprint arXiv:2204.02311, 2022.
Karl Cobbe, Chris Hesse, Jacob Hilton, and John Schulman. Leveraging procedural generation to benchmark reinforcement learning. In International Conference on Machine Learning, pp. 2048–2056, 2020.
Zihang Dai, Zhilin Yang, Yiming Yang, Jaime G Carbonell, Quoc Le, and Ruslan Salakhutdinov. Transformer-XL: Attentive language models beyond a fixed-length context. In Annual Meeting of the Association for Computational Linguistics, pp. 2978–2988, 2019.
Coline Devin, Abhishek Gupta, Trevor Darrell, Pieter Abbeel, and Sergey Levine. Learning modular neural network policies for multi-task and multi-robot transfer. In IEEE International Conference on Robotics & Automation, pp. 2169–2176, 2017.
Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. BERT: Pre-training of deep bidirectional transformers for language understanding. Preprint arXiv:1810.04805, 2018.
Alexey Dosovitskiy, Lucas Beyer, Alexander Kolesnikov, Dirk Weissenborn, Xiaohua Zhai, Thomas Unterthiner, Mostafa Dehghani, Matthias Minderer, Georg Heigold, Sylvain Gelly, et al. An image is worth 16x16 words: Transformers for image recognition at scale. Preprint arXiv:2010.11929, 2020.
Lasse Espeholt, Hubert Soyer, Remi Munos, Karen Simonyan, Vlad Mnih, Tom Ward, Yotam Doron, Vlad Firoiu, Tim Harley, Iain Dunning, et al. IMPALA: Scalable distributed deep-RL with importance weighted actor-learner architectures. In International Conference on Machine Learning, pp. 1407–1416, 2018.
Justin Fu, Aviral Kumar, Ofir Nachum, George Tucker, and Sergey Levine. D4RL: Datasets for deep data-driven reinforcement learning. Preprint arXiv:2004.07219, 2020.
Hiroki Furuta, Yutaka Matsuo, and Shixiang Shane Gu. Generalized decision transformer for offline hindsight information matching. Preprint arXiv:2111.10364, 2021.
Caglar Gulcehre, Ziyu Wang, Alexander Novikov, Thomas Paine, Sergio Gómez, Konrad Zolna, Rishabh Agarwal, Josh S Merel, Daniel J Mankowitz, Cosmin Paduraru, et al. RL unplugged: A suite of benchmarks for offline reinforcement learning. Advances in Neural Information Processing Systems, 33:7248–7259, 2020.
Jeff Hawkins and Sandra Blakeslee. On intelligence. Macmillan, 2004.
Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image recognition. In IEEE Computer Vision and Pattern Recognition, pp. 770–778, 2016a.
Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Identity mappings in deep residual networks. In European Conference on Computer Vision, pp. 630–645, 2016b.
Kaiming He, Haoqi Fan, Yuxin Wu, Saining Xie, and Ross Girshick. Momentum contrast for unsupervised visual representation learning. In IEEE Computer Vision and Pattern Recognition, pp. 9729–9738, 2020.
Dan Hendrycks and Kevin Gimpel. Gaussian error linear units (GELUs). Preprint arXiv:1606.08415, 2016.
Matteo Hessel, Hubert Soyer, Lasse Espeholt, Wojciech Czarnecki, Simon Schmitt, and Hado van Hasselt. Multi-task deep reinforcement learning with PopArt. In AAAI, 2019.
Matteo Hessel, Ivo Danihelka, Fabio Viola, Arthur Guez, Simon Schmitt, Laurent Sifre, Theophane Weber, David Silver, and Hado van Hasselt. Muesli: Combining improvements in policy optimization. Preprint arXiv:2104.06159, 2021.
Sepp Hochreiter and Jürgen Schmidhuber. Long short-term memory. Neural computation, 9(8):1735–1780, 1997.
Jordan Hoffmann, Sebastian Borgeaud, Arthur Mensch, Elena Buchatskaya, Trevor Cai, Eliza Rutherford, Diego de Las Casas, Lisa Anne Hendricks, Johannes Welbl, Aidan Clark, et al. Training compute-optimal large language models. Preprint arXiv:2203.15556, 2022.
Gao Huang, Yu Sun, Zhuang Liu, Daniel Sedra, and Kilian Weinberger. Deep networks with stochastic depth. Preprint arXiv:1603.09382, 2016.
Wenlong Huang, Igor Mordatch, and Deepak Pathak. One policy to control them all: Shared modular policies for agent-agnostic control. In International Conference on Machine Learning, pp. 4455–4464, 2020.
Wenlong Huang, Pieter Abbeel, Deepak Pathak, and Igor Mordatch. Language models as zero-shot planners: Extracting actionable knowledge for embodied agents. Preprint arXiv:2201.07207, 2022.
David Yu-Tung Hui, Maxime Chevalier-Boisvert, Dzmitry Bahdanau, and Yoshua Bengio. BabyAI 1.1. Preprint arXiv:2007.12770, 2020.
Andrew Jaegle, Sebastian Borgeaud, Jean-Baptiste Alayrac, Carl Doersch, Catalin Ionescu, David Ding, Skanda Koppula, Daniel Zoran, Andrew Brock, Evan Shelhamer, et al. Perceiver IO: A general architecture for structured inputs & outputs. Preprint arXiv:2107.14795, 2021.
Michael Janner, Qiyang Li, and Sergey Levine. Offline reinforcement learning as one big sequence modeling problem. Advances in Neural Information Processing Systems, 34, 2021.
Chao Jia, Yinfei Yang, Ye Xia, Yi-Ting Chen, Zarana Parekh, Hieu Pham, Quoc Le, Yun-Hsuan Sung, Zhen Li, and Tom Duerig. Scaling up visual and vision-language representation learning with noisy text supervision. In International Conference on Machine Learning, pp. 4904–4916, 2021.
Melvin Johnson, Orhan Firat, and Roee Aharoni. Massively multilingual neural machine translation. In Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pp. 3874–3884, 2019.
John Jumper, Richard Evans, Alexander Pritzel, Tim Green, Michael Figurnov, Olaf Ronneberger, Kathryn Tunyasuvunakool, Russ Bates, Augustin Žídek, Anna Potapenko, et al. Highly accurate protein structure prediction with AlphaFold. Nature, 596(7873):583–589, 2021.
Lukasz Kaiser, Aidan N Gomez, Noam Shazeer, Ashish Vaswani, Niki Parmar, Llion Jones, and Jakob Uszkoreit. One model to learn them all. Preprint arXiv:1706.05137, 2017.
Anssi Kanervisto, Joonas Pussinen, and Ville Hautamäki. Benchmarking end-to-end behavioural cloning on video games. In IEEE conference on games (CoG), pp. 558–565, 2020.
Jared Kaplan, Sam McCandlish, Tom Henighan, Tom B Brown, Benjamin Chess, Rewon Child, Scott Gray, Alec Radford, Jeffrey Wu, and Dario Amodei. Scaling laws for neural language models. Preprint arXiv:2001.08361, 2020.
Steven Kapturowski, Georg Ostrovski, John Quan, Remi Munos, and Will Dabney. Recurrent experience replay in distributed reinforcement learning. In International Conference on Learning Representations, 2018.
Zachary Kenton, Tom Everitt, Laura Weidinger, Iason Gabriel, Vladimir Mikulik, and Geoffrey Irving. Alignment of language agents. Preprint arXiv:2103.14659, 2021.
Nitish Shirish Keskar, Bryan McCann, Lav R Varshney, Caiming Xiong, and Richard Socher. CTRL: A conditional transformer language model for controllable generation. Preprint arXiv:1909.05858, 2019.
Diederik P. Kingma and Jimmy Ba. Adam: A method for stochastic optimization. Preprint arXiv:1412.6980, 2014.
Taku Kudo and John Richardson. SentencePiece: A simple and language independent subword tokenizer and detokenizer for neural text processing. In Annual Meeting of the Association for Computational Linguistics, pp. 66–71, 2018.
Vitaly Kurin, Maximilian Igl, Tim Rocktäschel, Wendelin Boehmer, and Shimon Whiteson. My body is a cage: the role of morphology in graph-based incompatible control. Preprint arXiv:2010.01856, 2020.
Alex X Lee, Coline Manon Devin, Yuxiang Zhou, Thomas Lampe, Konstantinos Bousmalis, Jost Tobias Springenberg, Arunkumar Byravan, Abbas Abdolmaleki, Nimrod Gileadi, David Khosid, et al. Beyond pick-and-place: Tackling robotic stacking of diverse shapes. In Conference on Robot Learning, 2021.
Alex X Lee, Coline Manon Devin, Jost Tobias Springenberg, Yuxiang Zhou, Thomas Lampe, Abbas Abdolmaleki, and Konstantinos Bousmalis. How to spend your robot time: Bridging kickstarting and offline reinforcement learning for vision-based robotic manipulation. Preprint arXiv:2205.03353, 2022.
Shuang Li, Xavier Puig, Chris Paxton, Yilun Du, Clinton Wang, Linxi Fan, Tao Chen, De-An Huang, Ekin Akyürek, Anima Anandkumar, Jacob Andreas, Igor Mordatch, Antonio Torralba, and Yuke Zhu. Pre-trained language models for interactive decision-making. Preprint arXiv:2202.01771, 2022a.
Yujia Li, David Choi, Junyoung Chung, Nate Kushman, Julian Schrittwieser, Rémi Leblond, Tom Eccles, James Keeling, Felix Gimeno, Agustin Dal Lago, et al. Competition-level code generation with AlphaCode. Preprint arXiv:2203.07814, 2022b.
Ilya Loshchilov and Frank Hutter. Decoupled weight decay regularization. Preprint arXiv:1711.05101, 2017.
Kenneth Marino, Mohammad Rastegari, Ali Farhadi, and Roozbeh Mottaghi. Ok-VQA: A visual question answering benchmark requiring external knowledge. In IEEE Computer Vision and Pattern Recognition, pp. 3195–3204, 2019.
Jacob Menick, Maja Trebacz, Vladimir Mikulik, John Aslanides, Francis Song, Martin Chadwick, Mia Glaese, Susannah Young, Lucy Campbell-Gillingham, Geoffrey Irving, et al. Teaching language models to support answers with verified quotes. Preprint arXiv:2203.11147, 2022.
Margaret Mitchell, Simone Wu, Andrew Zaldivar, Parker Barnes, Lucy Vasserman, Ben Hutchinson, Elena Spitzer, Inioluwa Deborah Raji, and Timnit Gebru. Model cards for model reporting. In Proceedings of the Conference on Fairness, Accountability, and Transparency, pp. 220–229, 2019.
Volodymyr Mnih, Koray Kavukcuoglu, David Silver, Andrei A Rusu, Joel Veness, Marc G Bellemare, Alex Graves, Martin Riedmiller, Andreas K Fidjeland, Georg Ostrovski, et al. Human-level control through deep reinforcement learning. Nature, 518(7540):529–533, 2015.
Vernon Mountcastle. An organizing principle for cerebral function: the unit module and the distributed system. The mindful brain, 1978.
Reiichiro Nakano, Jacob Hilton, Suchir Balaji, Jeff Wu, Long Ouyang, Christina Kim, Christopher Hesse, Shantanu Jain, Vineet Kosaraju, William Saunders, et al. WebGPT: Browser-assisted question-answering with human feedback. Preprint arXiv:2112.09332, 2021.
Aaron van den Oord, Sander Dieleman, Heiga Zen, Karen Simonyan, Oriol Vinyals, Alex Graves, Nal Kalchbrenner, Andrew Senior, and Koray Kavukcuoglu. WaveNet: A generative model for raw audio. Preprint arXiv:1609.03499, 2016.
Pedro A Ortega, Markus Kunesch, Grégoire Delétang, Tim Genewein, Jordi Grau-Moya, Joel Veness, Jonas Buchli, Jonas Degrave, Bilal Piot, Julien Perolat, et al. Shaking the foundations: delusions in sequence models for interaction and control. Preprint arXiv:2110.10819, 2021.
Long Ouyang, Jeff Wu, Xu Jiang, Diogo Almeida, Carroll L Wainwright, Pamela Mishkin, Chong Zhang, Sandhini Agarwal, Katarina Slama, Alex Ray, et al. Training language models to follow instructions with human feedback. Preprint arXiv:2203.02155, 2022.
Simone Parisi, Aravind Rajeswaran, Senthil Purushwalkam, and Abhinav Gupta. The unsurprising effectiveness of pre-trained vision models for control. Preprint arXiv:2203.03580, 2022.
Vineel Pratap, Anuroop Sriram, Paden Tomasello, Awni Hannun, Vitaliy Liptchinsky, Gabriel Synnaeve, and Ronan Collobert. Massively multilingual ASR: 50 languages, 1 model, 1 billion parameters. Preprint arXiv:2007.03001, 2020.
Sébastien Racanière, Théophane Weber, David Reichert, Lars Buesing, Arthur Guez, Danilo Jimenez Rezende, Adrià Puigdomènech Badia, Oriol Vinyals, Nicolas Heess, Yujia Li, et al. Imagination-augmented agents for deep reinforcement learning. Advances in Neural Information Processing Systems, 30, 2017.
Jack W Rae, Sebastian Borgeaud, Trevor Cai, Katie Millican, Jordan Hoffmann, Francis Song, John Aslanides, Sarah Henderson, Roman Ring, Susannah Young, et al. Scaling language models: Methods, analysis & insights from training Gopher. Preprint arXiv:2112.11446, 2021.
Scott Reed and Nando De Freitas. Neural programmer-interpreters. In International Conference on Learning Representations, 2016.
Machel Reid, Yutaro Yamada, and Shixiang Shane Gu. Can Wikipedia help offline reinforcement learning? Preprint arXiv:2201.12122, 2022.
Stuart Russell. Human compatible: Artificial intelligence and the problem of control. Penguin, 2019.
Andrei A Rusu, Neil C Rabinowitz, Guillaume Desjardins, Hubert Soyer, James Kirkpatrick, Koray Kavukcuoglu, Razvan Pascanu, and Raia Hadsell. Progressive neural networks. Preprint arXiv:1606.04671, 2016.
Victor Sanh, Albert Webson, Colin Raffel, Stephen Bach, Lintang Sutawika, Zaid Alyafeai, Antoine Chaffin, Arnaud Stiegler, Arun Raja, Manan Dey, M Saiful Bari, Canwen Xu, Urmish Thakker, Shanya Sharma Sharma, Eliza Szczechla, Taewoon Kim, Gunjan Chhablani, Nihal Nayak, Debajyoti Datta, Jonathan Chang, Mike Tian-Jian Jiang, Han Wang, Matteo Manica, Sheng Shen, Zheng Xin Yong, Harshit Pandey, Rachel Bawden, Thomas Wang, Trishala Neeraj, Jos Rozen, Abheesht Sharma, Andrea Santilli, Thibault Fevry, Jason Alan Fries, Ryan Teehan, Teven Le Sca. In International Conference on Learning Representations, 2022.
Jürgen Schmidhuber. One big net for everything. , 2018.
Preprint arXiv:1802.08864 Julian Schrittwieser, Ioannis Antonoglou, Thomas Hubert, Karen Simonyan, Laurent Sifre, Simon Schmitt, Arthur Guez, Edward Lockhart, Demis Hassabis, Thore Graepel ฯลฯ การปกครอง atari, go, chess และ shogi โดยการวางแผนด้วยรูปแบบที่เรียนรู้ , 588(7839):604–609, 2020. Nature Piyush Sharma, Nan Ding, Sebastian Goodman, และ Radu Soricut. คําอธิบายแนวคิด: ชุดข้อมูลภาพอัลตร้าเท็กซ์ที่ทําความสะอาดและซับซ้อนเพื่อการอ้างอิงภาพโดยอัตโนมัติ ใน , pp. 2556–2565, 2018 Annual Meeting of the Association for Computational Linguistics Noam Shazeer. Glu variants improve transformer. , 2020. Preprint arXiv::2002.05202 H Francis Song, Abbas Abdolmaleki, Jost Tobias Springenberg, Aidan Clark, Hubert Soyer, Jack W Rae, Seb Noury, Arun Ahuja, Siqi Liu, Dhruva Tirumala ฯลฯ V-mpo: On-policy maximum a posteriori policy optimization สําหรับการควบคุมที่ละเอียดอ่อนและต่อเนื่อง , 2020. ICLR Nitish Srivastava, Geoffrey Hinton, Alex Krizhevsky, Ilya Sutskever, and Ruslan Salakhutdinov. Dropout: A simple way to prevent neural networks from overfitting. , 15(56): 1929–1958, 2014. Journal of Machine Learning Research Richard Sutton. The bitter lesson. , 13:12, 2019 ความคิดที่ไม่สมบูรณ์ (บล็อก) Yuval Tassa, Yotam Doron, Alistair Muldal, Tom Erez, Yazhe Li, Diego de Las Casas, David Budden, Abbas Abdolmaleki, Josh Merel, Andrew Lefrancq, et al. DeepMind control suite. , 2018. Preprint arXiv:1801.00690 Romal Thoppilan, Daniel De Freitas, Jamie Hall, Noam Shazeer, Apoorv Kulshreshtha, Heng-Tze Cheng, Alicia Jin, Taylor Bos, Leslie Baker, Yu Du, et al. LaMDA: Language models for dialog applications. , 2022. การพิมพ์ล่วงหน้า arXiv:2201.08239 Emanuel Todorov, Tom Erez, และ Yuval Tassa. Mujoco: เครื่องยนต์ทางกายภาพสําหรับการควบคุมแบบจําลอง , pp. 5026–5033, 2012 International Conference on Intelligent Robots and Systems Maria Tsimpoukelli, Jacob L Menick, Serkan Cabi, SM Eslami, Oriol Vinyals, and Felix Hill. Multimodal few-shot learning with frozen language models. , pp. 
200-212, 2021 ความก้าวหน้าในระบบประมวลผลข้อมูลประสาท Saran Tunyasuvunakool, Alistair Muldal, Yotam Doron, Siqi Liu, Steven Bohez, Josh Merel, Tom Erez, Timothy Lillicrap, Nicolas Heess, and Yuval Tassa. dm_control: Software and tasks for continuous control. , 6:100022, 2020 ผลกระทบของซอฟต์แวร์ Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N Gomez, Łukasz Kaiser, and Illia Polosukhin. Attention is all you need. , 30, 2017. Advances in Neural Information Processing Systems Zirui Wang, Jiahui Yu, Adams Wei Yu, Zihang Dai, Yulia Tsvetkov และ Yuan Cao. Simvlm: แบบจําลองภาษาภาพที่เรียบง่ายการฝึกอบรมก่อนด้วยการดูแลที่อ่อนแอ , 2021. การพิมพ์ล่วงหน้า arXiv:2108.10904 Ziyu Wang, Alexander Novikov, Konrad Zolna, Josh S Merel, Jost Tobias Springenberg, Scott E Reed, Bobak Shahriari, Noah Siegel, Caglar Gulcehre, Nicolas Heess, et al. Critic regularized regression. , 33:7768–7778, 2020. Advances in Neural Information Processing Systems Jason Wei, Maarten Bosma, Vincent Y Zhao, Kelvin Guu, Adams Wei Yu, Brian Lester, Nan Du, Andrew M Dai, และ Quoc V Le. รูปแบบภาษา Finetuned เป็นผู้เรียนรู้แบบไร้รอยต่อ , 2021. การพิมพ์ล่วงหน้า arXiv:2109.01652 Laura Weidinger, John Mellor, Maribeth Rauh, Conor Griffin, Jonathan Uesato, Po-Sen Huang, Myra Cheng, Mia Glaese, Borja Balle, Atoosa Kasirzadeh et al. ความเสี่ยงทางจริยธรรมและสังคมของความเสียหายจากรูปแบบภาษา , 2021. Preprint arXiv:2112.04359 Yuxin Wu and Kaiming He. Group normalization. In , pp. 3–19, 2018. European Conference on Computer Vision Tianhe Yu, Deirdre Quillen, Zhanpeng He, Ryan Julian, Karol Hausman, Chelsea Finn, and Sergey Levine. Meta-World: A benchmark and evaluation for multi-task and meta reinforcement learning. In , pp. 1094–1100, 2020. การประชุมเกี่ยวกับการเรียนรู้หุ่นยนต์ Qinqing Zheng, Amy Zhang, and Aditya Grover. Online decision transformer. 
Preprint arXiv:2202.05607, 2022.
Konrad Zolna, Alexander Novikov, Ksenia Konyushkova, Caglar Gulcehre, Ziyu Wang, Yusuf Aytar, Misha Denil, Nando de Freitas, and Scott Reed. Offline learning from demonstrations and unlabeled experience. Preprint arXiv:2011.13885, 2020.
Konrad Zolna, Scott Reed, Alexander Novikov, Sergio Gómez Colmenarejo, David Budden, Serkan Cabi, Misha Denil, Nando de Freitas, and Ziyu Wang. Task-relevant adversarial imitation learning. In Conference on Robot Learning, pp. 247–263, 2021.

Supplementary Material

A Model card

We present a model card for Gato in Table 4.

Table 4: Gato Model Card. We follow the framework proposed in (Mitchell et al., 2019).

B Agent Data Tokenization Details

In this section we provide additional details on our tokenization schemes. Our agent data is sequenced as follows:

• Episodes are presented to the agent in order of time (timesteps).
• Timesteps in turn are presented in the following order:
  – Observations ($[y_{1:k}, x_{1:m}, z_{1:n}]$) are ordered lexicographically by key, and each item is sequenced as follows:
    ∗ Text tokens ($y_{1:k}$) are in the same order as the raw input text.
    ∗ Image patch tokens ($x_{1:m}$) are in raster order.
    ∗ Tensors ($z_{1:n}$) (such as discrete and continuous observations) are in row-major order.
  – Separator ('|'); a designated separator token is provided after observations.
  – Actions ($a_{1:A}$) are tokenized as discrete or continuous values and in row-major order.

A full sequence of tokens is thus given as the concatenation of data from T timesteps:

$$s_{1:L} = [[y^1_{1:k}, x^1_{1:m}, z^1_{1:n}, \text{'|'}, a^1_{1:A}], \ldots, [y^T_{1:k}, x^T_{1:m}, z^T_{1:n}, \text{'|'}, a^T_{1:A}]],$$

where $L = T(k + m + n + 1 + A)$ is the total number of tokens.

Each floating point element of tensors in the observation sequence is mu-law companded as in WaveNet (Oord et al., 2016):

$$F(x) = \operatorname{sgn}(x)\,\frac{\log(|x|\mu + 1.0)}{\log(M\mu + 1.0)}$$

with parameters µ = 100 and M = 256. (If the floating-point tensor is in the action set, we do not need to compand the elements in the sequence because actions are only defined in the range [−1, 1] for all our environments.)
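As a rough illustration of this pipeline (companding, followed by the clipping and uniform binning described next), here is a minimal Python sketch. It assumes the companding function has the form F(x) = sgn(x)·log(|x|µ + 1)/log(Mµ + 1) with µ = 100 and M = 256, and that bin indices are offset by the 32000 text-token ids implied by the stated token range. The function names and the handling of the upper bin edge are illustrative assumptions, not details taken from the paper.

```python
import math

MU = 100.0           # companding parameter mu quoted in the text
M = 256.0            # companding parameter M quoted in the text
NUM_BINS = 1024      # number of uniform bins on [-1, 1]
TEXT_VOCAB = 32000   # offset so value tokens do not collide with text tokens


def mu_law(x: float) -> float:
    """Compand a float: sgn(x) * log(|x| * mu + 1) / log(M * mu + 1)."""
    magnitude = math.log(abs(x) * MU + 1.0) / math.log(M * MU + 1.0)
    return math.copysign(magnitude, x)


def tokenize_continuous(x: float, is_action: bool = False) -> int:
    """Map one float to an integer token in [32000, 33024)."""
    # Actions are already defined on [-1, 1], so they skip companding.
    y = x if is_action else mu_law(x)
    # Clip so every value falls inside [-1, 1].
    y = max(-1.0, min(1.0, y))
    # Uniform binning; folding y == 1.0 into the last bin is an assumption.
    bin_index = min(int((y + 1.0) / 2.0 * NUM_BINS), NUM_BINS - 1)
    return TEXT_VOCAB + bin_index
```

Under these assumptions an observation value of 0.0 lands in the middle bin (token 32512), while values whose magnitude is at least M = 256 saturate at the first or last bin.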
All the elements are subsequently clipped so that they fall in the interval [−1, 1]. Finally, they are discretized using bins of uniform width on the domain [−1, 1]. We use 1024 bins and shift the resulting integers so they do not overlap with the ones used for text tokens. The tokenized result is therefore a sequence of integers within the range [32000, 33024).

See Figure 14 and Figure 15 for visualizations of tokenizing and sequencing values (both discrete and continuous) and images. See Section C for details about the local position encodings referenced in the figures.

C Model Architecture

C.1 Transformer Hyperparameters

The transformer hyperparameters of Gato are presented in Table 5. We also list the hyperparameters of smaller architecture variants used in Section 5.

C.2 Embedding Function

The ResNet block uses the v2 architecture (He et al., 2016b), contains GroupNorm (Wu & He, 2018) with 32 groups instead of LayerNorm (Ba et al., 2016), and GELU (Hendrycks & Gimpel, 2016) activation functions instead of RELU. The block is diagrammed in Figure 16.

C.3 Position Encodings

After tokens are mapped into token embeddings, two position encodings are added to the token embeddings (when applicable) to provide temporal and spatial information to the model. These are described below.
Patch Position Encodings

These position encodings convey information about a patch's global position within the image from which the patch was extracted. First, the relative row and column intervals of the patch are calculated by normalizing the patch's pixel intervals by the image resolution. The normalized row and column intervals are then quantized into a vocabulary size (we use 128) and are used to index a row table and a column table of learnable position encodings. The way the quantized row and column intervals are converted into indices depends on whether we are training or evaluating the model: during training a random index is uniformly sampled from the quantized interval, while during evaluation we use the …

To demonstrate this process more concretely, we provide an example in Figure 17. We will follow the process with the patch highlighted in red on the left of the figure. The image has resolution 80 × 64 and each patch is 16 × 16, meaning there are 5 × 4 = 20 patches in total. The highlighted patch starts at pixel row interval [16, 32] and pixel column interval [32, 48]. Normalized, the row interval is therefore [0.25, 0.5] and the column interval is [0.4, 0.6]. We then separately quantize the intervals into 128 uniformly spaced bins, with …

Local Observation Position Encodings

The local observation position encoding adds positional information about where observation tokens are positioned within the local time-step they were an element of. First, we reiterate that, during tokenization, for each time-step all elements of the observation set are tokenized into sequences and concatenated into an observation sequence. Each token in this observation sequence is given an index which corresponds to the sequence order, i.e. the first token is 0 and the last is the length of the observation sequence minus one. After embedding, for any tokens that were a part of an observation set, the corresponding observation token index is used to index an embedding table of learnable position encodings, with one embedding for every possible observation token index (in practice we simply set the table size to a large value like 512).
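Returning briefly to the patch position encodings, the interval arithmetic of the worked example can be sketched in Python. This is only an illustration under stated assumptions: the column interval is taken as [32, 48], which is what a 16-pixel-wide patch and the normalized interval [0.4, 0.6] imply; bin edges are rounded to the nearest integer; training samples use inclusive bounds; and the evaluation rule, which is cut off in the text above, is assumed to be the rounded midpoint of the quantized interval. The function names are hypothetical.

```python
import random

VOCAB_SIZE = 128  # position vocabulary size quoted in the text


def quantized_interval(start_px, end_px, image_size_px, vocab_size=VOCAB_SIZE):
    """Normalize a pixel interval by the image size, then quantize it."""
    lo = round(start_px / image_size_px * vocab_size)
    hi = round(end_px / image_size_px * vocab_size)
    # Clamping to the last valid table index is an assumption of this sketch.
    return min(lo, vocab_size - 1), min(hi, vocab_size - 1)


def position_index(interval, training, rng=random):
    """Turn a quantized interval into a single position-table index."""
    lo, hi = interval
    if training:
        # Uniform sample from the interval (inclusive bounds assumed).
        return rng.randint(lo, hi)
    # Assumed evaluation rule: rounded midpoint of the interval.
    return round((lo + hi) / 2)


# Worked example: 80 x 64 image (width x height) with 16 x 16 patches.
row_interval = quantized_interval(16, 32, image_size_px=64)  # rows use the height
col_interval = quantized_interval(32, 48, image_size_px=80)  # columns use the width
row_index = position_index(row_interval, training=False)
col_index = position_index(col_interval, training=False)
```

With these assumptions the highlighted patch gets quantized intervals (32, 64) for rows and (51, 77) for columns, and the evaluation-time indices are 48 and 64. Each index would then select one row embedding and one column embedding from two separate learnable tables.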
The position encoding is then added onto the observation token embedding to produce the final token embedding. Note that all action tokens are given the same position encoding regardless of their position in the time-step sequence. We illustrate an example of this process in Figure 18.

D Pretraining Setup

Optimizer: For all models we use the AdamW (Loshchilov & Hutter, 2017) optimizer with a linear warm-up and a cosine schedule decay. The linear warm-up lasts for 15,000 steps, starting from a learning rate of 1e-7 and ending at a maximum learning rate that differs depending on the model (see Table 6). This learning rate is then cosine decayed by a factor of 10x over 1,000,000 steps. The AdamW optimizer has parameters β1 = 0.9, β2 = 0.95 and ϵ = 1e-8. We use a batch size of 512 and a sequence length of 1024 tokens for all models.

Regularization: We train with an AdamW weight decay parameter of 0.1. Additionally, we use stochastic depth (Huang et al., 2016) during pretraining, where each of the transformer sub-layers (i.e. each Multi-Head Attention and Dense Feedforward layer) is skipped with a probability of 0.1.

E Fine-tuning Setup

Optimizer: For all models we use the Adam (Kingma & Ba, 2014) optimizer with a constant learning rate of 1e-5. The Adam optimizer has parameters β1 = 0.9, β2 = 0.95 and ϵ = 1e-8. We use a batch size of 64 and a sequence length of 1024 tokens for all models. We train for 10,000 gradient steps.

Regularization: We use dropout with a rate of 0.1 (Srivastava et al.
, 2014).

Evaluation: We evaluate the agent every 100 learning steps. Each evaluation reports the average of 10 runs of a given checkpoint. A moving average of 5 such scores is computed (so that each smoothed score aggregates 50 runs). The final fine-tuning performance is defined as the maximum of these smoothed scores.

Datasets: We generated data for the fine-tuning tasks in the same way as for the other tasks (see Section 3.1 for details). Instead of using all the data for a fine-tuning task, we discarded all but the 2000 best episodes (those leading to the highest returns). The fine-tuning datasets were created in the following way. We randomly took 1000 episodes (out of the 2000 preselected episodes), then a subset of 100 episodes from the selected episodes, then 10, 5, 3, and finally a single episode. We repeated this procedure 3 times to obtain 3 series of nested fine-tuning subsets for each task. Each subset is used to conduct one fine-tuning experiment, and each is reported in Section 5.2 as a separate point.

Task settings: We have not altered any of the tasks and used their canonical versions. As 3 out of 4 tasks are open sourced, they do not need further explanation. For the fourth task, DMLab order_of_apples_forage_simple, the goal is to collect apples in the right order, green ones first followed by the gold one.

F Data Collection Details

F.1 Atari

We collect two separate sets of Atari environments. The first (which we refer to as ALE Atari) consists of 51 canonical games from the Arcade Learning Environment (Bellemare et al., 2013). The second (which we refer to as ALE Atari Extended) is a set of alternative games with their game mode and difficulty randomly set at the beginning of each episode.

For each environment in these sets we collect data by training a Muesli (Hessel et al., 2021) agent for 200M total environment steps. We record approximately 20,000 random episodes generated by the agent during training.
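Two of the numerical recipes from Appendices D and E are easy to misread in prose, so here is a minimal Python sketch of both: the warm-up plus cosine learning-rate schedule and the smoothed fine-tuning score. Several details are assumptions made for illustration: the cosine phase is taken to start right after warm-up and to end at one tenth of the peak rate, the peak rate itself is a caller-supplied value (the per-model values live in Table 6, which is not reproduced here), and the function names are hypothetical.

```python
import math


def pretraining_lr(step, peak_lr, warmup_steps=15_000, decay_steps=1_000_000,
                   init_lr=1e-7, decay_factor=10.0):
    """Linear warm-up to peak_lr, then cosine decay to peak_lr / decay_factor."""
    if step < warmup_steps:
        # Linear interpolation from init_lr up to peak_lr.
        return init_lr + (peak_lr - init_lr) * step / warmup_steps
    final_lr = peak_lr / decay_factor
    # Fraction of the cosine phase completed, capped at 1 after decay_steps.
    progress = min((step - warmup_steps) / decay_steps, 1.0)
    return final_lr + 0.5 * (peak_lr - final_lr) * (1.0 + math.cos(math.pi * progress))


def finetuning_score(eval_means, window=5):
    """Maximum over moving averages of `window` consecutive evaluation means.

    Each entry of eval_means is the mean return of 10 runs of one checkpoint.
    """
    if len(eval_means) < window:
        raise ValueError("need at least `window` evaluations")
    smoothed = [sum(eval_means[i:i + window]) / window
                for i in range(len(eval_means) - window + 1)]
    return max(smoothed)
```

For example, `pretraining_lr(0, peak_lr=1e-4)` returns the initial rate 1e-7, and `finetuning_score` over evaluation means `[0, 1, 2, 3, 4, 5, 0]` returns 3.0, the best five-evaluation window.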
F.2 Sokoban

Sokoban is a planning problem (Racanière et al., 2017) in which the agent has to push boxes to target locations. Some of the moves are irreversible, and consequently mistakes can render the puzzle unsolvable. Planning ahead is therefore necessary to succeed at this puzzle. We use a Muesli (Hessel et al., 2021) agent to collect training data.

F.3 BabyAI

BabyAI is a gridworld environment whose levels consist of instruction-following tasks that are described by a synthetic language. We generate data for these levels with the built-in BabyAI bot. The bot has access to extra information which is used to execute optimal solutions; see Section C in the appendix of (Chevalier-Boisvert et al., 2018) for more details about the bot. We collect 100,000 episodes for each level.

F.4 DeepMind Control Suite

The DeepMind Control Suite (Tunyasuvunakool et al., 2020; Tassa et al., 2018) is a set of physics-based simulation environments. For each task in the control suite we collect two disjoint sets of data, one using only state features and another using only pixels. We use a D4PG (Barth-Maron et al., 2018) agent to collect data from tasks with state features, and an MPO (Abdolmaleki et al., 2018) based agent to collect data using pixels.

We also collect data for randomized versions of the control suite tasks with a D4PG agent. These versions randomize the actuator gear, joint range, stiffness, and damping, and geom size and density. There are two difficulty settings for the randomized versions. The small setting scales values by a random number sampled from the union of intervals [0.9, 0.95] ∪ [1.05, 1.1]. The large setting scales values by a random number sampled from the union of intervals [0.6, 0.8] ∪ [1.2, 1.4].

F.5 DeepMind Lab

DeepMind Lab is a first-person 3D environment designed to teach agents 3D vision from raw pixel inputs with an egocentric viewpoint, navigation, and planning (Beattie et al.
, 2016). We trained an IMPALA (Espeholt et al., 2018) agent jointly on a set of 18 parent DM Lab levels, which generate maps procedurally for each new episode. Data was collected by executing the agent on these 18 levels, as well as on an additional set of 237 levels handcrafted to test a diverse set of skills.

The 18 parent levels are characterized by a high diversity of generated maps. The differences between the levels are rooted in the hyper-parameters used in the generation process. These hyper-parameters control high-level characteristics such as the types of structures spawned, the difficulty of language instructions, or the presence of specific tools. The parent levels were developed to improve the performance of RL agents trained online on them.

In contrast to the parent levels, each of the additional 237 handcrafted levels uses almost the same map, and the main differences between instances of the same level map are aesthetics such as colors of walls or lighting conditions. The maps are not procedurally generated and were designed to test a diverse set of skills such as walking up stairs or using specific tools. They are similar to the levels presented in Figure 3, Figure 7 and Figure 8 in the aforementioned paper by Beattie et al. (2016).

Additional information on the 18 parent levels (and their relation to the other levels) is presented in detail in the NeurIPS Workshop talk "A Methodology for RL Environment Research" by Daniel Tanis.

In total, we collected data for 255 levels from the DeepMind Lab (18 parent levels and 237 handcrafted levels), 254 of which were used while training Gato. The remaining level was used for out-of-distribution evaluation.
F.6 Procgen Benchmark

Procgen (Cobbe et al., 2020) is a suite of 16 procedurally generated Atari-like environments, which was proposed to benchmark sample efficiency and generalization in reinforcement learning. Data collection was done while training an R2D2 (Kapturowski et al., 2018) agent on each of the environments. We used the hard difficulty setting for all environments except for maze and heist, which we set to easy.

F.7 Modular RL

Modular RL (Huang et al., 2020) is a collection of MuJoCo (Todorov et al., 2012) based continuous control environments, composed of three sets of variants of the OpenAI Gym (Brockman et al., 2016) Walker2d-v2, Humanoid-v2, and Hopper-v2. Each variant is a morphological modification of the original body: the set of morphologies is generated by enumerating all possible subsets of limbs and keeping only those sets that a) contain the torso, and b) still form a connected graph. This results in a set of variants with different input and output sizes, as well as different dynamics than the original morphologies. We collected data by training a single morphology-specific D4PG agent on each variant for a total of 140M actor steps; this was done for 30 random seeds per variant.

F.8 DeepMind Manipulation Playground

The DeepMind Manipulation Playground (Zolna et al.) is a suite of MuJoCo-based simulated robot tasks. We collect data for 4 of the Jaco tasks (box, stack banana, insertion, and slide) using a Critic-Regularized Regression (CRR) agent. The collected data includes the MuJoCo physics state, which we use for training and evaluating Gato (CRR: Wang et al.
, 2020).

F.9 Meta-World

Meta-World (Yu et al., 2020) is a suite of environments for benchmarking meta-reinforcement learning and multi-task learning. We collect data from all train and test tasks in the MT50 mode by training an MPO (Abdolmaleki et al., 2018) agent with unlimited environment seeds and with access to the state of the MuJoCo physics engine. The collected data also contains the MuJoCo physics engine state.

G Real robotics evaluation details

In the real world, control is asynchronous; physics does not wait for computations to finish. Inference latency is therefore a concern when evaluating a large model on real-world tasks. In robotics, a fast control rate is thought to be critical for reacting to dynamic phenomena. The robot setup for RGB stacking has a 20Hz control rate (0.05 second timestep) by design. To reach an acceptable latency margin, we modified inference at evaluation time by shortening the context length to 1. We also implemented a parallel sampling scheme in which all the action tokens are zeroed out in the input sequences during training, so that we can sample tokens …

We use the reward function described in Lee et al. (2021) for data filtering. We only select trajectories with final task success; that is, a sparse reward of 1 on the final timestep.

H Skill Mastery architecture

The numbers reported for the Skill Mastery benchmark were collected by executing, zero-shot, a model that used an earlier version of the Gato architecture. Instead of the ResNet patch embedding, a similar architecture using a local transformer was used to embed image patch tokens. The local position embeddings and patch position embeddings were not used. These changes were implemented and found to improve Gato's performance after the pretraining data was changed (as we decided to focus on the Skill Generalization challenge instead of the Skill Mastery challenge), which is why they are presented as the final architecture of our full model.
I Additional robotics ablations

We conducted a series of ablations in simulation to better understand the effect of diverse pretraining data in the robotics domain (see Figure 19). We included the same baselines as in Section 5.2, selecting the 364M parameter size variant, as well as an additional baseline trained with control suite data only. The DM Control-only agent outperforms the base Gato at transfer …

J Attention visualization

To render the transformer attention weights, we retrieved the cross-attention logits, a tensor with dimension (H, T, T) where H is the number of heads and T is the number of tokens in a sequence. The (h, i, j)th entry of this matrix can be interpreted as the amount that head h attends to token j from token i. Because of Gato's image tokenization scheme, there are multiple tokens per timestep. Therefore, to render the attention for a particular timestep, we took the sub-matrix that corresponds to that timestep. We then applied a softmax over the rows of this matrix to normalize the relevant values. Because we are only interested in attention to previous tokens, we excluded the diagonal by setting it to negative infinity before the softmax.

To measure the importance of each patch, we averaged the attention weights over the corresponding column. Because Gato uses a causal transformer, the attention matrix is lower triangular, so the mean was only taken over the sub-column below the diagonal of the matrix. This corresponds to the average attention paid to a particular patch over a whole timestep.

Using this approach, we found that the attention maps at the first layer of the transformer are the most interpretable, which agrees with the findings of Abnar & Zuidema (2020). Certain heads clearly track task-specific entities and regions of the image. Figure 20 shows the attention maps for manually selected heads at the first layer for several tasks.
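The attention-rendering procedure of Appendix J can be sketched for a single head in plain Python. This is a minimal illustration, not the authors' code: it assumes the logits for one head are available as a nested list, it masks every entry on or above the diagonal (the text only mentions the diagonal, and entries above it are assumed to be excluded by causality), and it assigns a zero row to the first token, which has no earlier token to attend to. The function names are hypothetical.

```python
import math


def timestep_attention(logits, start, end):
    """Row-softmax of one head's logits restricted to tokens [start, end).

    logits[i][j] is how strongly token i attends to token j.
    """
    size = end - start
    weights = []
    for i in range(size):
        # Keep only strictly earlier tokens; mask the diagonal and the future.
        row = [logits[start + i][start + j] if j < i else -math.inf
               for j in range(size)]
        finite = [v for v in row if v != -math.inf]
        if not finite:
            # The first token of the timestep has nothing earlier to attend to.
            weights.append([0.0] * size)
            continue
        peak = max(finite)  # subtract the max for numerical stability
        exps = [math.exp(v - peak) if v != -math.inf else 0.0 for v in row]
        total = sum(exps)
        weights.append([e / total for e in exps])
    return weights


def token_importance(weights):
    """Mean attention each token receives, over entries below the diagonal."""
    size = len(weights)
    scores = []
    for j in range(size):
        below = [weights[i][j] for i in range(j + 1, size)]
        scores.append(sum(below) / len(below) if below else 0.0)
    return scores
```

For image observations, the per-token scores for the patch tokens of one timestep can be reshaped into the patch grid to obtain a heat map like the ones described above.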
K Detailed results for specialist Meta-World agent

The specialist Meta-World agent described in Section 5.5 achieves a 96.6% success rate averaged over all 50 Meta-World tasks. The detailed success rates are presented in Table 7. We evaluated the agent 500 times for each task.

L Per-domain results for Gato

We describe the performance of Gato on simulated control tasks in Section 4.1. In Table 8, we present normalized per-domain results. We evaluated the agent 50 times for each task.

This paper is available on arXiv under the CC BY 4.0 Deed (Attribution 4.0 International) license.