Table of Links
3.2 Learning Residual Policies from Online Correction
3.3 An Integrated Deployment Framework and 3.4 Implementation Details
4.2 Quantitative Comparison on Four Assembly Tasks
4.3 Effectiveness in Addressing Different Sim-to-Real Gaps (Q4)
4.4 Scalability with Human Effort (Q5) and 4.5 Intriguing Properties and Emergent Behaviors (Q6)
6 Conclusion and Limitations, Acknowledgments, and References
A. Simulation Training Details
B. Real-World Learning Details
C. Experiment Settings and Evaluation Details
D. Additional Experiment Results
3.3 An Integrated Deployment Framework
3.4 Implementation Details
We use Isaac Gym [10] as the simulation backend. Teacher policies are trained from scratch with proximal policy optimization (PPO) [84]. We design task-specific reward functions and curricula when necessary to facilitate RL training. We apply exhaustive domain randomization during teacher policy training and appropriate data augmentation during student policy distillation. Student policies are parameterized as Gaussian Mixture Models (GMMs) [68]. We also experimented with other state-of-the-art policy models, such as Diffusion Policy [85], but did not observe better performance. See Appendix Sec. A for more details about the simulation training phase and additional comparisons. During the human-in-the-loop data collection phase, we use a 3Dconnexion SpaceMouse as the teleoperation interface. Residual policies use state-of-the-art point cloud encoders, such as PointNet [86] and Perceiver [87, 88], with a GMM action head. We follow best practices to train residual policies, including learning rate warm-up and cosine annealing [89]. More training hyperparameters are provided in Appendix Sec. B.4.
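To make the GMM parameterization concrete, below is a minimal PyTorch sketch of a GMM action head of the kind used for the student and residual policies. The class name, layer sizes, number of mixture components, and feature dimensions are illustrative assumptions, not the paper's exact architecture; the input features are assumed to come from a point cloud encoder such as PointNet.

```python
import torch
import torch.nn as nn
from torch.distributions import Categorical, Independent, MixtureSameFamily, Normal


class GMMActionHead(nn.Module):
    """Hypothetical GMM action head: maps encoder features to a mixture of
    Gaussians over actions. Sizes and component count are assumptions."""

    def __init__(self, feat_dim: int, action_dim: int, num_modes: int = 5):
        super().__init__()
        self.num_modes = num_modes
        self.action_dim = action_dim
        # Separate linear projections for mixture logits, means, and log-stds.
        self.logits = nn.Linear(feat_dim, num_modes)
        self.means = nn.Linear(feat_dim, num_modes * action_dim)
        self.log_stds = nn.Linear(feat_dim, num_modes * action_dim)

    def forward(self, feats: torch.Tensor) -> MixtureSameFamily:
        b = feats.shape[0]
        mix = Categorical(logits=self.logits(feats))
        means = self.means(feats).view(b, self.num_modes, self.action_dim)
        stds = self.log_stds(feats).view(b, self.num_modes, self.action_dim).exp()
        # Each mixture component is a diagonal Gaussian over the action vector.
        comp = Independent(Normal(means, stds), 1)
        return MixtureSameFamily(mix, comp)


# Usage sketch: behavior-cloning-style loss on demonstrated actions,
# with features from a (hypothetical) point cloud encoder.
head = GMMActionHead(feat_dim=256, action_dim=7)
feats = torch.randn(32, 256)          # stand-in for PointNet/Perceiver features
actions = torch.randn(32, 7)          # stand-in for expert/corrective actions
dist = head(feats)
loss = -dist.log_prob(actions).mean() # maximize action log-likelihood
```

A GMM head keeps the policy multimodal, which is one common reason to prefer it over a single-Gaussian output when demonstrations contain multiple valid ways to perform a correction.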
Authors:
(1) Yunfan Jiang, Department of Computer Science;
(2) Chen Wang, Department of Computer Science;
(3) Ruohan Zhang, Department of Computer Science and Institute for Human-Centered AI (HAI);
(4) Jiajun Wu, Department of Computer Science and Institute for Human-Centered AI (HAI);
(5) Li Fei-Fei, Department of Computer Science and Institute for Human-Centered AI (HAI).
This paper is