In this post, I share what I learned over a year of working with Reinforcement Learning (RL) on autonomous robotic manipulation. Starting a big project with many moving parts is always hard, and this one was no exception. I want to pass on the knowledge I gathered along the way to help others overcome the initial inertia. In the beginning, it was tough for me to judge the difficulty of the different components of the problem: I underestimated the effort required for some parts and overestimated it for others. I will explain how RL can be cumbersome and straightforward to work with at the same time.

The video below shows how our agent can transfer the learned policy to different domains and environments.

The master's thesis is available at: https://github.com/BarisYazici/tum_masters_thesis/blob/master/final_report.pdf
The GitHub repo is at: https://github.com/BarisYazici/deep-rl-grasping

MANIPULATION is a challenging task for today's robots. Robots still rely on manually designed controllers, each built for a single problem. They lack the following skills:

- Humanlike manipulation
- Performing in unseen environments
- Adapting to new environments or objects
- A generalized representation of object manipulation

Robots in the future should know how to manipulate both a Rubik's Cube and kitchen appliances. They should not need supervision to learn new manipulation skills. The Reinforcement Learning (RL) framework promises end-to-end learning of these skills with no hand-coded controller design.

REINFORCEMENT LEARNING

Reinforcement Learning is a robust framework for learning complex behaviors. It has already shown great success on Atari games and locomotion problems. In particular, underactuated motions like tying shoelaces or putting on a shirt are hard to model and control with traditional methods [1]. RL can tackle these problems by sampling in a simulation and optimizing for the maximum reward.

The biggest challenges for RL are [2]:

- Sample efficiency
- Hyperparameter sensitivity

The following sections address these problems on the way to an autonomous robotic manipulator in simulation. The components below helped to tackle the challenges mentioned above.

KEY TO SUCCESS:

- Curriculum learning
- Shaped reward
- Input and reward normalization
- Raw depth sensor data as input
- Off-policy maximum entropy RL framework, e.g. SAC

Experimental Setup

First of all, let's go through our problem definition. Our training environment features a gripper attempting to grasp randomly drawn objects from the floor. The gripper has neither an arm nor a base, so no inverse kinematics needs to be computed. All training runs and experiments were done in the PyBullet simulation [3].

The observed state originates from the camera mounted on the gripper. The gripper is position controlled with continuous inputs between -1 and 1. An action is represented by [x, y, z, yaw angle, gripper open/close] and describes a movement relative to the gripper's current position. An episode terminates according to the goal of the training task: either when the agent lifts an object from the clutter or when it successfully picks all objects from the ground.

Observation

Observation is the eyes of the RL agent. Humans see and touch objects to map the environment in their brains. Like us, the RL agent needs some input to capture the environment's state. There are a couple of go-to perception types; examples are the RGB images and the autoencoder from [4][5]. In addition, we implemented a depth image observation, which performed better than both reference approaches in our tests. We compared the performance of these observation types: depth image, RGB-D, and autoencoder.

The diagram below shows how we processed the depth observation from the environment to the learning algorithm.
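In code, that processing comes down to some shape bookkeeping. The following is a minimal sketch, not the project's implementation: it assumes the environment returns a 64 x 64 x 2 observation (a depth channel plus the gripper width tiled to the same shape) and that the convolutional encoder outputs 512 features, a number inferred from the final input size of 513. The names `tile_observation`, `split_observation`, `build_policy_input`, and `placeholder_encoder` are hypothetical, and the encoder is a stand-in, not a real network.

```python
from typing import Callable, List, Tuple

IMAGE_SIZE = 64      # depth image is 64 x 64 x 1 (from the post)
FEATURE_SIZE = 512   # assumption: 513 total minus 1 gripper-width value

DepthImage = List[List[float]]                 # 64 x 64
Observation = List[List[Tuple[float, float]]]  # 64 x 64 x 2: (depth, width)


def tile_observation(depth: DepthImage, gripper_width: float) -> Observation:
    """Pair every depth pixel with the same (tiled) gripper width value."""
    return [[(pixel, gripper_width) for pixel in row] for row in depth]


def split_observation(obs: Observation) -> Tuple[DepthImage, float]:
    """Undo the tiling: recover the depth channel and the scalar width."""
    depth = [[pixel[0] for pixel in row] for row in obs]
    gripper_width = obs[0][0][1]  # every pixel carries the same width
    return depth, gripper_width


def placeholder_encoder(depth: DepthImage) -> List[float]:
    """Stand-in for the conv net + fully connected layer (NOT a real CNN)."""
    flat = [pixel for row in depth for pixel in row]
    mean_depth = sum(flat) / len(flat)
    return [mean_depth] * FEATURE_SIZE


def build_policy_input(
    obs: Observation, encoder: Callable[[DepthImage], List[float]]
) -> List[float]:
    """Encode the depth image, then append the gripper width."""
    depth, gripper_width = split_observation(obs)
    return list(encoder(depth)) + [gripper_width]


# Usage with made-up values: a flat depth image and an 8 cm gripper opening.
depth_image = [[0.5] * IMAGE_SIZE for _ in range(IMAGE_SIZE)]
observation = tile_observation(depth_image, gripper_width=0.08)
policy_input = build_policy_input(observation, placeholder_encoder)
print(len(policy_input))
```

Keeping the gripper width out of the convolutional path means the encoder only has to learn image features, while the scalar reaches the policy unchanged.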
Our simulation environment returns an observation with two components:

- Depth image, shape: 64, 64, 1
- Gripper width, original shape: 1, tiled to shape: 64, 64, 1

We separated the depth image observation from the gripper width. We then fed the depth image into a convolutional network with a fully connected layer at the end. Finally, we concatenated the processed depth image with the gripper width information, which gave an input vector of size 513.

SAMPLE EFFICIENCY

In contrast to supervised learning, RL creates the data it optimizes on. The data it creates may not point to the high reward region, and optimizing on such data will not lead to good grasping behavior. Imagine training on ImageNet with falsely labeled images; naturally, the result won't perform well [6].

RL has both advantages and disadvantages when it comes to data creation. In the RL setting, the data is expensive: it takes many simulator or physical robot iterations to create it. On the other hand, we do not need to label the data, so a well-defined agent can explore the environment on its own. In supervised learning, grasping scenarios need to be tediously modeled and labeled, which is quite challenging when the optimal policy is stochastic.

A good policy produces good data, and optimizing on good data leads to a better policy [4]. Because we rely strongly on the agent's random actions for good data, the agent might never explore the environment comprehensively, and the resulting bad data leads to an incompetent policy. So, we aim to steer the agent to the good data region as fast as possible. For this purpose, we used the following techniques:

- Curriculum learning
- Shaped reward function
- Off-policy RL algorithm (SAC)

These techniques contributed to the sample efficiency by creating more meaningful data early on in training.

Curriculum Learning

Curriculum learning governs the difficulty of the environment to facilitate learning. It works like a school curriculum, which teaches arithmetic first and introduces differential calculus later.

Our curriculum strategy gets more challenging as the success rate of the RL agent increases. It modifies the following environment features:

- Lifting height
- Object count
- Limits of the workspace area

In the beginning, it is simpler for the agent to explore the terminal state and the intermediate goals in a comfortable setting. When we slightly change the terminal state, the agent can still extrapolate from what it already knows to a harder environment.

[Figure: Curriculum learning modifies the difficulty of the environment based on the RL agent's success rate.]

[Figure: Agent's success rate with and without curriculum learning.]

Shaped Reward

The shaped reward function has the same purpose as curriculum learning. It motivates the agent to explore the high reward region. Through intermediate rewards, it steers the agent to the terminal state.

The agent receives an intermediate reward when it grasps an object. As soon as the agent lifts the object to a terminal state, it gets the terminal reward. We apply a time penalty until it reaches the terminal state. The sum of the intermediate reward and the time penalty must stay smaller than zero until the terminal state. Otherwise, the agent would exploit the intermediate reward and wait until the episode's end to get to the terminal state.

As mentioned before, a shaped reward serves to lead the agent to the good data region. Good data yields a better policy, and the two reinforce each other during learning to deliver the optimal policy.

Normalization

I think of normalization as the activator of the observation and shaped reward functions. Without normalization, the agent is unlikely to make sense of the inputs and rewards that are fed to the neural nets, especially when the input has different components and the reward isn't sparse. Our environment's state is composed of the depth sensor input and the gripper width information.
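One common way to bring such components onto the same scale is to track a running mean and standard deviation for each of them. The sketch below illustrates that idea under stated assumptions; it is not the normalization code used in the project. The `RunningNormalizer` class is hypothetical, it uses Welford's online algorithm, and the gripper width values in the usage lines are made up.

```python
import math


class RunningNormalizer:
    """Running mean and standard deviation of one scalar stream (Welford)."""

    def __init__(self, epsilon: float = 1e-8) -> None:
        self.count = 0
        self.mean = 0.0
        self._m2 = 0.0          # running sum of squared deviations
        self.epsilon = epsilon  # avoids division by zero

    def update(self, value: float) -> None:
        """Fold one new value into the running statistics."""
        self.count += 1
        delta = value - self.mean
        self.mean += delta / self.count
        self._m2 += delta * (value - self.mean)

    @property
    def std(self) -> float:
        """Population standard deviation; 1.0 until two values are seen."""
        if self.count < 2:
            return 1.0
        return math.sqrt(self._m2 / self.count)

    def normalize(self, value: float) -> float:
        """Shift and scale a value to roughly zero mean and unit variance."""
        return (value - self.mean) / (self.std + self.epsilon)


# Usage with hypothetical gripper widths in metres. A second instance could
# track depth values, and a third the rewards.
width_normalizer = RunningNormalizer()
for width in [0.02, 0.04, 0.04, 0.04, 0.05, 0.05, 0.07, 0.09]:
    width_normalizer.update(width)
print(width_normalizer.normalize(0.09))
```

With one normalizer per component, a depth value and a gripper width both reach the network in comparable units, regardless of their raw ranges.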
An unnormalized state representation can put a false emphasis on some state components, giving more weight to the gripper width information than to the depth sensor data, or vice versa. Normalization scales the observation components to the same level.

Learning Algorithm - SAC

"RL uses training information that evaluates the actions taken rather than instructs by giving correct actions. This is what creates the need for active exploration, for an explicit search for good behavior."
(R. Sutton, Introduction to Reinforcement Learning)

Exploration is innate in RL. Uncertainty in the estimation of the action values is unavoidable. Especially in our environment, where the reward distribution over actions has a huge variance, we need a sophisticated exploration strategy [7]. SAC masters the exploration-exploitation trade-off in RL. We expect an RL algorithm to find a balance between exploring and exploiting. Getting this balance right can be the difference between finding the optimal policy and getting stuck at a sub-optimal one.

Exploration describes how willing the agent is to try new actions, while exploitation describes how confident it is in taking a specific action. In most cases, the two concepts are firmly connected. If we explore enough, we may find newer, better actions that return more reward. Still, once we are confident enough about the action-value estimates, we should stop exploring and start exploiting the greedy actions.

SAC models the RL problem as maximizing not just the expected reward but also the expected entropy at the same time. This provides the following advantages:

- Optimum entropy provides enhanced exploration behavior
- Reduced hyperparameter sensitivity

[Figure: The entropy maximization RL framework optimizes for both reward and entropy at the same time.]

SAC was the most robust algorithm we used. It required minimal hyperparameter tuning and sampling. The off-policy nature of the SAC algorithm enables us to use samples collected by different policies.
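Reusing samples across policies is usually done with a replay buffer. The sketch below is a minimal, generic version built on the Python standard library. The `Transition` and `ReplayBuffer` names, the method names, and the tiny capacity in the usage lines are assumptions for illustration; this is not the buffer from the project or from any RL library.

```python
import random
from collections import deque
from typing import Deque, List, NamedTuple, Sequence


class Transition(NamedTuple):
    """One environment step, as stored by an off-policy algorithm."""
    state: Sequence[float]
    action: Sequence[float]
    reward: float
    next_state: Sequence[float]
    done: bool


class ReplayBuffer:
    """Fixed-size FIFO store that serves uniformly sampled mini-batches."""

    def __init__(self, capacity: int, seed: int = 0) -> None:
        # A deque with maxlen silently drops the oldest item when full.
        self._storage: Deque[Transition] = deque(maxlen=capacity)
        self._rng = random.Random(seed)

    def add(self, transition: Transition) -> None:
        """Store a transition, evicting the oldest one if at capacity."""
        self._storage.append(transition)

    def sample(self, batch_size: int) -> List[Transition]:
        """Draw a batch without replacement, whichever policy produced it."""
        return self._rng.sample(list(self._storage), batch_size)

    def __len__(self) -> int:
        return len(self._storage)


# Usage: a capacity of 3 keeps only the three most recent transitions.
buffer = ReplayBuffer(capacity=3)
for step in range(5):
    buffer.add(
        Transition(
            state=[float(step)],
            action=[0.0],
            reward=float(step),
            next_state=[float(step + 1)],
            done=False,
        )
    )
batch = buffer.sample(batch_size=2)
print(len(buffer), [t.reward for t in batch])
```

Because sampling ignores which policy generated a transition, old experience keeps contributing to updates until it is evicted, which is where the sample efficiency of off-policy methods comes from.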
Because SAC is off-policy, we can store the samples in a replay buffer and reuse them many times. Similar to supervised learning, we draw batches of samples to find optimal actions. We stored 1 million samples in the buffer, which allocated around 50GB of RAM. Be careful if you want to replicate our results; check that you have enough RAM on your machine.

Off-policy algorithms proved to be more sample efficient than their on-policy counterparts, where we throw away the data used in each episode and create new experiences for new episodes.

Training Setup

We have two different training scenarios:

- Single object picking from clutter
- Table clearing

In the single object picking from clutter setup, the gripper needs to lift one random object to a predefined height threshold to end the episode successfully. In the table clearing setup, it needs to lift each object in the environment to the same height threshold.

The two scenes required different hyperparameters. For example, we needed to decrease the starting object count from three to one for the table clearing scene. We also increased the neural network layers and the buffer size to match the increased complexity of the behavior.

We aim to get the most generalized grasping model. This model should perform well with unseen objects and adapt to new domains. That's why we designed two test environments:

- A scene with objects in a tray on the table.
- A Kuka robot on the table, with our trained gripper model mounted on the last link of the robot.

With these different test setups, we can assess whether the model generalizes and adapts to new scenes and domains.

RESULTS

1. Depth Sensor Input Performs the Best:

We tested autoencoder, depth, and RGB-D input. Based on our tests, depth input performed the best. We believe the difference between autoencoder and depth perception lies in the interpretation loss of the depth image. Autoencoders compress the observation into a latent space. This compression causes the agent to misinterpret the depth of the objects. The depth perception layer, on the other hand, is an online method; it corrects its network weights when a wrong interpretation occurs. The online perception layer also complies with the end-to-end nature of the RL framework. Our depth perception layer's weights are updated to deliver a better grasping policy; the autoencoder's weights are immutable throughout learning.

[Figure: Depth converged to a greater success rate than autoencoder perception.]

[Figure: Depth converged faster and to a greater success rate than RGB-D perception.]

2. Buffer Size Matters:

Although SAC is robust to different hyperparameter selections, we still adjusted some of the learning parameters, such as the buffer size, to achieve better results. Buffer size is a critical hyperparameter that directly affects the performance of the agent. The agent needs enough samples/experiences in the buffer to learn, similar to supervised learning datasets. A larger buffer is usually worthwhile for complex behaviors, where exploration is a big part of the learning.

[Figure: The 1m buffer converged to a better success rate and showed less variance than the 50k buffer.]

3. Table Clearing Task vs. Single Object Picking from Clutter:

Different manipulation skills demand different hyperparameter tuning. More complex behaviors require a larger buffer size and larger neural network layers. For example, the hyperparameters we used for single object picking from clutter did not work correctly for the table clearing task. We needed to increase the buffer size from 1m to 2m, and the neural network layer size from 64 to 128.

Aside from the neural network's hyperparameters, we also changed the curriculum strategy's object count parameter from three to one. The agent in the table clearing task couldn't explore the terminal state with three objects at the beginning of the training. Therefore, we had to decrease the object count to smooth the transition from an easy setting to a more challenging environment.

SUMMARY

To sum up, in this article we covered how to approach the robotic bin picking problem with the help of RL. We mentioned the importance of:

- Leading the agent to the good data region as fast as possible
- Leveraging old experiences with off-policy updates
- Normalization of the observation and reward
- Raw depth pixels as observation to ensure end-to-end learning

In general, RL can be cumbersome to work with because it's hard to debug. It's always good to start out simple. Try implementing a simplified version of your custom environment. First, check that baseline RL algorithms can learn the simplified version. Then, gradually make the environment harder to see which parameters make the RL agent struggle to learn. This way, you can make sure the agent's learning is not bottlenecked, and you will not be stressed out watching your agent suffer :)

Readers interested in learning more about RL can check out the resources below:

- John Schulman gives practical advice for RL training: https://www.youtube.com/watch?v=8EcdaCk9KaQ&t=2409s
- Deep RL Bootcamp: https://sites.google.com/view/deep-rl-bootcamp/lectures
- Berkeley RL course from Sergey Levine: http://rail.eecs.berkeley.edu/deeprlcourse/
- From Prof. Pieter Abbeel: https://youtu.be/OMraS0GRWK0?t=1258
- The legendary course from David Silver: https://youtu.be/2pWv7GOvuf0
- Blog post from Andrej Karpathy: http://karpathy.github.io/2016/05/31/rl/
- Russ Tedrake's underactuated robotics course: http://underactuated.csail.mit.edu/rl_policy_search.html

References

[1] Tedrake, R. Underactuated Robotics: Algorithms for Walking, Running, Swimming, Flying, and Manipulation (Course Notes for MIT 6.832). Downloaded on 19/12/2020 from http://underactuated.mit.edu/
[2] Haarnoja, T., Zhou, A., Hartikainen, K., Tucker, G., Ha, S., Tan, J., Kumar, V., Zhu, H., Gupta, A., Abbeel, P., & Levine, S. (2018). Soft Actor-Critic Algorithms and Applications. http://arxiv.org/abs/1812.05905
[3] Coumans, E., & Bai, Y. PyBullet, a Python module for physics simulation for games, robotics and machine learning. 2016–2020. http://pybullet.org
[4] Kalashnikov, D., Irpan, A., Pastor, P., Ibarz, J., Herzog, A., Jang, E., Quillen, D., Holly, E., Kalakrishnan, M., Vanhoucke, V., & Levine, S. (2018). QT-Opt: Scalable Deep Reinforcement Learning for Vision-Based Robotic Manipulation. CoRL, 1–23. http://arxiv.org/abs/1806.10293
[5] Breyer, M., Furrer, F., Novkovic, T., Siegwart, R., & Nieto, J. (2018). Comparing Task Simplifications to Learn Closed-Loop Object Picking Using Deep Reinforcement Learning. IEEE Robotics and Automation Letters, 4(2), 1549–1556. https://doi.org/10.1109/LRA.2019.2896467
[6] Eysenbach, B., Kumar, A., & Gupta, A. (2020, October 13). Reinforcement learning is supervised learning on optimized data. bair.berkeley.edu. https://bair.berkeley.edu/blog/2020/10/13/supervised-rl/
[7] Sutton, R. S., & Barto, A. G. (2018). Reinforcement Learning: An Introduction, Second Edition (Complete Draft). The MIT Press.

Also published at https://towardsdatascience.com/sample-efficient-robot-training-on-pybullet-simulation-with-sac-algorithm-71d5d1d4587f