Using Reinforcement Learning to Build a Self-Learning Grasping Robot  by@bayaz


baris yazici

Teaching robots to manipulate objects. Also rock climbing

In this post, I will explain my experience over the course of a year of working with Reinforcement Learning (RL) on autonomous robotics manipulation. It is always hard to start a big project which requires many moving parts. It was undoubtedly the same in this project. I want to pass the knowledge I gathered through this process to help others overcome the initial inertia.

In the beginning, it was tough for me to judge the difficulty of different components of the problem. I underestimated the effort required for some parts and overestimated others. I will explain how RL can be cumbersome and straightforward to work with at the same time.

The video below shows how our agent can transfer the learned policy to different domains and different environments.

Master’s thesis is accessible:

And the link to the GitHub repo:


Humanlike manipulation is a challenging task for today’s robots. Most robots still rely on a manually designed controller built for one specific problem. They lack the following skills:

  1. Performing in unseen environments
  2. Adapting to new environments or objects
  3. Forming a generalized representation of object manipulation

Robots in the future should know how to manipulate both a Rubik’s Cube and kitchen appliances. They should not need supervision to learn new manipulation skills. The Reinforcement Learning (RL) framework promises end-to-end learning of these skills with no hand-coded controller design.


Reinforcement Learning is a robust framework for learning complex behaviors. It has already shown great success on Atari games and locomotion problems. In particular, underactuated motions such as tying shoelaces or putting on a shirt are hard to model and control with traditional methods [1]. RL can tackle these problems by sampling in a simulation and optimizing for the maximum reward.

The biggest challenges for RL are [2]:

  1. Sample efficiency
  2. Hyperparameter sensitivity

The following sections address these problems in order to build an autonomous robotic manipulator in simulation. The components below helped to tackle these challenges:


  1. Curriculum learning
  2. Shaped reward
  3. Input and reward normalization
  4. Raw depth sensor data as input
  5. Off-policy maximum entropy RL framework — e.g. SAC

Experimental Setup

First of all, let’s go through our problem definition. Our training environment features a gripper attempting to grasp randomly drawn objects from the floor. The gripper is modeled without an arm or a base, so no inverse kinematics needs to be computed. All training runs and experiments were done in the PyBullet simulator [3].

The observed state originates from the camera mounted on the gripper. The gripper is position controlled with continuous inputs between -1 and 1. An action is represented by [x, y, z, yaw angle, gripper open/close] and describes a movement relative to the gripper’s current position. An episode terminates according to the goal of the training task: either when the agent lifts an object from the clutter or when it has picked all objects from the ground.
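As a rough illustration of the relative action described above, the sketch below maps a normalized 5-dimensional action to a new gripper target. The step limits `MAX_TRANSLATION` and `MAX_YAW`, the function name `apply_action`, and the sign convention for opening the gripper are assumptions made for this example, not values taken from the project.

```python
import numpy as np

# Hypothetical per-step limits; the article does not state the actual values.
MAX_TRANSLATION = 0.03  # metres per step (assumed)
MAX_YAW = 0.15          # radians per step (assumed)


def apply_action(pose, action):
    """Map a normalized action in [-1, 1]^5 to a new gripper target.

    pose   -- current (x, y, z, yaw) of the gripper
    action -- [dx, dy, dz, dyaw, grip], each nominally in [-1, 1]
    Returns the new (x, y, z, yaw) target and a flag that is True when the
    gripper should open (assumed convention: positive value means open).
    """
    action = np.clip(np.asarray(action, dtype=np.float64), -1.0, 1.0)
    pose = np.asarray(pose, dtype=np.float64)
    # The action is a movement relative to the current pose, not an absolute goal.
    delta = np.concatenate([action[:3] * MAX_TRANSLATION, [action[3] * MAX_YAW]])
    target = pose + delta
    open_gripper = bool(action[4] > 0.0)
    return target, open_gripper
```

Clipping before scaling keeps an out-of-range policy output from commanding a larger step than intended.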



Observation is the eyes of the RL agent. Humans see and touch objects to map the environment in their brains. Like us, the RL agent needs some input to capture the environment’s state.

There are a couple of go-to perception types. Some examples are the RGB images and the autoencoder used in [4][5]. In addition, we implemented a depth image observation, which performed better than both approaches from the reference works in our experiments. We compared the performance of these observation types:

  1. RGB-D
  2. Depth image
  3. Autoencoder

The diagram below shows how we processed the depth observation on its way from the environment to the learning algorithm. Our simulation environment returns an observation with two components:

  1. Depth image, shape: 64, 64, 1
  2. Gripper width, original shape: 1, tiled to shape: 64, 64, 1

We separated the depth image observation from the gripper width. We then fed the depth image into a convolutional network with a fully connected layer at the end. Finally, we concatenated the processed depth image with the gripper width, which yields a feature vector of length 513.
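The shape bookkeeping can be sketched as follows. This example assumes the two components arrive stacked along the channel axis as a (64, 64, 2) array and that the encoder outputs 512 features, since 512 + 1 = 513. The real convolutional network is replaced here by a single linear projection as a stand-in, and all function names are hypothetical.

```python
import numpy as np

FEATURE_DIM = 512  # assumed: 512 image features + 1 gripper width = 513


def split_observation(obs):
    """Separate a (64, 64, 2) observation into depth image and gripper width."""
    depth = obs[:, :, :1]  # shape (64, 64, 1)
    width = obs[0, 0, 1]   # the width is tiled, so any pixel of channel 1 will do
    return depth, width


def encode_depth(depth, weights):
    """Stand-in for the convolutional encoder: one linear projection plus tanh."""
    return np.tanh(depth.reshape(-1) @ weights)  # shape (FEATURE_DIM,)


def build_features(obs, weights):
    """Encode the depth image, then append the scalar gripper width."""
    depth, width = split_observation(obs)
    return np.concatenate([encode_depth(depth, weights), [width]])


rng = np.random.default_rng(0)
depth_image = rng.random((64, 64, 1))
gripper_width = 0.08  # arbitrary example value
obs = np.concatenate([depth_image, np.full((64, 64, 1), gripper_width)], axis=2)
weights = rng.normal(scale=0.01, size=(64 * 64, FEATURE_DIM))
features = build_features(obs, weights)
print(features.shape)
```

Keeping the gripper width out of the image encoder means the scalar reaches the policy unchanged rather than being mixed into the image features.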



In contrast to supervised learning, RL creates its own training data. If that data does not come from the high reward region, optimizing on it will not lead to good grasping behavior. Imagine training on ImageNet with mislabeled images; naturally, the model won’t perform well [6].

RL has both advantages and disadvantages when it comes to data creation. In the RL setting, the data is expensive: it takes many simulator or physical robot iterations to create it. On the other hand, we do not need to label the data, so a well-defined agent can explore the environment on its own. In supervised learning, grasping scenarios need to be tediously modeled and labeled, which is quite challenging when the optimal policy is stochastic.

A good policy generates good data, and optimizing on this data leads to a better policy [4]. However, because we rely strongly on the agent’s random actions to produce good data, the agent might never explore the environment comprehensively, and the resulting bad data leads to an incompetent policy.

So, we aim to steer the agent toward the good data region as fast as possible. For this purpose, we used the following techniques:

  1. Curriculum learning
  2. Shaped reward function
  3. Off-policy RL algorithm (SAC)

All three contributed to sample efficiency by creating more meaningful data early in training.

Curriculum Learning


Curriculum learning governs the difficulty of the environment to facilitate learning. It works like a school curriculum, which teaches arithmetic first and introduces differential calculus later. Our curriculum becomes more challenging as the success rate of the RL agent increases. The curriculum strategy modifies the following environment features:

  1. Lifting height
  2. Object count
  3. Limits of the workspace area

In the beginning, it is simpler for the agent to discover the terminal state and the intermediate goals in an easy setting. When we then change the terminal state slightly, the agent can still extrapolate from what it already knows to the harder environment.
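One simple way to tie difficulty to the success rate is a set of discrete levels with a promotion threshold, sketched below. The level values, the window length, and the 0.7 threshold are all hypothetical choices for illustration; the project may interpolate the parameters differently.

```python
from collections import deque
from dataclasses import dataclass


@dataclass
class Level:
    lift_height: float            # metres the object must be lifted (assumed)
    object_count: int             # objects spawned in the scene (assumed)
    workspace_half_extent: float  # metres, limits of the workspace (assumed)


LEVELS = [Level(0.05, 1, 0.10), Level(0.10, 3, 0.15), Level(0.20, 5, 0.25)]


class Curriculum:
    """Promote to the next level once the recent success rate is high enough."""

    def __init__(self, levels, window=100, promote_at=0.7):
        self.levels = levels
        self.index = 0
        self.promote_at = promote_at
        self.history = deque(maxlen=window)

    @property
    def level(self):
        return self.levels[self.index]

    def record(self, success):
        """Store one episode outcome and return the level for the next episode."""
        self.history.append(bool(success))
        full = len(self.history) == self.history.maxlen
        rate = sum(self.history) / len(self.history)
        if full and rate >= self.promote_at and self.index < len(self.levels) - 1:
            self.index += 1
            self.history.clear()  # measure the success rate afresh on the new level
        return self.level
```

The environment would read `curriculum.level` at every reset and call `record` at the end of each episode.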


Curriculum learning modifies the difficulty of the environment based on the RL agent’s success rate.


Agent’s success rate with curriculum learning and without curriculum learning.

Shaped Reward


The shaped reward function has the same purpose as curriculum learning. It motivates the agent to explore the high reward region. Through intermediate rewards, it steers the agent to the terminal state.

The agent receives an intermediate reward when it grasps an object. As soon as the agent lifts the object to the terminal state, it gets the terminal reward. We apply a time penalty until it reaches the terminal state. The sum of the intermediate reward and the time penalty must stay below zero until the terminal state. Otherwise, the agent would exploit the intermediate reward and stall until the end of the episode instead of finishing the task.

As mentioned before, a shaped reward serves to lead the agent to the good data region. Good data yields a better policy, and a better policy yields better data; the two reinforce each other during learning on the way to the optimal policy.
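The constraint on the intermediate reward and the time penalty can be made explicit in code. The sketch below uses made-up reward magnitudes and a hypothetical function name; only the structure (terminal reward, grasp bonus, time penalty, and a negative sum before the terminal state) follows the description above.

```python
TERMINAL_REWARD = 100.0  # assumed values, not taken from the project
GRASP_REWARD = 0.5
TIME_PENALTY = -1.0

# Holding an object without lifting it must never pay off.
assert GRASP_REWARD + TIME_PENALTY < 0


def shaped_reward(is_grasping, reached_terminal):
    """Return the reward for one step."""
    if reached_terminal:
        return TERMINAL_REWARD
    reward = TIME_PENALTY
    if is_grasping:
        reward += GRASP_REWARD  # still negative overall, so stalling is not rewarded
    return reward
```

With these numbers, an agent that grasps and then waits 50 steps collects -25, so finishing quickly is always better than stalling.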


I think of normalization as the activator of the observation and shaped reward functions. Without normalization, the agent is unlikely to make sense of the inputs and rewards that are fed to the neural nets, especially when the input has several components and the reward isn’t sparse.

Our environment’s state is composed of the depth sensor input and the gripper width. An unnormalized state representation can put a false emphasis on one of the state components, giving more weight to the gripper width than to the depth sensor data or vice versa. Normalization scales the observation components to the same level.
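A common way to do this is to keep running estimates of the mean and variance and to standardize with them. The sketch below is one such normalizer, not necessarily the one used in the project; the class name, the clipping range, and the epsilon are assumptions.

```python
import numpy as np


class RunningNormalizer:
    """Track a running mean/variance and standardize inputs with them."""

    def __init__(self, shape=(), eps=1e-8, clip=10.0):
        self.mean = np.zeros(shape)
        self.var = np.ones(shape)
        self.count = 0
        self.eps = eps
        self.clip = clip

    def update(self, batch):
        """Merge the statistics of a batch (first axis = samples)."""
        batch = np.asarray(batch, dtype=np.float64)
        n = batch.shape[0]
        b_mean = batch.mean(axis=0)
        b_var = batch.var(axis=0)
        total = self.count + n
        delta = b_mean - self.mean
        # Parallel-variance merge of the old statistics and the batch statistics.
        m2 = self.var * self.count + b_var * n + delta**2 * self.count * n / total
        self.mean = self.mean + delta * n / total
        self.var = m2 / total
        self.count = total

    def __call__(self, x):
        z = (np.asarray(x) - self.mean) / np.sqrt(self.var + self.eps)
        return np.clip(z, -self.clip, self.clip)
```

One instance per component (depth pixels, gripper width, reward) would keep a large-valued component from dominating the others.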


Learning Algorithm - SAC

“RL uses training information that evaluates the actions taken rather than instructs by giving correct actions — This is what creates the need for active exploration, for an explicit search for good behavior.” – Introduction to Reinforcement Learning — R. Sutton

Exploration is innate in RL. Uncertainty in the estimates of the action values is unavoidable. Especially in our environment, where the reward distribution over actions has a huge variance, we need a sophisticated exploration strategy [7].

SAC handles the exploration-exploitation trade-off in RL well. We expect an RL algorithm to find a balance between exploring and exploiting. That balance can make the difference between finding the optimal policy and getting stuck at a sub-optimal one.

Exploration describes how willing the agent is to try new actions, while exploitation describes how confidently it commits to a specific action. In most cases, these two concepts are closely connected. If we explore enough, we may find new, better actions that return more reward. Still, once we are confident enough about the action-value estimates, we should stop exploring and start exploiting the greedy actions.

SAC formulates the RL problem as maximizing not only the expected reward but also the expected entropy of the policy. This provides the following advantages:

  1. Optimum entropy provides enhanced exploration behavior
  2. Reduced hyperparameter sensitivity

Entropy maximization RL framework optimizes both for reward and entropy at the same time
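For reference, the maximum entropy objective from the SAC paper [2] can be written as below. Here α is the temperature that weights the entropy term against the reward, ρ_π is the state-action distribution induced by the policy π, and H is the entropy of the policy at state s_t. Setting α to zero recovers the standard expected-reward objective.

```latex
J(\pi) = \sum_{t=0}^{T} \mathbb{E}_{(s_t, a_t) \sim \rho_\pi}
         \left[ r(s_t, a_t) + \alpha \, \mathcal{H}\big(\pi(\cdot \mid s_t)\big) \right],
\qquad
\mathcal{H}\big(\pi(\cdot \mid s_t)\big)
  = - \mathbb{E}_{a \sim \pi(\cdot \mid s_t)} \left[ \log \pi(a \mid s_t) \right]
```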

SAC is the most robust algorithm we used. It required minimal hyperparameter tuning and sampling. The off-policy nature of SAC lets us use samples collected by different policies. Therefore, we can store the samples in a replay buffer and reuse them many times. Similar to supervised learning, we draw batches of samples to find optimal actions. We stored 1 million samples in the buffer, which took up around 50GB of RAM. If you want to replicate our results, check that your machine has enough RAM.
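To get a feel for why the buffer is so memory hungry, here is a minimal ring-buffer sketch with a footprint estimate. It assumes float32 observations of shape (64, 64, 2), a 5-dimensional action, and that both the observation and the next observation are stored for every transition. The class and function names are hypothetical, and the storage layout of the library used in the project may differ.

```python
import numpy as np


class ReplayBuffer:
    """Fixed-size ring buffer with uniform sampling."""

    def __init__(self, capacity, obs_shape, action_dim, dtype=np.float32):
        self.capacity = capacity
        self.obs = np.zeros((capacity, *obs_shape), dtype=dtype)
        self.next_obs = np.zeros((capacity, *obs_shape), dtype=dtype)
        self.actions = np.zeros((capacity, action_dim), dtype=np.float32)
        self.rewards = np.zeros(capacity, dtype=np.float32)
        self.dones = np.zeros(capacity, dtype=np.bool_)
        self.pos = 0
        self.size = 0

    def add(self, obs, action, reward, next_obs, done):
        i = self.pos
        self.obs[i], self.actions[i], self.rewards[i] = obs, action, reward
        self.next_obs[i], self.dones[i] = next_obs, done
        self.pos = (self.pos + 1) % self.capacity  # overwrite the oldest when full
        self.size = min(self.size + 1, self.capacity)

    def sample(self, batch_size, rng):
        idx = rng.integers(0, self.size, size=batch_size)
        return (self.obs[idx], self.actions[idx], self.rewards[idx],
                self.next_obs[idx], self.dones[idx])

    def nbytes(self):
        arrays = (self.obs, self.next_obs, self.actions, self.rewards, self.dones)
        return sum(a.nbytes for a in arrays)


def estimate_bytes(capacity, obs_shape, action_dim, itemsize=4):
    """Footprint without allocating: two observations, action, reward, done flag."""
    per_obs = int(np.prod(obs_shape)) * itemsize
    return capacity * (2 * per_obs + action_dim * 4 + 4 + 1)


print(estimate_bytes(1_000_000, (64, 64, 2), 5) / 1e9, "GB")
```

Under these assumptions, 1 million transitions come to about 65.6GB, the same order of magnitude as the figure reported above. The exact number depends on the observation shape and data type that are actually stored.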

Off-policy algorithms proved to be more sample efficient than their on-policy counterparts, which throw away the data used in each episode and create new experiences for new episodes.

Training Setup


We have two different training scenarios:

  1. Single object picking from clutter
  2. Table clearing

In the single object picking from clutter setup, the gripper needs to lift one random object to a predefined height threshold to end the episode successfully. In the table clearing setup, it needs to lift every object in the environment to the same height threshold.

The two scenes required different hyperparameters. For example, we needed to decrease the starting object count from three to one for the table clearing scene. We also increased the neural network layer size and the buffer size to match the increased complexity of the behavior.

We aim to get the most generalized grasping model. This model should perform well with unseen objects and adapt to new domains. That’s why we designed two test environments.

The first is a scene with objects in a tray on a table. The second places a Kuka robot on the table, with our trained gripper model mounted on the last link of the robot.

With these different test setups, we can assess whether the model generalizes and adapts to new scenes and domains.



1. Depth Sensor Input Performs the Best:

We tested autoencoder, depth, and RGB-D inputs. Based on our tests, depth input performed the best. We believe the difference between autoencoder and depth perception lies in the information lost when interpreting the depth image. Autoencoders compress the observation into a latent space, and this compression causes the agent to misinterpret the depth of the objects.

The depth perception layer, on the other hand, is trained online; it corrects its network weights when a wrong interpretation occurs. The online perception layer also fits the end-to-end nature of the RL framework. Our depth perception layer’s weights are updated to deliver a better grasping policy, whereas the autoencoder’s weights stay fixed throughout learning.

Depth converged to a greater success rate than auto-encoder perception

Depth converged faster and to a greater success rate than RGBD perception

2. Buffer Size Matters:

Although SAC is robust to different hyperparameter selections, we still tuned some of the learning parameters, such as the buffer size, to achieve better results. Buffer size is a critical hyperparameter that directly affects the performance of the agent. The agent needs enough samples/experiences in the buffer to learn from, similar to the datasets in supervised learning. Usually, for complex behaviors where exploration is a big part of the learning, a larger buffer size makes sense.

The 1m buffer converged to a better success rate with less variance than the 50k buffer

3. Table Clearing Task vs. Single Object Picking from Clutter:

Different manipulation skills demand different hyperparameter tuning. More complex behaviors require a larger buffer and larger neural network layers. For example, the hyperparameters we used for single object picking from clutter did not work well for the table clearing task. We needed to increase the buffer size from 1m to 2m, and the neural network layer size from 64 to 128.

Aside from the neural network’s hyperparameters, we also changed the curriculum strategy’s object count parameter from three to one. The agent in the table clearing task couldn’t reach the terminal state with three objects at the beginning of training. Therefore, we had to decrease the object count to smooth the transition from an easy setting to a more challenging environment.
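The differences between the two tasks can be summarized as a pair of configurations. The values below are the ones stated above; the `TrainConfig` class and its field names are hypothetical and only serve to put the numbers side by side.

```python
from dataclasses import dataclass, replace


@dataclass(frozen=True)
class TrainConfig:
    buffer_size: int         # number of transitions kept in the replay buffer
    layer_size: int          # neural network layer size
    start_object_count: int  # objects in the scene at the start of the curriculum


# Single object picking from clutter.
single_pick = TrainConfig(buffer_size=1_000_000, layer_size=64, start_object_count=3)

# Table clearing: larger buffer and layers, easier starting scene.
table_clearing = replace(
    single_pick, buffer_size=2_000_000, layer_size=128, start_object_count=1
)

print(single_pick)
print(table_clearing)
```

Deriving the second configuration from the first keeps the two setups identical except for the parameters that were actually changed.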


To sum up, in this article we covered how to approach the robotic bin picking problem with the help of RL. We mentioned the importance of:

  • Leading the agent to the good data region as fast as possible
  • Leveraging old experiences with off-policy updates
  • Normalization of the observation and reward
  • Raw depth pixel as observation to ensure end-to-end learning

In general, RL can be cumbersome to work with because it’s hard to debug. It’s always good to start out simple. Try implementing a simplified version of your custom environment. First, check that baseline RL algorithms can learn the simplified version. And then, gradually make the environment harder to see which parameters make the RL agent struggle to learn. This way you can guarantee that the agent’s learning will not be bottlenecked, and you will not be stressed out to see your agent suffering :)

Readers interested in learning more about RL can check out the resources below:


[1] Tedrake, R. Underactuated Robotics: Algorithms for Walking, Running, Swimming, Flying, and Manipulation (Course Notes for MIT 6.832). Downloaded on 19/12/2020 from

[2] Haarnoja, T., Zhou, A., Hartikainen, K., Tucker, G., Ha, S., Tan, J., Kumar, V., Zhu, H., Gupta, A., Abbeel, P., & Levine, S. (2018). Soft Actor-Critic Algorithms and Applications.

[3] Coumans, E., & Bai, Y. (2016–2020). PyBullet, a Python module for physics simulation for games, robotics and machine learning.

[4] Kalashnikov, D., Irpan, A., Pastor, P., Ibarz, J., Herzog, A., Jang, E., Quillen, D., Holly, E., Kalakrishnan, M., Vanhoucke, V., & Levine, S. (2018). QT-Opt: Scalable Deep Reinforcement Learning for Vision-Based Robotic Manipulation. CoRL, 1–23.

[5] Breyer, M., Furrer, F., Novkovic, T., Siegwart, R., & Nieto, J. (2018). Comparing Task Simplifications to Learn Closed-Loop Object Picking Using Deep Reinforcement Learning. IEEE Robotics and Automation Letters, 4(2), 1549–1556.

[6] Eysenbach, B., Kumar, A., & Gupta, A. (2020, October 13). Reinforcement learning is supervised learning on optimized data.

[7] Sutton, R. S., & Barto, A. G. (2018). Reinforcement Learning: An Introduction (2nd ed.). The MIT Press.
