Self-driving vehicles and other autonomous systems use lidar (light detection and ranging) systems to gather 3D information about their environment. The vehicle can then carry out perception tasks in real time to avoid obstacles. However, lidar data can only be used after it has been annotated, which means that a massive amount of data has to be tagged and verified by humans. This makes 3D point cloud annotation a challenging task.
I currently work at Evocargo, a company providing a cargo transportation service in supervised areas. For this service, we manufacture our own self-driving electric vehicles and develop the software that runs them.
Needless to say, safe driving is our top priority, so we need our vehicles to detect the drivable area in point clouds obtained from lidars.
We don’t need to label every object in the scene. For example, we are not that interested in trees or buildings, but we still need to exclude them from the drivable area. Annotating one lidar scan for our purposes can take up to 20 minutes. Some tasks can be sped up with advanced (and pricey) tools, but using them is not always economically viable.
Annotation engineers also have to label frames that are almost identical, which makes for plenty of repetitive work. Each new frame is only minimally shifted compared to the one before, but the tools don’t let us project the same annotation onto it, so every frame has to be processed manually.
We challenged ourselves to optimize the annotation process without losing quality. We used algorithms and the web-based, open-source Computer Vision Annotation Tool (CVAT) to make annotation faster and minimize repetitive operations for our engineers. Thanks to this work, we have increased our annotation performance from a few frames to 100 frames per hour. In this article, I’ll describe how we do it.
The basic idea is simple: we segment a whole batch of lidar scans at once. First, we apply filters and segmentation to point clouds in lidar scans and build a lidar map. Then we transform the map to BEV (bird’s eye view) and annotate the drivable area in a 2D image using CVAT. Finally, we project this annotation onto lidar scans.
To better illustrate each of the stages, I split the workflow into granular operations. Most of them are performed automatically in the background, and only two are manual.
Let’s see how this works when we process a lidar scan of a scene containing cars, a person, an obstacle on the drivable area and various other objects.
We label vehicles and people in point clouds and define whether they are dynamic or static. Open-source tools such as SUSTechPoints can be used to do this. This is one of two operations that our annotation engineers have to carry out manually. We can use these annotations later on to train neural networks to detect people and cars.
A person in the image is colored red, and vehicles are green.
We automatically export the annotation made at Stage 1 and run a ground segmentation algorithm, RANSAC or similar, on the point cloud.
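For readers who want to try this step themselves, here is a minimal sketch of RANSAC ground segmentation using Open3D’s built-in plane fitting. The file name and thresholds are illustrative, not the exact values we use in production.

```python
import open3d as o3d

# Load one lidar scan (the path is just an example).
pcd = o3d.io.read_point_cloud("scan_000001.pcd")

# Fit a single plane with RANSAC: points within 5 cm of the fitted
# plane are treated as ground candidates, the rest as obstacles.
plane_model, ground_idx = pcd.segment_plane(
    distance_threshold=0.05,  # meters
    ransac_n=3,
    num_iterations=1000,
)

ground = pcd.select_by_index(ground_idx)
obstacles = pcd.select_by_index(ground_idx, invert=True)
print(f"ground: {len(ground.points)} points, rest: {len(obstacles.points)} points")
```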
However, the algorithm doesn’t work perfectly and identifies some parts of the curb and cars as the ground.
The quality is not good enough for use in autonomous vehicles, so we further improve the segmentation result.
We project color from camera images of the same scene onto the ground points. This makes it easier to spot small objects and points, such as the curb and the pallet, which the algorithm had falsely assigned to the drivable area.
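Below is a rough sketch of how this projection can be done with NumPy. The camera matrix `K`, the lidar-to-camera transform `T_cam_lidar` and the function name are placeholders standing in for our calibration data, not a real API.

```python
import numpy as np

def colorize_ground_points(points_lidar, image, K, T_cam_lidar):
    """Assign RGB colors to lidar points that fall inside a camera image.

    points_lidar: (N, 3) ground points in the lidar frame
    image:        (H, W, 3) RGB image taken at the same moment
    K:            (3, 3) camera intrinsic matrix
    T_cam_lidar:  (4, 4) transform from the lidar frame to the camera frame
    """
    # Move the points into the camera frame.
    pts_h = np.hstack([points_lidar, np.ones((len(points_lidar), 1))])
    pts_cam = (T_cam_lidar @ pts_h.T).T[:, :3]

    # Project points in front of the camera onto the image plane.
    in_front = pts_cam[:, 2] > 0.1
    uv = (K @ pts_cam.T).T
    uv = uv[:, :2] / uv[:, 2:3]

    h, w = image.shape[:2]
    visible = (in_front
               & (uv[:, 0] >= 0) & (uv[:, 0] < w)
               & (uv[:, 1] >= 0) & (uv[:, 1] < h))

    # NaN marks points that did not receive a color (the "yellow" points).
    colors = np.full(points_lidar.shape, np.nan)
    u, v = uv[visible, 0].astype(int), uv[visible, 1].astype(int)
    colors[visible] = image[v, u] / 255.0
    return colors
```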
Now part of the drivable area has color — light grey mostly, as we have a road covered by a thin layer of snow. Some zones are left in yellow because they are out of the cameras’ view or the camera images are too noisy and we don’t want to rely on them. We keep this in mind when we return to coloring at Stage 5.
Using Open3D, we cut out dynamic objects and run the ICP registration algorithm, which accurately combines all the colored point clouds into one big point cloud.
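As an illustration, pairwise point-to-plane ICP in Open3D looks roughly like the sketch below. The correspondence distance, normal-estimation radius and voxel size are example values, and chaining plain pairwise transforms is a simplification of the real pipeline.

```python
import copy
import numpy as np
import open3d as o3d

def register_pair(source, target, init=np.eye(4), max_corr_dist=0.2):
    """Estimate the transform that aligns `source` onto `target` with ICP."""
    # Point-to-plane ICP needs normals on the target cloud.
    target.estimate_normals(
        o3d.geometry.KDTreeSearchParamHybrid(radius=0.5, max_nn=30))
    result = o3d.pipelines.registration.registration_icp(
        source, target, max_corr_dist, init,
        o3d.pipelines.registration.TransformationEstimationPointToPlane())
    return result.transformation

def build_map(scans):
    """Chain pairwise transforms and merge all scans into one map."""
    lidar_map = copy.deepcopy(scans[0])
    pose = np.eye(4)
    for prev, cur in zip(scans, scans[1:]):
        pose = pose @ register_pair(cur, prev)
        lidar_map += copy.deepcopy(cur).transform(pose)
    return lidar_map.voxel_down_sample(voxel_size=0.05)
```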
We still have yellow points that haven’t received a color from the camera images. We know that they belong to the drivable area, because the segmentation algorithm at Stage 2 identified them as ground and because the same spots were in camera range and were colored in other scans. So, after joining the scans into one map, we look for the closest colored point for every colorless point. If the distance to that colored point is smaller than a threshold we set (a few centimeters), we project its color onto the colorless point.
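A sketch of this nearest-neighbor color fill with Open3D’s KD-tree is shown below. It assumes uncolored points are marked with NaN colors, as in the projection sketch above, and uses a 5 cm threshold as an example.

```python
import numpy as np
import open3d as o3d

def fill_missing_colors(lidar_map, max_dist=0.05):
    """Copy color to uncolored points from the nearest colored neighbor."""
    points = np.asarray(lidar_map.points)
    colors = np.asarray(lidar_map.colors)
    uncolored = np.isnan(colors).any(axis=1)

    # Build a KD-tree over the colored points only.
    colored_cloud = lidar_map.select_by_index(np.where(~uncolored)[0].tolist())
    colored_colors = np.asarray(colored_cloud.colors)
    tree = o3d.geometry.KDTreeFlann(colored_cloud)

    # A plain loop keeps the idea clear; a batched search would be faster.
    for i in np.where(uncolored)[0]:
        # Nearest colored neighbor and the squared distance to it.
        _, idx, dist2 = tree.search_knn_vector_3d(points[i], 1)
        if dist2[0] <= max_dist ** 2:
            colors[i] = colored_colors[idx[0]]

    lidar_map.colors = o3d.utility.Vector3dVector(colors)
    return lidar_map
```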
Tree tops, wires, and other objects high above the ground may obstruct the view of the drivable area when it is seen from above and make the annotation engineers’ work more difficult. So we filter out all the points that are more than 1 meter above the ground and remove them from the map. Points between 30 centimeters and 1 meter above the ground we color violet.
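In code, this height filter is a couple of NumPy masks. The sketch below assumes the map’s z axis points up and the ground sits near z = 0:

```python
import numpy as np

def filter_by_height(lidar_map):
    """Drop points above 1 m and color the 0.3-1 m band violet."""
    points = np.asarray(lidar_map.points)
    colors = np.asarray(lidar_map.colors)  # view: edits write back to the map
    height = points[:, 2]

    # Highlight low obstacles (curbs, pallets, car bodies) in violet.
    violet = (height > 0.3) & (height <= 1.0)
    colors[violet] = [0.5, 0.0, 0.5]

    # Remove everything more than 1 m above the ground.
    keep = np.where(height <= 1.0)[0].tolist()
    return lidar_map.select_by_index(keep)
```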
At the same time, we transform the point cloud to BEV to get a 2D image that our annotation engineers can process later in CVAT.
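The BEV transform itself can be as simple as rasterizing the x-y coordinates into a pixel grid. The resolution below is an example value, and the returned origin is kept because we need it later to map the CVAT polygon back into the point cloud.

```python
import numpy as np

def point_cloud_to_bev(points, colors, resolution=0.05):
    """Rasterize a colored point cloud into a top-down RGB image.

    resolution: size of one pixel in meters.
    Returns the image and the map coordinates of its lower-left corner.
    """
    xy = points[:, :2]
    origin = xy.min(axis=0)
    pix = ((xy - origin) / resolution).astype(int)

    h, w = pix[:, 1].max() + 1, pix[:, 0].max() + 1
    image = np.zeros((h, w, 3), dtype=np.uint8)

    # Sort by height so the highest point in each cell is written last
    # and stays visible, as it would be when seen from above.
    order = np.argsort(points[:, 2])
    image[pix[order, 1], pix[order, 0]] = (colors[order] * 255).astype(np.uint8)

    # Flip vertically so "up" in the image corresponds to +y on the map.
    return image[::-1], origin
```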
After transforming the image to BEV, we apply a median filter, which removes tiny gaps in the roadbed and makes the image smoother. The effect is hard to see on the snowy road image we are working with in this article, so here are images of the roadbed from another scene that illustrate it better.
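With OpenCV this step is a one-liner; the file names and the 5 × 5 kernel below are illustrative values.

```python
import cv2

# bev.png is the raster produced at the previous stage (name is illustrative).
bev_image = cv2.imread("bev.png")

# A small median kernel closes pin-hole gaps left by sparse lidar returns
# without smearing object boundaries.
bev_smoothed = cv2.medianBlur(bev_image, 5)
cv2.imwrite("bev_for_cvat.png", bev_smoothed)
```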
Our annotation engineers label the drivable area in CVAT. This is the second operation that they have to perform manually.
We project the 2D polygon from the CVAT annotation onto the point cloud. To do this, we transform the 2D CVAT polygon into a polygon mesh. We are only interested in points belonging to the drivable area, so we label points between -0.3 and 0.7 meters as ground and remove all points outside this range.
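Here is a simplified sketch of that back-projection. Instead of a polygon mesh it uses a plain point-in-polygon test via matplotlib.path, which serves the same purpose for a flat, top-down polygon. It assumes the CVAT polygon is exported as pixel coordinates and reuses the origin and resolution from the BEV rasterization sketch above.

```python
import numpy as np
from matplotlib.path import Path

def drivable_mask(points, polygon_px, origin, resolution, bev_height):
    """Mark points whose top-down projection lies inside the CVAT polygon.

    points:     (N, 3) points of the lidar map
    polygon_px: (M, 2) polygon vertices in BEV pixel coordinates
    origin:     map coordinates of the BEV image's lower-left corner
    bev_height: image height in pixels (needed to undo the vertical flip)
    """
    # Convert polygon vertices from pixels back to map coordinates (meters).
    poly_xy = np.empty_like(polygon_px, dtype=float)
    poly_xy[:, 0] = polygon_px[:, 0] * resolution + origin[0]
    poly_xy[:, 1] = (bev_height - 1 - polygon_px[:, 1]) * resolution + origin[1]

    inside = Path(poly_xy).contains_points(points[:, :2])

    # Keep only near-ground points, between -0.3 m and 0.7 m.
    near_ground = (points[:, 2] > -0.3) & (points[:, 2] < 0.7)
    return inside & near_ground
```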
This is our resulting point cloud with drivable area segmentation.
The batch annotation workflow I’ve described in this article can be applied to any autonomous system. It speeds up annotation many times over: we have boosted our performance from a few lidar scans per hour to one hundred. Smaller projects can also benefit from this workflow, since it doesn’t require pricey tools for segmenting the drivable area in point clouds.
Our annotation engineers no longer have to carry out repetitive work, and the annotated dataset can be used to train neural networks to detect objects.