An example of a C++ pipeline for LiDAR-based perception
Autonomous driving is a relatively new and fascinating field of modern technology. It was publicly showcased during the DARPA Grand Challenge in 2004, moved to a significantly more challenging city environment in the Urban Challenge of 2007, and has been pursued by industry and academia ever since.
The applications have swerved between personal self-driving cars, autonomous taxi fleets, trucking, delivery, and whatnot. Still, the technology is not there yet. One of the reasons autonomous driving has fallen into the trough of disillusionment is that the perception component of the driving software stack is a very complex problem. There are attempts to build camera-first perception, with Tesla and Wayve as prominent proponents, though the majority of teams go with LiDAR-based perception.
Solutions that rely on LiDAR can also be split into two categories: traditional computer vision algorithms that process point clouds, and deep learning-based methods. Neural networks promise to solve perception with high average accuracy. However, that is not sufficient if we also want to demonstrate reasonable accuracy in the worst-case scenario.
In the automotive industry, where the software must meet the ASIL D level of reliability, it is always good to have an independent, traditional, non-ML perception stack running on the vehicle and providing a high level of compliance with automotive standards. In this article, we are going to take a look at the implementation of such a stack made with the help of PCL, the open-source Point Cloud Library.
For starters, we are going to adhere to test-driven development (TDD) at the system level to make sure our entire code is thoroughly tested before the first field deployment happens. To do this, we need a dataset to run our code against. The classical Kitti dataset (2012) by the Karlsruhe Institute of Technology and the Toyota Technological Institute at Chicago fits this purpose very well.
It is one of the first sizable high-quality datasets gathered to serve as a benchmark for computer vision algorithms in the autonomous driving field. Kitti features several sub-datasets, including Detection, but we are more interested in the Tracking sub-dataset since we are going to leverage the temporal properties of the data as well as the GNSS localization data.
Kitti Tracking consists of 21 sequences of synchronized PNG images, Velodyne LiDAR scans, and NMEA records from the RT3003 GPS-IMU module. An important feature of the dataset is the thorough mutual calibration between the sensors, including the matrix “Tr_imu_velo”, which is the transformation from GPS-IMU coordinates to Velodyne LiDAR coordinates.
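As a side note, applying such a 3x4 calibration matrix with Eigen is straightforward. The helper below is purely illustrative and not part of the Kitti tooling or of the code discussed later:
#include <Eigen/Dense>

// Illustrative only: transform a point from GPS-IMU coordinates into
// Velodyne LiDAR coordinates using a 3x4 matrix such as "Tr_imu_velo"
// (a 3x3 rotation block followed by a translation column).
Eigen::Vector3f imu_to_velo(const Eigen::Matrix<float, 3, 4>& tr_imu_velo,
                            const Eigen::Vector3f& point_in_imu)
{
    return tr_imu_velo.leftCols<3>() * point_in_imu + tr_imu_velo.col(3);
}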
The architecture of the perception pipeline is as follows: decimation, multi-scan aggregation, ground removal, and finally clusterization with convex hull extraction.
Let’s talk about each component separately and dig deeper into its C++ implementation.
Why might we need to decimate the point cloud coming from our depth sensors, which could be one or several LiDARs? The most important requirement for autonomous driving software is the satisfaction of real-time operation constraints.
The first requirement is for the processing pipeline to keep up with the rate of LiDAR scan sampling. The scan rate may vary from 10 to 25 scans per second in real-life cases, which limits the maximum latency of any processing stage to between 100 ms and 40 ms. If, for any possible incoming point cloud, any of the stages, for example, clusterization, lags over 100 ms (for a 10 scans per second rate), either a frame drop will happen, or the total latency of the pipeline will start to grow arbitrarily.
One of the solutions here is to drop LiDAR points instead of dropping entire frames. This will gradually decrease the accuracy metrics (recall and precision) while keeping the pipeline operating in real time. The second requirement is the overall latency, or the reaction time, of the system. The total latency should be limited to at most 100 or 200 ms. Reaction times of 500 ms, let alone 1 second, are not acceptable for autonomous driving. Consequently, it makes sense to start the algorithm design by working with smaller amounts of LiDAR points, applying the decimation first.
The standard options for decimation are:
- regular downsampling, i.e., keeping every Nth point;
- random or pseudorandom downsampling;
- voxel grid filtering, e.g., with PCL’s pcl::VoxelGrid<> class.
Regular downsampling is fast but may lead to aliasing patterns in the point cloud. Random or pseudorandom downsampling is also fast but may lead to the unpredictable complete disappearance of small objects. Voxel grid filtering is smart and adaptive but takes extra compute and memory. As always, selecting one method over the other is a design choice.
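As a minimal sketch of the voxel grid option (assuming a recent PCL where pcl::PointCloud::Ptr is a std::shared_ptr, as in the rest of the code here, and with an illustrative 0.2 m leaf size), the decimation step could look like this:
#include <memory>
#include <pcl/point_cloud.h>
#include <pcl/point_types.h>
#include <pcl/filters/voxel_grid.h>

// A sketch of voxel grid decimation; the 0.2 m leaf size is an illustrative value.
pcl::PointCloud<pcl::PointXYZI>::Ptr decimate_voxel_grid(
    const pcl::PointCloud<pcl::PointXYZI>::Ptr& input_cloud_ptr)
{
    auto decimated_cloud_ptr = std::make_shared<pcl::PointCloud<pcl::PointXYZI> >();
    pcl::VoxelGrid<pcl::PointXYZI> voxel_grid;
    voxel_grid.setInputCloud(input_cloud_ptr);
    // Every occupied 0.2 x 0.2 x 0.2 m voxel is replaced by the centroid of its points.
    voxel_grid.setLeafSize(0.2f, 0.2f, 0.2f);
    voxel_grid.filter(*decimated_cloud_ptr);
    return decimated_cloud_ptr;
}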
Multi-scan aggregation is the process of registering several historical LiDAR scans into a common coordinate system while the ego-vehicle is moving relative to the ground. The common coordinate system can be either a local navigational frame (also known as ego-master coordinates) or the current LiDAR sensor coordinates.
We will be using the latter as an example. This stage is theoretically optional; however, it is very important in practice. The problem is that the following clusterization stage depends on the density of the LiDAR points, and if the density is insufficient, the effect of over-clusterization may come into play. Over-clusterization means that an object (a car, a bus, a building wall, etc.) may be split into several pieces.
In itself, this may not be a problem for detecting obstacles; however, for the module that is downstream of perception, tracking, over-clusterization imposes a substantial challenge. The tracker may inaccurately associate the pieces of an object and produce a track that will be extrapolated by the prediction module in a direction that crosses the ego-vehicle path, which in turn makes the ego car brake abruptly. We definitely do not want small errors in clusterization to create an avalanche of errors in the downstream components.
Aggregation of several consecutive scans (5 to 10) proportionally increases the density of LiDAR points falling on each object and facilitates accurate clusterization. One nice peculiarity about ego-vehicle movement is that the car is able to observe the same object from different views and that the LiDAR scanning pattern covers different parts of the object.
Let’s take a look at the code that performs the aggregation. The first stage is to keep a queue of bounded length with the historical point clouds along with the pose transformations between the subsequent scanner poses. Notice how we use both the translational velocity [Vx, Vy] and the rotational velocity Wz obtained from the RT3003 GPS-IMU module to construct the pose transformation.
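The snippet below relies on a transform typedef and a small container struct that pair each stored cloud with its transform to the next frame. Their exact definitions are not shown here, so the following is a plausible reconstruction for illustration only:
// Assumed definitions (reconstructed for illustration, not from the original code):
typedef Eigen::Transform<float, 3, Eigen::Affine, Eigen::DontAlign> Transform3f;

struct CloudAndMetadata
{
    pcl::PointCloud<pcl::PointXYZI>::Ptr cloud_ptr;
    Transform3f transform_to_next;
};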
// We accumulate the incoming scans along with their localization metadata
// into a deque to perform subsequent aggregation.
{
Transform3f next_veh_pose_vs_curr = Transform3f::Identity();
if (gpsimu_ptr)
{
float frame_interval_sec = 0.1f;
// First, we need to calculate yaw change given the yaw rate
// (angular speed around the Z axis) and the time interval between frames.
float angle_z = gpsimu_ptr->wz * frame_interval_sec;
auto rot = Eigen::AngleAxisf(angle_z, Eigen::Vector3f::UnitZ());
next_veh_pose_vs_curr.rotate(rot);
// Second, we need a translation transform to the next frame
// given the speed of the ego-vehicle and the frame interval.
next_veh_pose_vs_curr.translate(Eigen::Vector3f(
gpsimu_ptr->vf * frame_interval_sec,
gpsimu_ptr->vl * frame_interval_sec,
0.0f
));
}
// Since later we want to aggregate all scans into the coordinate
// frame of the latest scan, we need the inverse transform.
auto curr_veh_pose_vs_next = next_veh_pose_vs_curr.inverse();
// Put the resulting pair of the cloud and the transform into a queue.
auto cloud_and_metadata = CloudAndMetadata{decimated_cloud_ptr, curr_veh_pose_vs_next};
m_queue.push_back(cloud_and_metadata);
while (m_queue.size() > m_params->m_num_clouds)
{
m_queue.pop_front();
}
}
In the second stage, we traverse the queue from the latest scan backward in time, accumulating the scan-to-scan transformations into an aggregate transform and applying the aggregate transform to each historical frame. With this approach, the computational cost is O(N*D), where N is the number of points and D is the depth of the history (the number of scans).
// We accumulate the transforms starting from the latest back in time and
// transform each historical point cloud into the coordinates of the current frame.
auto aggregated_cloud_ptr = std::make_shared<pcl::PointCloud<pcl::PointXYZI> >();
Eigen::Matrix4f aggregated_transform = Eigen::Matrix4f::Identity();
for (int i = m_queue.size()-1; i >= 0; i--)
{
const auto& cloud_and_metadata = m_queue[i];
const auto& cloud_ptr = cloud_and_metadata.cloud_ptr;
const auto& trans = cloud_and_metadata.transform_to_next;
pcl::PointCloud<pcl::PointXYZI>::Ptr transformed_cloud_ptr;
if (i != static_cast<int>(m_queue.size()) - 1)
{
aggregated_transform *= trans.matrix();
transformed_cloud_ptr = std::make_shared<pcl::PointCloud<pcl::PointXYZI> >();
pcl::transformPointCloud(*cloud_ptr, *transformed_cloud_ptr, aggregated_transform);
}
else
{
// For the current scan no need to transform
transformed_cloud_ptr = cloud_ptr;
}
// Concatenate the transformed point cloud into the aggregate cloud
*aggregated_cloud_ptr += *transformed_cloud_ptr;
}
After the aggregation, the point clouds of the moving objects will look somewhat smeared. This is a setback that can be dealt with further down the road, at the clusterization step. What we need at this stage is a denser point cloud that accumulates information from multiple frames.
The purpose of the perception stack is to provide information about dynamic objects and stationary obstacles. A car is supposed to drive on the road, and normally the road surface is not considered an obstacle. Thus we can remove all the LiDAR points that got reflected from the road surface. To do this, we first detect the ground as a plane or as a curved surface and remove all points that lie within, say, 10 centimeters of that surface or below it. There are several ways to detect the ground in a point cloud; here we will use plane fitting with RANSAC.
Let’s take a deeper look at the C++ implementation of RANSAC with the help of the Eigen and PCL libraries. First of all, let’s define our candidate plane. We will use the representation of a base point plus a normal vector.
// A plane is represented with a point on the plane (base_point)
// and a normal vector to the plane.
struct Plane
{
Eigen::Vector3f base_point;
Eigen::Vector3f normal;
EIGEN_MAKE_ALIGNED_OPERATOR_NEW
};
Then we define a helper function that allows us to find indices of all points that satisfy the condition on the Z coordinate after the point cloud is transformed to the plane coordinates. The condition is given by a lambda function since we would like this helper to be as generic as required. Please follow the comments in the code to learn the details of the implementation.
// This helper function finds indices of points that are considered inliers,
// given a plane description and a condition on distance from the plane.
std::vector<size_t> find_inlier_indices(
const pcl::PointCloud<pcl::PointXYZ>::Ptr& input_cloud_ptr,
const Plane& plane,
std::function<bool(float)> condition_z_fn)
{
typedef Eigen::Transform<float, 3, Eigen::Affine, Eigen::DontAlign> Transform3f;
auto base_point = plane.base_point;
auto normal = plane.normal;
// Before rotation of the coordinate frame we need to relocate the point cloud to
// the position of base_point of the plane.
Transform3f world_to_ransac_base = Transform3f::Identity();
world_to_ransac_base.translate(-base_point);
auto ransac_base_cloud_ptr = std::make_shared<pcl::PointCloud<pcl::PointXYZ> >();
pcl::transformPointCloud(*input_cloud_ptr, *ransac_base_cloud_ptr, world_to_ransac_base);
// We are going to use a quaternion to determine the rotation transform
// which is required to rotate the coordinate system so that the plane's
// normal becomes aligned with the Z coordinate axis.
auto rotate_to_plane_quat = Eigen::Quaternionf::FromTwoVectors(
normal,
Eigen::Vector3f::UnitZ()
).normalized();
// Now we can create a rotation transform and align the cloud so that
// the candidate plane matches the XY plane.
Transform3f ransac_base_to_ransac = Transform3f::Identity();
ransac_base_to_ransac.rotate(rotate_to_plane_quat);
auto aligned_cloud_ptr = std::make_shared<pcl::PointCloud<pcl::PointXYZ> >();
pcl::transformPointCloud(*ransac_base_cloud_ptr, *aligned_cloud_ptr, ransac_base_to_ransac);
// Once the point cloud is transformed into the plane coordinates,
// we can apply a simple criterion on the Z coordinate to find inliers.
std::vector<size_t> indices;
for (size_t i_point = 0; i_point < aligned_cloud_ptr->size(); i_point++)
{
const auto& p = (*aligned_cloud_ptr)[i_point];
if (condition_z_fn(p.z))
{
indices.push_back(i_point);
}
}
return indices;
}
And finally, the main Ransac implementation is shown below. The first step will be to perform rough filtering of the points based on the Z coordinate. Also, we need to decimate the points once again since we do not need all the points in the aggregated cloud to validate a plane candidate. These operations can be performed in a single pass.
Next, we start the iterations. Every iteration samples 3 random points with the help of the std::mt19937 pseudo-random generator from the C++ standard library. For each triplet, we calculate the plane and make sure that its normal points upwards. Then we use the same helper function find_inlier_indices to calculate the number of inliers.
After the iterations have passed, we are left with the best plane candidate, which we finally use to copy all points of the point cloud whose indices are not present in the inlier list. Please notice the use of std::unordered_set<>. It allows a constant-time O(1) lookup of an index instead of the linear O(N) search that we would have with std::vector<>.
// This function performs plane detection with RANSAC sampling of planes
// that lie on triplets of points randomly sampled from the cloud.
// Among all trials the plane that is picked is the one that has the highest
// number of inliers. Inlier points are then removed as belonging to the ground.
auto remove_ground_ransac(
pcl::PointCloud<pcl::PointXYZ>::Ptr input_cloud_ptr)
{
// Threshold for rough point dropping by Z coordinate (meters)
const float rough_filter_thr = 0.5f;
// How much to decimate the input cloud for RANSAC sampling and inlier counting
const size_t decimation_rate = 10;
// Tolerance threshold on the distance of an inlier to the plane (meters)
const float ransac_tolerance = 0.1f;
// After the final plane is found this is the threshold below which all
// points are discarded as belonging to the ground.
const float remove_ground_threshold = 0.2f;
// To reduce the number of outliers (non-ground points) we can roughly crop
// the point cloud by Z coordinate in the range (-rough_filter_thr, rough_filter_thr).
// Simultaneously we perform decimation of the remaining points since the full
// point cloud is excessive for RANSAC.
std::mt19937::result_type decimation_seed = 41;
std::mt19937 rng_decimation(decimation_seed);
auto decimation_gen = std::bind(
std::uniform_int_distribution<size_t>(0, decimation_rate - 1), rng_decimation);
auto filtered_ptr = std::make_shared<pcl::PointCloud<pcl::PointXYZ> >();
for (const auto& p : *input_cloud_ptr)
{
if ((p.z > -rough_filter_thr) && (p.z < rough_filter_thr))
{
// Use random number generator to avoid introducing patterns
// (which are possible with structured subsampling
// like picking each Nth point).
if (decimation_gen() == 0)
{
filtered_ptr->push_back(p);
}
}
}
// We need a random number generator for sampling triplets of points.
std::mt19937::result_type sampling_seed = 42;
std::mt19937 sampling_rng(sampling_seed);
auto random_index_gen = std::bind(
std::uniform_int_distribution<size_t>(0, filtered_ptr->size() - 1), sampling_rng);
// Number of RANSAC trials
const size_t num_iterations = 25;
// The best plane is determined by a pair of (number of inliers, plane specification)
typedef std::pair<size_t, Plane> BestPair;
auto best = std::unique_ptr<BestPair>();
for (size_t i_iter = 0; i_iter < num_iterations; i_iter++)
{
// Sample 3 random points.
// pa is special in the sense that it becomes an anchor, the base_point of the plane
Eigen::Vector3f pa = (*filtered_ptr)[random_index_gen()].getVector3fMap();
Eigen::Vector3f pb = (*filtered_ptr)[random_index_gen()].getVector3fMap();
Eigen::Vector3f pc = (*filtered_ptr)[random_index_gen()].getVector3fMap();
// Here we figure out the normal to the plane which can be easily calculated
// as a normalized cross product.
auto vb = pb - pa;
auto vc = pc - pa;
Eigen::Vector3f normal = vb.cross(vc).normalized();
// Flip the normal if it points down
if (normal.dot(Eigen::Vector3f::UnitZ()) < 0)
{
normal = -normal;
}
Plane plane{pa, normal};
// Call find_inlier_indices to retrieve inlier indices.
// We will need only the number of inliers.
auto inlier_indices = find_inlier_indices(filtered_ptr, plane,
[ransac_tolerance](float z) -> bool {
return (z >= -ransac_tolerance) && (z <= ransac_tolerance);
});
// If a new best plane is found, update the best
bool found_new_best = false;
if (best)
{
if (inlier_indices.size() > best->first)
{
found_new_best = true;
}
}
else
{
// For the first trial update anyway
found_new_best = true;
}
if (found_new_best)
{
best = std::unique_ptr<BestPair>(new BestPair{inlier_indices.size(), plane});
}
}
// For the best plane filter out all the points that are
// below the plane + remove_ground_threshold.
pcl::PointCloud<pcl::PointXYZ>::Ptr cloud_no_ground_ptr;
if (best)
{
cloud_no_ground_ptr = std::make_shared<pcl::PointCloud<pcl::PointXYZ> >();
auto inlier_indices = find_inlier_indices(input_cloud_ptr, best->second,
[remove_ground_threshold](float z) -> bool {
return z <= remove_ground_threshold;
});
std::unordered_set<size_t> inlier_set(inlier_indices.begin(), inlier_indices.end());
for (size_t i_point = 0; i_point < input_cloud_ptr->size(); i_point++)
{
bool extract_non_ground = true;
if ((inlier_set.find(i_point) == inlier_set.end()) == extract_non_ground)
{
const auto& p = (*input_cloud_ptr)[i_point];
cloud_no_ground_ptr->push_back(p);
}
}
}
else
{
cloud_no_ground_ptr = input_cloud_ptr;
}
return cloud_no_ground_ptr;
}
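For reference, a hypothetical usage of remove_ground_ransac might look like the fragment below, assuming the aggregated cloud has already been converted to pcl::PointXYZ:
// Hypothetical usage fragment (requires <iostream> and <memory>).
auto cloud_ptr = std::make_shared<pcl::PointCloud<pcl::PointXYZ> >();
// ... fill cloud_ptr with the aggregated, decimated scan ...
auto cloud_no_ground_ptr = remove_ground_ransac(cloud_ptr);
std::cout << "Points before: " << cloud_ptr->size()
          << ", after ground removal: " << cloud_no_ground_ptr->size() << std::endl;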
Let’s see the results of the ground surface elimination.
Having the ground removed, we are ready to clusterize the remaining points and compress the object metadata by means of convex hull extraction. These two stages deserve their own article; I will introduce their implementation in part two, which is coming soon. Meanwhile, below is the final result of clusterization and convex hull extraction.
Convex hulls are definitely the type of metadata that any tracker is eager to accept as its input. They are more compact in terms of RAM usage and represent an object’s boundary more accurately than oriented bounding boxes. Here is another visualization that is taken from Kitti sequence 0003.
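The details are left for part two, but as a rough sketch (not the author’s implementation), clusterization and 2D convex hull extraction could be done with PCL’s pcl::EuclideanClusterExtraction and pcl::ConvexHull. The tolerance and cluster size parameters below are illustrative:
#include <memory>
#include <vector>
#include <pcl/point_cloud.h>
#include <pcl/point_types.h>
#include <pcl/common/io.h>
#include <pcl/search/kdtree.h>
#include <pcl/segmentation/extract_clusters.h>
#include <pcl/surface/convex_hull.h>

// Sketch: cluster the ground-free cloud and extract a 2D convex hull per cluster.
std::vector<pcl::PointCloud<pcl::PointXYZ> > extract_cluster_hulls(
    const pcl::PointCloud<pcl::PointXYZ>::Ptr& cloud_no_ground_ptr)
{
    auto tree = std::make_shared<pcl::search::KdTree<pcl::PointXYZ> >();
    tree->setInputCloud(cloud_no_ground_ptr);

    std::vector<pcl::PointIndices> cluster_indices;
    pcl::EuclideanClusterExtraction<pcl::PointXYZ> clusterizer;
    clusterizer.setClusterTolerance(0.5); // max distance between neighbors, meters
    clusterizer.setMinClusterSize(10);    // drop tiny clusters
    clusterizer.setSearchMethod(tree);
    clusterizer.setInputCloud(cloud_no_ground_ptr);
    clusterizer.extract(cluster_indices);

    std::vector<pcl::PointCloud<pcl::PointXYZ> > hulls;
    for (const auto& indices : cluster_indices)
    {
        auto cluster_ptr = std::make_shared<pcl::PointCloud<pcl::PointXYZ> >();
        pcl::copyPointCloud(*cloud_no_ground_ptr, indices, *cluster_ptr);
        // Flatten the cluster onto the ground plane so that the hull
        // represents the object's 2D footprint.
        for (auto& p : cluster_ptr->points)
        {
            p.z = 0.0f;
        }
        pcl::PointCloud<pcl::PointXYZ> hull_points;
        pcl::ConvexHull<pcl::PointXYZ> hull;
        hull.setInputCloud(cluster_ptr);
        hull.setDimension(2);
        hull.reconstruct(hull_points);
        hulls.push_back(hull_points);
    }
    return hulls;
}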
I am confident that Autonomous Driving will be a leap forward for humanity in terms of quality of life and overall productivity. It will finally make the “auto”-”mobile” a truly automated mobility device. Autonomous driving requires the highest level of software engineering. I hope many engineers at the start of their careers will be encouraged to develop their skills to perfection after reading this article.