Can Docker Keep Up with the Speed of Self-Driving Cars?

Written by containerize | Published 2025/10/15
Tech Story Tags: microservice-architecture | containerization-technology | autonomous-driving-software | robot-operating-system | software-defined-vehicles | latency-reduction-techniques | docker-and-kubernetes | edge-computing

TL;DR: To improve modularity, scalability, and real-time performance, this study proposes a microservice-based architecture for software-defined autonomous vehicles that makes use of containerization. The proposed solution, which builds on the ROS 2 framework and the Autoware software, separates functional modules such as sensing, perception, and planning into lightweight containers orchestrated with Docker and k3s.

Authors:

(1) Tobias Betz, Technical University of Munich, Germany;

(2) Long Wen, Technical University of Munich, Germany;

(3) Fengjunjie Pan, Technical University of Munich, Germany;

(4) Gemb Kaljavesi, Technical University of Munich, Germany;

(5) Alexander Zuepke, Technical University of Munich, Germany;

(6) Andrea Bastoni, Technical University of Munich, Germany;

(7) Marco Caccamo, Technical University of Munich, Germany;

(8) Alois Knoll, Technical University of Munich, Germany;

(9) Johannes Betz, Technical University of Munich, Germany.

Abstract and I. Introduction

II. Related Work

III. Microservice Architecture for an Autonomous Driving Software

IV. Experiments

V. Results

VI. Discussion

VII. Conclusion, Acknowledgments, and References

IV. EXPERIMENTS

A typical ROS 2 application can be abstracted into several layers, ranging from high-level applications down to the underlying hardware. We define the layers as depicted in Fig. 3. With the use of containerization, the container runtime adds an additional layer. Positioned above the operating system, this layer facilitates the creation, execution, and management of containers, i.e., executable software packages that encapsulate an application and its dependencies. Our study focuses on understanding the influence of containerization on ROS 2 applications. Specifically, our experiments were systematically designed with increasing complexity:

• DDS Communication: This experiment examines the pure communication performance of DDS in isolation.

• ROS 2: A publish/subscribe example is introduced to observe the performance implications of DDS and ROS 2.

• Real-World Autonomous Driving Application: Incorporates the impact of containerization on the developed microservice architecture.

For each experiment, we evaluated the three deployments of increasing isolation (Fig. 1). The first scenario (bare-metal) serves as the reference point: the tests run natively on the system without containerization. In the second scenario (single-container), we ran the tests within a single container to measure the basic overhead introduced by containerization. In the third scenario (multi-container), the respective benchmark components were placed in separate containers.

In this section, we first introduce our hardware setup and the specific configurations of our containerization architecture. Afterward, we describe the DDS, ROS 2, and Autoware benchmark setups with their individual metrics.

A. Hardware Setup and Software Configurations

All experiments were conducted on two distinct computing platforms (one x86 and one aarch64 (Armv8)), as depicted in Table II. The two platforms are representative of autonomous driving platforms for SDVs [40] and use the same GPU, OS, and kernel version. On the x86 computing system, we disabled hyperthreading to minimize potential performance fluctuations. The experiments are performed with ROS 2 Humble Hawksbill on top of the Eclipse CycloneDDS middleware [41]. We chose Docker (version 24.0.5) as the containerization technology due to its advanced GPU integration capabilities, which provide an advantage over alternative solutions such as Podman. To orchestrate the microservice architecture, we utilized k3s (version v1.27.3+k3s1) to deploy and manage the containers. We employed the nvidia-docker2 package to enable GPU support for Docker and the nvidia-device-plugin for k3s. The container pods are configured to communicate over the local host network, and no CPU requests or limits are set in the configuration. The standard Linux Completely Fair Scheduler (CFS) is used for every experiment. Despite not being a “true” real-time setup, we are interested in replicating a soft real-time environment that reflects the typical setups for software-defined architectures adopted by practitioners [1], [42].
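For illustration, the sketch below shows how such a pod could be declared with the official Kubernetes Python client: host networking enabled, a GPU exposed through the nvidia-device-plugin, and no CPU requests or limits. The pod name, container image, and launch file are hypothetical placeholders, not the authors' actual manifests.

```python
from kubernetes import client, config

config.load_kube_config()  # use the local k3s/kubeconfig credentials

pod = client.V1Pod(
    metadata=client.V1ObjectMeta(name="perception", labels={"app": "autoware"}),
    spec=client.V1PodSpec(
        host_network=True,  # pods communicate over the local host network
        containers=[
            client.V1Container(
                name="perception",
                image="ghcr.io/example/autoware-perception:humble",  # hypothetical image
                command=["ros2", "launch", "perception.launch.xml"],  # hypothetical entry point
                resources=client.V1ResourceRequirements(
                    # GPU served by the nvidia-device-plugin; no CPU requests or
                    # limits, so scheduling is left to the default Linux CFS.
                    limits={"nvidia.com/gpu": "1"}
                ),
            )
        ],
    ),
)

client.CoreV1Api().create_namespaced_pod(namespace="default", body=pod)
```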

B. Benchmarks

1) DDS Communication: To test DDS communication, we use the ddsperf benchmark from Eclipse CycloneDDS. This benchmark focuses purely on DDS communication, as it skips the ROS 2 abstraction layer. This approach enables us to investigate the influence of containers on pure DDS communication. The experiment uses a straightforward “ping pong” communication pattern to analyze containerization’s impact on DDS performance. This pattern consists of continuously sending a message of a defined size back and forth between two nodes. In the multi-container scenario, each node is placed in an individual container. CycloneDDS can be configured in two modes: reliable and best-effort. In the best-effort setting, a publisher sends messages without any assurance that the recipient will receive them correctly. Conversely, in reliable mode, the publisher continues sending messages until it receives an acknowledgment from the subscriber indicating successful reception. Given that best-effort is the default setting for most nodes in the Autoware software, we opt for this mode in our study. Another crucial aspect was the variation in message size: starting at 1 kB, the message size was repeatedly doubled up to 8 MB to analyze the impact on performance across a spectrum of message sizes. This variation allowed us to assess the scalability and efficiency of DDS communication under different load conditions. Finally, each test was run three times to ensure reproducibility and consistency of results. Each run with a given message size lasted 30 minutes. This time period was chosen, in particular, to ensure that a sufficient number of packets could still be exchanged during the tests with the largest message sizes.
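As a rough sketch of this sweep, the script below doubles the message size from 1 kB to 8 MB and repeats each 30-minute ping run three times. The exact ddsperf arguments used here (the -D duration option and the ping size mode, with a pong peer running on the other node) are assumptions and should be verified against the installed CycloneDDS version.

```python
import subprocess

# Message sizes doubled from 1 kB up to 8 MB (1, 2, 4, ..., 8192 kB -> 14 sizes).
sizes_kb = [1 << i for i in range(14)]
DURATION_S = 30 * 60   # each run lasts 30 minutes
REPETITIONS = 3        # each test repeated three times

for size_kb in sizes_kb:
    for rep in range(REPETITIONS):
        # One side of the "ping pong" pattern; the peer runs `ddsperf pong`,
        # e.g., in a second container for the multi-container scenario.
        # NOTE: the argument syntax is an assumption -- check `ddsperf -h`.
        cmd = ["ddsperf", "-D", str(DURATION_S), "ping", "size", f"{size_kb}k"]
        print(f"run {rep + 1}/{REPETITIONS}, size {size_kb} kB:", " ".join(cmd))
        subprocess.run(cmd, check=True)
```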

2) ROS 2: We used the NVIDIA-ISAAC-ROS ros2_benchmark from [34] to evaluate the impact of containerization on simple ROS 2 applications. This benchmark framework allows testing several example ROS 2 graphs. From the ros2_benchmark, we chose the AprilTag [43] node as a reference for our evaluation. The benchmark includes a playback node that sends camera data, which is in turn processed by the AprilTag detection node. The benchmark also comprises a data-loader node that loads the r2b rosbag data into a buffer and sends it to the playback node, as well as a monitoring node for benchmark-internal evaluations (e.g., CPU monitoring). In the bare-metal configuration, we run the entire framework without changes to the system. In the single-container configuration, we put the playback and detector nodes inside a single container, whereas in the multi-container configuration, we separate both nodes into individual containers. We let the benchmark complete a total of 100 runs for each deployment type. Each individual run consists of 5 internal iterations. The benchmark then outputs statistics over the five iterations, which we merge accordingly across the 100 runs. In our experiments, the benchmark tests four different setups in terms of the publishing frequency of the playback node: 10 fps (100 ms), 30 fps (33.3 ms), 60 fps (16.7 ms), and an additional setup in which the system is configured to achieve maximum throughput. With increasing frame rate, the workload on the system also grows; therefore, different stress levels of the system can be evaluated.
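To illustrate the kind of publish path being measured, the following minimal rclpy sketch publishes image messages at a configurable frame rate, similar in spirit to the playback node. It is not part of the ros2_benchmark framework; the node name, topic, and message contents are illustrative only.

```python
# Minimal rclpy sketch of a fixed-rate publisher (10/30/60 fps).
import rclpy
from rclpy.node import Node
from sensor_msgs.msg import Image


class FixedRatePlayback(Node):
    def __init__(self, fps: float = 30.0):
        super().__init__("playback")
        self.pub = self.create_publisher(Image, "camera/raw", 10)
        # 30 fps -> a timer period of 33.3 ms, matching the benchmark setups.
        self.timer = self.create_timer(1.0 / fps, self.tick)
        self.frame = Image()  # placeholder payload; the benchmark replays r2b data

    def tick(self):
        self.pub.publish(self.frame)


def main():
    rclpy.init()
    node = FixedRatePlayback(fps=30.0)
    rclpy.spin(node)
    node.destroy_node()
    rclpy.shutdown()


if __name__ == "__main__":
    main()
```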

C. Real-World Autonomous Driving Application

We evaluated the performance impact of containerization on Autoware in the microservice architecture presented in Section III. In the bare-metal setup, the Autoware software is built natively on the system and launched accordingly. In the container environments, Autoware is installed inside a single container. The launch command of the bare-metal variant is defined as the entry point of the container and can then be started with k3s. For the microservice architecture, as previously described, each module has its individual launch command defined in the entry point of its container. For all three deployment variants, it is guaranteed that the same software version is compared. We leverage the orchestration framework proposed in [37] to simulate the deployed Autoware variants in a closed loop using the AWSIM environment. The Autoware software is executed standalone on the described compute platforms, while the simulation is executed on a different compute unit. The vehicle drives a defined test route in Nishi-Shinjuku in Tokyo, Japan. Traffic participants were removed from the simulation because they cannot be simulated in a reproducible manner. Each experiment is repeated until 100 valid runs can be evaluated. Each test drive takes approximately two minutes to reach the goal pose.

D. Metrics

It is important to define metrics at both the application and system levels to analyze the impact of containerization. Such metrics provide valuable insights into resource utilization and help identify the latency impact induced by containerization. However, benchmarks are often published with their own predefined metrics, which makes it difficult to evaluate all experiments with one consistent set. In the following, we describe in more detail the metrics used for each experiment.

1) DDS Communication: The benchmark provides the throughput of packets sent during the test period. In addition, the round trip latency is displayed, which is the time it takes for a message to be sent from the source node to the destination node and back again. The benchmark does not provide the CPU load during the execution. After the tests, we calculate the average round trip time and the average throughput.

2) ROS 2: The framework outputs different metrics for each test node. We evaluate the mean end-to-end latency from sending the raw data until the test node generates an output. This metric is calculated internally in the benchmark via tracing points. The mean jitter of the corresponding node is also measured. Additionally, the framework provides insight into CPU utilization, for which we evaluate the average over the test runs.
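A hedged sketch of how such per-run statistics could be merged across the 100 runs is shown below. The jitter definition used here (mean absolute difference between consecutive latency samples within a run) is an assumption; ros2_benchmark may compute it differently.

```python
from statistics import mean


def merge_runs(latencies_ms: list[list[float]], cpu_util: list[float]) -> dict:
    """Merge per-run latency samples and per-run CPU readings into summary stats."""
    all_samples = [x for run in latencies_ms for x in run]
    per_run_jitter = [
        mean(abs(b - a) for a, b in zip(run, run[1:])) for run in latencies_ms
    ]
    return {
        "mean_latency_ms": mean(all_samples),
        "mean_jitter_ms": mean(per_run_jitter),
        "mean_cpu_percent": mean(cpu_util),
    }


# Example: three runs of five iterations each, plus one CPU reading per run
# (values are made up for illustration).
print(merge_runs([[12.1, 11.8, 12.4, 12.0, 11.9]] * 3, [35.2, 36.0, 34.7]))
```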

3) Real-World Autonomous Driving Application: The complex ROS 2 Autoware setup is evaluated using the data-age (end-to-end latency) metric, which was shown in [44] to be equivalent to the reaction time. It is the average of the path durations with the same sensor input. For this, the framework of [29] is used, which can determine the end-to-end latency for Autoware accordingly. The computation is based on ros2_tracing, which places corresponding trace points in the rclcpp client library of the ROS 2 middleware. To enable tracing in the containerized architecture of Autoware, it was necessary to mount specific LTTng-related files from the host system into each of the containers; inside the containers, ros2_tracing must be enabled. For both the bare-metal and the containerized measurements, the tracing session was executed on the host system. The framework computes the total end-to-end latency as well as its individual components (a small summation sketch follows the list below):

• The idle latency or intra-node communication latency defines the time between a subscription callback and a timer callback of a ROS 2 node.

• The communication latency is the time between publishing and receiving a ROS 2 message via a subscription callback. It corresponds approximately to the time needed for the DDS communication.

• The compute latency describes the time it takes to process the input from a subscription and publish the corresponding output data to the subsequent node.
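The sketch below illustrates, under the assumption that the three components simply add up along the selected chain, how a total end-to-end latency could be obtained; the field names and numbers are illustrative and not output of the tracing framework.

```python
from dataclasses import dataclass


@dataclass
class NodeLatency:
    idle_ms: float     # intra-node: subscription callback -> timer callback
    comm_ms: float     # publish -> subscription callback (approx. DDS time)
    compute_ms: float  # subscription input -> published output


def end_to_end_latency(chain: list[NodeLatency]) -> float:
    """Sum the idle, communication, and compute latencies along one chain."""
    return sum(n.idle_ms + n.comm_ms + n.compute_ms for n in chain)


# Example chain of three nodes (values are made up for illustration).
chain = [
    NodeLatency(0.5, 1.2, 8.0),
    NodeLatency(0.0, 0.9, 15.3),
    NodeLatency(2.1, 1.0, 4.4),
]
print(f"end-to-end latency: {end_to_end_latency(chain):.1f} ms")
```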

Since Autoware consists of a large number of individual computational chains, we selected a single chain for evaluating latency. This chain, detailed in Table III, was chosen to traverse as many containers as possible for a more accurate assessment of their influence. Furthermore, it represents the critical path with the highest latency in the application. The quality-of-service setting is configured to “keep last,” operating in best-effort mode with a queue length of 1. To measure the CPU and memory utilization of Autoware, we recorded the process status using the Linux ps tool, capturing the information for all processes every 200 ms. As we are interested in the influence of the containerized ROS 2 application, the recorded processes are filtered after each session down to the ROS 2, Docker, Kubernetes, and Autoware processes.
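A minimal sketch of this sampling procedure is given below: ps is queried every 200 ms during a test drive, the raw output is logged, and the log is filtered afterwards. The process-name patterns are assumptions, not the authors' exact filter.

```python
import subprocess
import time


def sample_all() -> list[str]:
    """One `ps` snapshot of all processes (pid, command, CPU %, memory %)."""
    return subprocess.run(
        ["ps", "-eo", "pid,comm,%cpu,%mem", "--no-headers"],
        capture_output=True, text=True, check=True,
    ).stdout.splitlines()


def record(path: str, duration_s: float, period_s: float = 0.2) -> None:
    """Record all processes every 200 ms for the duration of a test drive."""
    end = time.monotonic() + duration_s
    with open(path, "w") as log:
        while time.monotonic() < end:
            stamp = time.time()
            for line in sample_all():
                log.write(f"{stamp:.3f} {line}\n")
            time.sleep(period_s)


# After the session, keep only the processes of interest (patterns are assumptions).
PATTERNS = ("ros", "docker", "containerd", "k3s", "autoware")


def filter_log(path: str) -> list[str]:
    with open(path) as log:
        return [line for line in log if any(p in line.lower() for p in PATTERNS)]
```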

This paper is available on arxiv under CC by 4.0 Deed (Attribution 4.0 International) license.

