Towards a distributed and real-time framework for robots

Evaluation of ROS 2.0 communications for real-time robotic applications

The content of this article comes from “Towards a distributed and real-time framework for robots: Evaluation of ROS 2.0 communications for real-time robotic applications”, available at https://arxiv.org/pdf/1809.02595.pdf. Co-written with Carlos San Vicente Gutiérrez, Lander Usategui San Juan and Irati Zamalloa Ugarte.

In this work we present an experimental setup to show the suitability of ROS 2.0 for real-time robotic applications. We evaluate ROS 2.0 communications in a robotic inter-component (hardware) communication case on top of Linux. We benchmark and study the worst-case latencies and missed deadlines to characterize ROS 2.0 communications for real-time applications. We demonstrate experimentally how computation and network congestion impact the communication latencies and, ultimately, propose a setup that, under certain conditions, mitigates these delays and obtains bounded traffic.

Introduction

In robotic systems, tasks often need to be executed with strict timing requirements. For example, in the case of a mobile robot, if the controller has to be responsive to external events, an excessive delay may have undesirable consequences. Moreover, if the robot is moving at a certain speed and needs to avoid an obstacle, it must detect the obstacle and stop or correct its trajectory within a certain amount of time. Otherwise, it would likely collide and disrupt the execution of the required task. These kinds of situations are rather common in robotics and must be handled within well-defined timing constraints that usually require real-time capabilities. Such systems often have timing requirements to execute tasks or exchange data over the internal network of the robot, as is common in distributed systems. This is, for example, the case of the Robot Operating System (ROS) [1]. Not meeting the timing requirements implies that either the system’s behavior will degrade or the system will fail.

Real-time systems can be classified depending on how critical it is to meet the corresponding timing constraints. For hard real-time systems, missing a deadline is considered a system failure. Examples of such real-time systems are anti-lock brakes or aircraft control systems. Firm real-time systems, on the other hand, are more relaxed. Information or a computation delivered after missing a deadline is considered invalid, but it does not necessarily lead to system failure. In this case, missing deadlines could degrade the performance of the system. In other words, the system can tolerate a certain number of missed deadlines before failing. Examples of firm real-time systems include most professional and industrial robot control systems, such as the control loops of collaborative robot arms, aerial robot autopilots or most mobile robots, including self-driving vehicles. Finally, in the case of soft real-time systems, results delivered after a missed deadline remain useful. This implies that soft real-time systems do not necessarily fail due to missed deadlines; instead, missed deadlines degrade the usefulness of the real-time task in execution. Examples of soft real-time systems are telepresence robots of any kind (audio, video, etc.).
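The three classes differ mainly in what a late result is worth. The following is a minimal sketch of that distinction; the names `RTClass` and `outcome` are hypothetical and do not come from the original publication.

```python
from enum import Enum


class RTClass(Enum):
    HARD = "hard"
    FIRM = "firm"
    SOFT = "soft"


def outcome(rt_class: RTClass, latency_us: float, deadline_us: float) -> str:
    """Describe what happens to one result, given its latency and deadline."""
    if latency_us <= deadline_us:
        return "on time: result used"
    if rt_class is RTClass.HARD:
        # A single miss already counts as a failure of the whole system.
        return "deadline missed: system failure"
    if rt_class is RTClass.FIRM:
        # The late result is invalid; a limited number of misses is tolerated.
        return "deadline missed: result discarded, performance degrades"
    # Soft real-time: the late result is still consumed.
    return "deadline missed: late result still used, usefulness degrades"
```

For instance, `outcome(RTClass.FIRM, 120, 100)` describes a control loop that drops a sample arriving 20 microseconds late but keeps running.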

As ROS became the standard software infrastructure for the development of robotic applications, there was an increasing demand in the ROS community to include real-time capabilities in the framework. In response, ROS 2.0 was created to be able to deliver real-time performance. However, as covered in previous work [2] and [3], ROS 2.0 itself needs to be surrounded with the appropriate elements to deliver a complete distributed and real-time solution for robots.

For distributed real-time systems, communications need to provide Quality of Service (QoS) capabilities in order to guarantee deterministic end-to-end communications. ROS 2.0 uses the Data Distribution Service (DDS) as its communication middleware. DDS contains configurable QoS parameters which can be tuned for real-time applications. Commonly, DDS implementations use the Real Time Publish Subscribe protocol (RTPS) as their wire protocol, which is typically carried over the well-known User Datagram Protocol (UDP). In Linux-based systems, DDS implementations typically use the Linux Networking Stack (LNS) for communications over Ethernet.

In previous work [2], we analyzed the use of layer 2 Quality of Service (QoS) techniques such as packet prioritization and Time Sensitive Networking (TSN) scheduling mechanisms to bound end-to-end latencies in Ethernet switched networks. In [3], we analyzed the real-time performance of the LNS in a Linux PREEMPT-RT kernel and observed some of the current limitations for deterministic communications over the LNS in mixed-critical traffic scenarios. The next logical step was to analyze the real-time performance of ROS 2.0 communications in a PREEMPT-RT kernel over Ethernet. Previous work [4], which investigated the performance of ROS 2.0 communications, showed promising results and discussed future improvements. However, that study does not explore the suitability of ROS 2.0 for real-time applications, and its evaluation was not performed on an embedded platform.

In this work, we focus on the evaluation of ROS 2.0 communications in a robotic inter-component communication use case. For this purpose, we present a setup and a set of benchmarks in which we measure the end-to-end latencies between two ROS 2.0 nodes running under different static load conditions. We focus our attention on worst-case latencies and missed deadlines to assess the suitability of ROS 2.0 communications for real-time applications. We also try to show the impact of different stressing conditions on ROS 2.0 traffic latencies. Ultimately, we attempt to find a suitable configuration to improve the determinism of ROS 2.0 and establish the limits of such a setup on an embedded platform.
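To make the two metrics concrete, here is a minimal sketch of how worst-case latency and missed deadlines could be derived from a list of measured end-to-end latencies. The names `LatencyReport` and `summarize` are hypothetical, and this is not the benchmarking tool used in the original publication.

```python
from typing import Iterable, NamedTuple


class LatencyReport(NamedTuple):
    samples: int
    min_us: float
    avg_us: float
    max_us: float            # worst case observed in this run
    missed_deadlines: int    # samples that exceeded the deadline


def summarize(latencies_us: Iterable[float], deadline_us: float) -> LatencyReport:
    """Summarize measured latencies (in microseconds) against one deadline."""
    values = list(latencies_us)
    if not values:
        raise ValueError("no latency samples")
    missed = sum(1 for v in values if v > deadline_us)
    return LatencyReport(
        samples=len(values),
        min_us=min(values),
        avg_us=sum(values) / len(values),
        max_us=max(values),
        missed_deadlines=missed,
    )
```

With made-up samples, `summarize([310.0, 295.5, 1200.0, 402.0], deadline_us=1000.0)` reports a worst case of 1200.0 microseconds and one missed deadline. Note that the maximum of a finite run is only an observed worst case, not a guaranteed bound.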

A bit of background

Overview of ROS 2 stack for machine to machine communications over Ethernet

ROS is a framework for the development of robot applications. It is a toolbox filled with utilities for robots, such as a communication infrastructure including standard message definitions, drivers for a variety of software and hardware components, and libraries for diagnostics, navigation, manipulation and many more. Altogether, ROS simplifies the task of creating complex and robust robot behavior across a wide variety of robotic platforms. ROS 2.0 is the new version of ROS, which extends the initial concept (originally meant for purely research purposes) and aims to provide a distributed and modular solution for situations involving teams of robots, real-time systems or production environments, among others. Among the technical novelties introduced in ROS 2.0, Open Robotics explored several options for the ROS 2.0 communication system. They decided to use the DDS middleware due to its characteristics and benefits compared to other solutions. As documented in [5], the benefit of using an end-to-end middleware, such as DDS, is that there is less code to maintain. DDS is used as the communications middleware in ROS 2.0 and it typically runs as userspace code. Because DDS is defined by published standards, third parties can review, audit, and implement the middleware with varying degrees of interoperability.

As pointed out in the technical report [6], real-time performance requires both deterministic user code and a real-time operating system. In our case, we use a PREEMPT-RT patched Linux kernel as the core of our operating system for the experiments. Following the PREEMPT-RT programming guidelines and with a suitable kernel configuration, other authors [7] demonstrated that it is possible to achieve system latency responses between 10 and 100 microseconds.
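On the user-code side, a common first step is to run the time-critical process under a real-time scheduling policy on a fixed CPU. The sketch below shows the corresponding Linux system calls through Python's `os` module. The function name `try_make_realtime`, the priority 80 and the choice of CPU 0 are illustrative assumptions, not values from the original publication, and Python is used only to show the calls: deterministic user code would normally be written in C or C++.

```python
import os


def try_make_realtime(priority: int = 80, cpu: int = 0) -> bool:
    """Try to switch the calling process to SCHED_FIFO and pin it to one CPU.

    Returns True on success and False when the platform or the current
    permissions do not allow it (SCHED_FIFO usually needs root or CAP_SYS_NICE).
    """
    if not hasattr(os, "sched_setscheduler"):
        return False  # scheduler interface not available on this platform
    try:
        # pid 0 means "the calling process".
        os.sched_setscheduler(0, os.SCHED_FIFO, os.sched_param(priority))
        os.sched_setaffinity(0, {cpu})
    except OSError:
        return False  # includes PermissionError and invalid arguments
    return True
```

A caller would typically check the return value and refuse to start the control loop, or at least log a warning, when real-time scheduling could not be obtained.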

By default, DDS implementations use the Linux Networking Stack (LNS) as transport and network layer. This makes the LNS a critical part of ROS 2.0 performance. However, the network stack is optimized for throughput rather than for bounded latencies. In other words, there will be some limitations due to the current status of the networking stack. Nevertheless, the LNS provides QoS mechanisms and thread tuning options which make it possible to improve the determinism of critical traffic at the kernel level.
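One of the QoS hooks the Linux networking stack exposes to applications is the per-socket priority (`SO_PRIORITY`), which queueing disciplines can use to classify outgoing packets. The following is a minimal sketch, assuming a Linux host; the function name `open_prioritized_udp_socket` and the priority value 6 are illustrative, and this is not necessarily the mechanism configured in the original experiments.

```python
import socket

# Linux value of SO_PRIORITY, used only if this Python build does not expose it.
SO_PRIORITY = getattr(socket, "SO_PRIORITY", 12)


def open_prioritized_udp_socket(priority: int = 6) -> socket.socket:
    """Create a UDP socket whose outgoing packets carry the given priority.

    On Linux, values 0 to 6 can be set without special privileges.
    """
    sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    sock.setsockopt(socket.SOL_SOCKET, SO_PRIORITY, priority)
    return sock
```

The priority only has an effect if the egress queueing discipline of the network interface is configured to treat priorities differently.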

An important part of how packets are processed in the Linux kernel relates to how hardware interrupts are handled. In a normal Linux kernel, hardware interrupts are served in two phases. In the first phase, an Interrupt Service Routine (ISR) is invoked when the interrupt fires; it acknowledges the hardware interrupt and postpones the remaining work. In the second phase, the soft interrupt, or “bottom half”, is executed later to process the data coming from the hardware device. In PREEMPT-RT kernels, most ISRs are forced to run in threads specifically created for the interrupt. These threads are called IRQ threads [8]. By handling IRQs as kernel threads, PREEMPT-RT kernels allow IRQs to be scheduled like user tasks, so that the priority and CPU affinity of each one can be managed individually. IRQ handlers running in threads can themselves be interrupted, so the latency due to interrupts is mitigated. For our particular interests, since our application needs to send critical traffic, it is possible to set the priority of the Ethernet interrupt threads higher than that of other IRQ threads to improve network determinism.
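As a rough illustration of that last step, the sketch below looks for IRQ threads by name under `/proc` and raises their scheduling priority. It assumes that IRQ threads are named `irq/<number>-<device>` and that the device part contains the interface name (for example `eth0`), which depends on the driver; the function names and the priority 90 are hypothetical and not taken from the original publication.

```python
import os
from pathlib import Path
from typing import List, Tuple


def find_irq_threads(interface: str, proc_root: str = "/proc") -> List[Tuple[int, str]]:
    """Return (pid, name) for IRQ threads whose name mentions the interface."""
    found = []
    for entry in Path(proc_root).iterdir():
        if not entry.name.isdigit():
            continue  # only numeric entries are processes or kernel threads
        try:
            name = (entry / "comm").read_text().strip()
        except OSError:
            continue  # the task exited or is not readable
        if name.startswith("irq/") and interface in name:
            found.append((int(entry.name), name))
    return sorted(found)


def raise_irq_priority(interface: str, priority: int = 90) -> int:
    """Set SCHED_FIFO with the given priority on matching IRQ threads.

    Returns the number of threads that were changed (usually needs root).
    """
    changed = 0
    for pid, _name in find_irq_threads(interface):
        try:
            os.sched_setscheduler(pid, os.SCHED_FIFO, os.sched_param(priority))
            changed += 1
        except OSError:
            pass  # insufficient privileges or the thread disappeared
    return changed
```

The kernel truncates thread names in `comm` to 15 characters, so long device names may be cut; matching on a short interface name is a pragmatic choice rather than a robust one.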

Another important difference between a normal and a PREEMPT-RT kernel lies in the context in which soft IRQs are executed. From kernel version 3.6.1-rt1 onward, soft IRQ handlers are executed in the context of the thread that raised the soft IRQ [9]. Consequently, the NET_RX soft IRQ, which is the soft IRQ for receiving network packets, will normally be executed in the context of the network device IRQ thread. This allows fine control of the networking processing context. However, if the network IRQ thread is preempted or exhausts its NAPI weight time slice, the soft IRQ is executed in the ksoftirqd/n thread (where n is the logical number of the CPU).

Processing packets in the ksoftirqd/n context is troublesome for real-time because this thread is used by different processes for deferred work and can add latency. Also, as the ksoftirqd thread runs with the SCHED_OTHER policy, it can be easily preempted. In practice, the soft IRQs are normally executed in the context of the Ethernet IRQ threads, and they fall back to the ksoftirqd/n thread under high network loads and heavy stress (CPU, memory, I/O, etc.). The conclusion here is that, in normal conditions, we can expect reasonably deterministic behavior, but if the network and the system are loaded, the latencies can increase greatly.
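The scheduling policy of a given thread, such as ksoftirqd/n or an Ethernet IRQ thread, can be inspected from user space. The following is a minimal sketch using Python's `os` module on Linux; the function name `describe_scheduling` is hypothetical.

```python
import os

# Map the policy constants available on this platform to readable names.
_POLICY_NAMES = {
    getattr(os, name): name
    for name in ("SCHED_OTHER", "SCHED_FIFO", "SCHED_RR", "SCHED_BATCH", "SCHED_IDLE")
    if hasattr(os, name)
}


def describe_scheduling(pid: int = 0) -> str:
    """Return the scheduling policy and priority of a task (0 = calling process)."""
    policy = os.sched_getscheduler(pid)
    # The kernel may OR in a reset-on-fork flag; strip it before the lookup.
    policy &= ~getattr(os, "SCHED_RESET_ON_FORK", 0)
    priority = os.sched_getparam(pid).sched_priority
    return f"{_POLICY_NAMES.get(policy, str(policy))} prio={priority}"
```

Calling `describe_scheduling(pid)` with the PID of a ksoftirqd thread (found, for example, with `pgrep ksoftirqd`) lets one check which policy that thread is running with on a particular system.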

A sneak peek into the results

The experimental tests produced the following results. Details are available in the original publication:

Figure: Impact of RT settings under different system loads. a) System without additional load, without RT settings. b) System under load, without RT settings. c) System without additional load, with RT settings. d) System under load, with RT settings.

Read the full article at https://arxiv.org/pdf/1809.02595.pdf.

References

[1] M. Quigley, K. Conley, B. Gerkey, J. Faust, T. Foote, J. Leibs, R. Wheeler, and A. Y. Ng, “ROS: an open-source robot operating system,” in ICRA workshop on open source software, vol. 3, no. 3.2. Kobe, Japan, 2009, p. 5.

[2] C. S. V. Gutiérrez, L. U. S. Juan, I. Z. Ugarte, and V. M. Vilches, “Time-sensitive networking for robotics,” CoRR, vol. abs/1804.07643, 2018. [Online]. Available: http://arxiv.org/abs/1804.07643

[3] C. S. V. Gutiérrez, L. Usategui San Juan, I. Zamalloa Ugarte, and V. Mayoral Vilches, “Real-time Linux communications: an evaluation of the Linux communication stack for real-time robotic applications,” ArXiv e-prints, Aug. 2018.

[4] Y. Maruyama, S. Kato, and T. Azumi, “Exploring the performance of ROS2,” in 2016 International Conference on Embedded Software (EMSOFT), Oct 2016, pp. 1–10.

[5] “ROS 2.0 design,” http://design.ros2.org/, accessed: 2018-07-27.

[6] “Introduction to Real-time Systems,” http://design.ros2.org/articles/realtime_background.html, accessed: 2018-04-12.

[7] F. Cerqueira and B. B. Brandenburg, “A comparison of scheduling latency in Linux, PREEMPT RT, and LITMUS RT.”

[8] J. Edge, “Moving interrupts to threads,” October 2008, [Accessed: 2018-04-12]. [Online]. Available: https://lwn.net/Articles/302043/

[9] J. Corbet, “Software interrupts and realtime,” October 2012, [Accessed: 2018-04-12]. [Online]. Available: https://lwn.net/Articles/520076/