Mitigating DDoS attacks in the wild requires various techniques to be tested and mastered. Hardware and software network solutions need to be tested in artificial environments close to real-life ones, with massive traffic streams imitating attacks. Without such experience, one would never learn the specific capabilities and limitations of every sophisticated tool.
In this article, we describe some of the traffic generation methods used in Qrator Labs.
We strongly advise every reader not to attempt any offensive use of the tools described in this research. Organizing DoS attacks is prosecuted by law and can lead to lengthy imprisonment. Qrator Labs responsibly conducts all tests within an isolated laboratory environment.
The challenging problem in our field is to saturate a 10G Ethernet interface with small packets, i.e., to deal with 14.88 Mpps (millions of packets per second). Here and below, only the smallest Ethernet packets of 64 bytes are considered, since we are interested in the maximum packet rate. A simple reckoning shows that we have only about 67 nanoseconds to process a single packet. For comparison, this value is close to the time required to fetch a piece of data from main memory after a cache miss on a modern CPU. Things become much more complicated if we try to saturate 40G or 100G Ethernet interfaces.
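The reckoning above is easy to reproduce: on the wire, each minimum-size 64-byte frame is accompanied by an 8-byte preamble and a 12-byte inter-frame gap, i.e., 84 bytes (672 bits) per packet:

```python
# Theoretical peak packet rates for minimum-size Ethernet frames.
# On the wire, a 64-byte frame carries an extra 8-byte preamble/SFD
# and a 12-byte inter-frame gap: 84 bytes = 672 bits per packet.
WIRE_BITS = (64 + 8 + 12) * 8  # 672 bits

def peak_mpps(link_bps: float) -> float:
    """Theoretical peak rate in millions of packets per second."""
    return link_bps / WIRE_BITS / 1e6

def ns_per_packet(link_bps: float) -> float:
    """Time budget for a single packet, in nanoseconds."""
    return 1000.0 / peak_mpps(link_bps)

for gbps in (10, 40, 100):
    print(f"{gbps}G: {peak_mpps(gbps * 1e9):.2f} Mpps, "
          f"{ns_per_packet(gbps * 1e9):.1f} ns per packet")
# 10G: 14.88 Mpps, 67.2 ns per packet
# 40G: 59.52 Mpps, 16.8 ns per packet
# 100G: 148.81 Mpps, 6.7 ns per packet
```

The same arithmetic yields the per-port figures for the 40G and 100G NICs discussed further below.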
Since a typical data plane passes from a userspace application through the kernel to the NIC, the first and most straightforward idea to improve network performance is to implement packet generation directly in the kernel. An example of such a solution is the pktgen kernel module. This approach does help to improve performance, but it is not flexible enough; moreover, any change in kernel source code requires a longer implement-load-test loop, thus lowering programmer productivity (i.e., time and effort spent by a programmer).
Another approach is to provide direct access from userspace to the buffers mapped into the NIC. This way is more complicated and less flexible, but it is worth trying in order to achieve higher performance. Examples are netmap, PF_RING, and DPDK.
Another reasonable but expensive way to achieve high performance is to use specialized hardware, for example, Ixia.
There are also solutions based on DPDK that use scripting, thus giving some flexibility in controlling the generator's parameters and varying the issued packets at runtime. Below we describe our experience with one such tool, MoonGen.
Main MoonGen features are:
1. DPDK userspace dataflow processing, which is where all the high performance comes from;
2. a Lua stack with user-friendly scripts on top and bindings to the C-based DPDK on the backend;
3. thanks to the JIT, scripts written in Lua run quite fast, contrary to what people usually expect from a scripting language.
MoonGen may be treated as a Lua wrapper around the DPDK library that exposes DPDK operations to the user's Lua interface.
MoonGen is a scriptable high-speed packet generator built on DPDK. A Lua script controls the whole load generator: a user-provided script crafts all packets that are further sent. Thanks to the incredibly fast LuaJIT VM and the packet processing library DPDK, it can saturate a 10 Gbps Ethernet link with 64 Byte packets while using only a single CPU core. MoonGen can achieve this rate even if a Lua script modifies each packet. It does not rely on tricks like replaying the same buffer.
MoonGen can also receive packets, e.g., to check which packets are dropped by a system under test. As the reception is also entirely under control of the user’s Lua script, it can be used to implement advanced test scripts. E.g., one can use two instances of MoonGen that establish a connection with each other. This setup can be used to benchmark middle-boxes like firewalls.
MoonGen builds upon two main components: DPDK and Lua.
DPDK is the Data Plane Development Kit that consists of libraries to accelerate packet processing workloads running on a wide variety of CPU architectures.
In a world where the network is becoming fundamental to the way people communicate, performance, throughput, and latency are increasingly important for applications like wireless core and access, wireline infrastructure, routers, load balancers, firewalls, video streaming, VoIP, and more.
DPDK is a lightweight and ubiquitous way of building your tests and scripts. Userspace dataflow is not something we often see: usually, an application communicates with network hardware through the OS kernel stack, which is the opposite of how DPDK operates.
In general, Lua strives to provide simple, flexible meta-features that can be extended as needed, rather than supply a feature-set specific to one programming paradigm. As a result, the base language is light — the full reference interpreter is only about 180 kB compiled — and easily adaptable to a broad range of applications.
Lua is a dynamically typed language intended for use as an extension or scripting language and is compact enough to fit on a variety of host platforms. It supports only a small number of atomic data structures such as boolean values, numbers (double-precision floating point by default), and strings. Typical data structures such as arrays, sets, lists, and records can be represented using Lua’s single native data structure, the table, which is a heterogeneous associative array.
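As a quick illustration, the single table type covers all of these roles:

```lua
-- Lua tables serving as array, record, and mixed associative array
local arr = { 10, 20, 30 }                      -- array (1-based indexing)
local rec = { vendor = "Mellanox", ports = 2 }  -- record
local mix = { "first", rate = 14.88 }           -- heterogeneous
print(arr[1])     --> 10
print(rec.ports)  --> 2
print(mix.rate)   --> 14.88
```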
Lua can be JIT (just-in-time) compiled by LuaJIT, so, despite being a scripting language, it achieves performance comparable to compiled languages like C.
Being an anti-DDoS company, Qrator Labs needs to develop, modernize, and verify its protective solutions. To test those, some sort of traffic generator imitating real attacks is needed. However, it is not easy to imitate a dangerous but straightforward flood attack at OSI layers L2 and L3, since achieving high performance in traffic generation is tricky.
In other words, it is quite natural for a DDoS mitigation company to simulate various DoS attacks within an isolated laboratory environment to learn the real-life behavior of different hardware setups.
MoonGen is a way to generate NIC line-rate traffic with a minimum number of CPU cores. The userspace dataflow increases the performance of this (MoonGen + DPDK) stack dramatically compared to many other ways of generating large amounts of traffic. Using pure DPDK requires more effort, so one shouldn't be surprised by our workflow optimization efforts. We also maintain a clone of the original MoonGen repository in order to extend its functionality and implement some specific tests.
To achieve maximum flexibility, the packet generation logic is described by user-defined Lua scripts, which is one of the main features of MoonGen. In the case of relatively simple packet mangling, this solution turns out to be fast enough to saturate a 10G interface with a single CPU core. A typical way of mangling incoming packets and crafting new ones is to deal with packets of the same type and vary a number of their fields.
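To give a feel for the script-driven design, a minimal generator script looks roughly like the sketch below. This is a simplified, hypothetical example modeled on the scripts shipped with MoonGen; module and function names may differ between versions:

```lua
-- Hypothetical minimal MoonGen script (API names approximate)
local mg     = require "moongen"
local memory = require "memory"
local device = require "device"

function master(args)
    -- configure port 0 and start a generator task on its first TX queue
    local dev = device.config{ port = 0 }
    device.waitForLinks()
    mg.startTask("loadTask", dev:getTxQueue(0))
    mg.waitForTasks()
end

function loadTask(queue)
    -- pre-fill a mempool with template packets
    local mempool = memory.createMemPool(function(buf)
        buf:getUdpPacket():fill{ pktLength = 60 }
    end)
    local bufs = mempool:bufArray()
    while mg.running() do
        bufs:alloc(60)
        for _, buf in ipairs(bufs) do
            -- vary a field per packet, e.g. the UDP source port
            buf:getUdpPacket().udp:setSrcPort(1000 + math.random(1000))
        end
        queue:send(bufs)
    end
end
```

The per-packet loop in `loadTask` is exactly the place where the JIT-compiled Lua keeps the mangling cheap.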
Consider the l3-tcp-syn-ack-flood example described below. Note that any packet-mangling operation may be performed in the same buffer where the incoming or previously generated packet resides. Such mangling operations are very fast, since they do not require expensive actions like system calls, access to potentially uncached memory, and so on.
Qrator Labs maintains a laboratory with various hardware; among the NICs used and tested there are the Mellanox ConnectX-4 and ConnectX-5 cards discussed below.
Note that when we deal with network interface controllers beyond 10G, the performance problem becomes even more pressing. Nowadays it seems impossible to saturate a 40G interface with a single CPU core, though a few cores are sufficient to do so.
For Mellanox NICs, one can tune appliance settings using the manufacturer's tuning guides to achieve higher performance or, where needed, to alter NIC behavior. Other NIC manufacturers provide similar documents for high-performance devices; even if you cannot find one, you may contact the company directly. In our case, Mellanox answered quickly, helping us achieve 100% bandwidth utilization in the tasks we needed.
The l3-tcp-syn-ack-flood example imitates a SYN flood attack. It is an extended version of l3-tcp-syn-flood from the central MoonGen repository and is being developed by Qrator Labs in a cloned repository. Our test can perform three kinds of activities.
Consider, for example, the inner-loop boilerplate code that crafts ACK replies.
Generally, the idea of crafting a reply is the following. First, extract a packet from the RX queue, then check whether the received packet is of the expected type. If so, prepare the answer by modifying some fields of the original packet. Finally, place the forged packet into the TX queue, reusing the same buffer. To improve performance, instead of manipulating packets one by one, we aggregate them: we grab all available packets from the RX queue, craft the respective answers, and put them into the TX queue. Despite the relatively high number of operations per packet, the performance remains sufficient, since LuaJIT compiles all those operations into a few CPU instructions. Plenty of other tests, not only TCP SYN/ACK, are implemented in the same manner.
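The aggregated receive-mangle-send loop described above can be sketched as follows. This is a hypothetical fragment in the spirit of the MoonGen Lua API, not the actual test code; the field-swapping step is elided:

```lua
-- Hypothetical sketch of the reply-crafting loop (API names approximate)
local mg     = require "moongen"
local memory = require "memory"

function replyTask(rxQueue, txQueue)
    local bufs = memory.bufArray()
    while mg.running() do
        -- grab every packet currently available in the RX queue
        local rx = rxQueue:tryRecv(bufs)
        for i = 1, rx do
            local pkt = bufs[i]:getTcpPacket()
            if pkt.tcp:getSyn() == 1 then
                -- forge the ACK reply in place, reusing the same buffer:
                -- swap MAC/IP addresses and ports, update seq/ack numbers,
                -- replace the SYN flag with ACK
            end
        end
        -- put the forged packets straight back into the TX queue
        txQueue:sendN(bufs, rx)
    end
end
```

In the real test, packets that fail the type check would be dropped rather than echoed back; the sketch omits that bookkeeping for brevity.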
The table below shows the results of the SYN flood test (SYN generation only, without replying) run on a Mellanox ConnectX-4. This NIC has two 40G ports; its theoretical peak performance is 59.52 Mpps for a single port but only 2 × 50 Mpps for both ports, a restriction imposed by the NIC's PCIe connection (2 × 50 instead of the expected 2 × 59.52).
The next table shows the results of the same SYN flood test on a Mellanox ConnectX-5 with a single 100G port.
Note that in all those cases more than 96% of the theoretical peak performance is reached with just a few CPU cores.
Another example is rx-to-pcap, which saves all incoming traffic into a number of PCAP files. Though this test is not about packet generation, it demonstrates that the weakest link in such a data flow is the file system: even the virtual tmpfs file system slows down the stream. In this case, 8 CPU cores are needed to handle 14.88 Mpps, while a single core is sufficient to receive (and drop or redirect) the same amount of traffic.
The following table shows the amount of traffic (in Mpps) received and saved into PCAP files residing either on an ext2 file system on a solid-state disk (second column) or on the tmpfs file system (third column).
We have also introduced an extension to MoonGen that provides an alternative way to start a group of tasks. The idea is to separate the general configuration from task-specific options in order to run an arbitrary number of various tasks (i.e., Lua scripts) simultaneously. The implementation is available and described in Qrator's repository clone; let us recap it here.
The newer tman CLI allows starting a number of various tasks at once; ./build/tman -h gives the synopsis and detailed help.
Ordinary task Lua files, however, are incompatible with the tman interface; a task file for tman has to define a specific set of objects.
See examples in examples/tman/.
Using the task manager gives more flexibility in running heterogeneous tasks.
The MoonGen approach has turned out to be quite satisfactory for our goals: it delivers high performance while keeping the tests, written in a scripting language, simple. The performance is achieved mainly due to two features: direct access to NIC buffers and the JIT compilation of Lua.
It is usually possible to achieve the theoretical peak performance of a NIC. A single core may be sufficient to saturate a 10G port, while a few cores may be enough to saturate a 100G port.
We thank the Mellanox team for the collaboration and the MoonGen team for their prompt bug fixing.