Date: 2025-02-14

Editor's note: This is Part 6 of 6 of a study detailing attempts to optimize layer-7 load balancing. Read the rest below.

6 Conclusion

We propose Laconic, which explores the potential of offloading L7 load balancing onto SmartNICs and addresses the associated challenges. In Laconic, we employ a lightweight network stack to avoid the costs of a heavy, feature-rich network stack. To minimize the cost of synchronization operations, we design lightweight synchronization mechanisms that enable higher concurrency and scalability. To further reduce the burden on the generic cores on the SmartNIC, we offload packet processing by effectively utilizing the hardware acceleration engines available on SmartNICs. This paper does not raise any ethical issues.

A Appendix

A.1 Throughput of Laconic using all cores

Figure 13 shows the throughput of Laconic on BlueField-2 with seven ARM cores [1]. The BlueField-2 is equipped with 2x100Gbps ports, and we enable both ports together to further stress the processing power of the ARM cores. For the baseline of Nginx running on commodity x86 servers, we use the 2x100Gbps ports of a ConnectX-5 NIC so that the link speed is the same. However, although the ConnectX-5 has 2x100Gbps ports, the PCIe 3.0 bottleneck on the testbed servers limits the achievable throughput to about 120 Gbps. As shown in Figure 13, Laconic with only seven NIC cores can achieve line rate [2] for large message sizes. For small message sizes, Laconic achieves performance comparable to Nginx running on x86. BlueField-2 has wimpy ARM cores, and Nginx running on ARM performs poorly, as expected. Throughput increases clearly for messages larger than 1M because Laconic offloads packet processing to the hardware engine for response messages larger than 1M. Other than the first few packets, all remaining packets in such a flow are processed solely by the hardware packet processing engine. This mitigates the head-of-line blocking problem caused by limited processing power and also frees CPU resources for small or future responses. Section 4.4 shows experiments with varying offloading thresholds.
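The size-based offloading described above can be illustrated with a minimal sketch. This is not Laconic's implementation: the names `OFFLOAD_THRESHOLD_BYTES`, `SOFTWARE_PREFIX_PACKETS`, `Flow`, `choose_path`, and `cpu_packet_share` are hypothetical, the threshold reads "1M" as 1 MiB, and the number of initial packets handled in software ("the first few") is arbitrarily set to 3.

```python
from dataclasses import dataclass

# Assumed values for illustration only; the text says "larger than 1M" and
# "the first few packets" without giving exact numbers.
OFFLOAD_THRESHOLD_BYTES = 1 << 20   # reading "1M" as 1 MiB (assumption)
SOFTWARE_PREFIX_PACKETS = 3         # "first few packets" (assumption)


@dataclass
class Flow:
    response_bytes: int      # total size of the response being relayed
    packets_seen: int = 0    # packets of this response handled so far


def choose_path(flow: Flow) -> str:
    """Return "cpu" or "hw" for the next packet of this flow."""
    large = flow.response_bytes > OFFLOAD_THRESHOLD_BYTES
    past_prefix = flow.packets_seen >= SOFTWARE_PREFIX_PACKETS
    path = "hw" if (large and past_prefix) else "cpu"
    flow.packets_seen += 1
    return path


def cpu_packet_share(response_bytes: int, mtu_payload: int = 1460) -> float:
    """Fraction of a response's packets that the general-purpose cores handle."""
    # Ceiling division; at least one packet so the ratio is always defined.
    total = max(1, -(-response_bytes // mtu_payload))
    flow = Flow(response_bytes)
    on_cpu = sum(choose_path(flow) == "cpu" for _ in range(total))
    return on_cpu / total
```

Under these assumptions, a 10 KB response is handled entirely by the cores, while for a 2 MiB response only the first three packets touch the cores. This is consistent with the jump in throughput once responses cross the threshold.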

Figure 14 shows the throughput of Laconic on a LiquidIO3 50G NIC with 24 ARM cores. Since the LiquidIO3 supports a link speed of only 50 Gbps, we also cap the x86 baselines at 50 Gbps for a fair comparison across all the experiments running on LiquidIO3. Nginx on the ARM cores of the SmartNIC (bar "LIO3 Nginx") has low throughput, plateauing at around 5 Gbps even with 24 cores. Laconic easily reaches the full NIC bandwidth with a 10K response size, whereas Nginx running on an x86 server needs 32 beefy cores to achieve comparable performance.

[1] BlueField-2 has eight available ARM cores in total. We reserve one core for the necessary background tasks.

[2] Since the same port on the SmartNIC receives both the requests from the clients and the responses from the backend servers, the theoretical upper limit for the load balancer’s goodput is lower than the 2x100Gbps line rate.
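A rough way to see the effect described in footnote [2] is the simplified model below. It is an illustrative assumption, not a calculation from the paper: the receive direction of the port carries one request and one response per transaction, so responses can occupy at most the fraction `response_bytes / (request_bytes + response_bytes)` of the line rate. Protocol headers, ACKs, and retransmissions are ignored, and the function name `goodput_upper_bound_gbps` is hypothetical.

```python
def goodput_upper_bound_gbps(line_rate_gbps: float,
                             request_bytes: int,
                             response_bytes: int) -> float:
    """Upper bound on response goodput when requests and responses
    share the receive direction of the same port (simplified model)."""
    if line_rate_gbps <= 0:
        raise ValueError("line_rate_gbps must be positive")
    if request_bytes < 0 or response_bytes <= 0:
        raise ValueError("request_bytes must be >= 0 and response_bytes > 0")
    # Share of received bytes that belongs to responses.
    response_share = response_bytes / (request_bytes + response_bytes)
    return line_rate_gbps * response_share
```

In this model the bound approaches the line rate as responses grow relative to requests, and stays strictly below it whenever requests carry any bytes.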

This paper is available on arXiv under the CC BY-NC-ND 4.0 DEED license.



