Load Testing for High-Load Distributed Systems

Written by boomeer | Published 2023/04/20
Tech Story Tags: software-testing | distributed-systems | kafka | software-development | staging-environments | load-test | highload | scalability

TL;DR: This article examines various load testing methods for high-load distributed systems using Kafka, discussing the pros and cons of staging, isolation, and emulation, and offering guidance for choosing the optimal approach.

We have a service that interacts with the external world and is under very high load (we estimated and prepared for 100 RPS). Let's call this service ServiceX, a kind of proxy-orchestrator. It communicates with a set of our internal services and also faces the external world.

We tested this service and thought we were ready for high loads, but that turned out not to be the case. Yes, we managed to save the service while under fire, and it took us less than two days to do so.

However, as a result, we recognized the need to be better prepared for such problems in advance, not when the service is already being bombarded with 100 RPS and the business is suffering from the malfunction.


The cornerstone of the problem

As it turned out, the write load we emulated accounts for only 5% of the load on ServiceX, and the rest of the write load comes later from other systems.

For better understanding, let's illustrate this schematically. In the diagram:

  • If an RO arrow enters - ServiceX reads from somewhere, i.e. the load falls on another service

  • If an RO arrow exits - it is a load on ServiceX itself

    • Green - emulated well
    • Yellow - "somehow emulated": covered on the side of other systems' tests
    • Red - not emulated at all

Why? Because after our tests, nothing actually went out into the external world from a business perspective. The service wrote to its database, a cron job cleared it, and nothing happened afterward.

Ultimately, the issue represented by the red arrow cost us dearly. As I already wrote, we fixed everything, but if we had known about this problem beforehand, we would have fixed it earlier and there would have been no incident at all. I will not go into detail in this article about how we fixed the service by changing the architecture and adding sharded PostgreSQL; this article is not about that. Instead, it is about how we can now be sure that we have actually become stronger, have fixed this bottleneck, and understand our available options.

A general list of reasons and problems

  1. For RW load, some "mocks" need to be created, diverting the write traffic somewhere "to the side", which is difficult.

  2. The code becomes overgrown with if-statements, making it more challenging to maintain.

  3. Representativeness is lost.

  4. As an alternative to the three points above, test data gets recorded next to real data, spreading across systems, breaking analytics, consuming space, etc.

As soon as the is_test_user() function appears in the code, most of your tests become insufficiently representative in one way or another. That's why, in a period of high load on ServiceX, we were not adequately prepared.
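To make the problem concrete, here is a minimal sketch of the kind of branching we mean. Every name in it (is_test_user, send_to_partner, publish, the topic names) is hypothetical and only illustrates the pattern, not our actual code.

```python
# Hypothetical illustration of the is_test_user() anti-pattern.

ORDERS_TOPIC = "orders"
TEST_SINK_TOPIC = "orders-test-sink"

def is_test_user(user: dict) -> bool:
    # The moment this predicate exists, production and test traffic diverge.
    return user.get("is_test", False)

def send_to_partner(order: dict) -> None:
    print(f"calling the external partner with order {order['id']}")

def publish(topic: str, order: dict) -> None:
    print(f"publishing order {order['id']} to {topic}")

def handle_order(order: dict, user: dict) -> None:
    if is_test_user(user):
        # Test traffic is diverted "to the side": the external call and the
        # downstream write load are never exercised, so the test no longer
        # represents the real path.
        publish(TEST_SINK_TOPIC, order)
        return
    send_to_partner(order)
    publish(ORDERS_TOPIC, order)

if __name__ == "__main__":
    handle_order({"id": 1}, {"is_test": True})
```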

Testing Strategies for Distributed Systems

Staging

  1. Firstly, staging is not representative. The formula "if we can handle 100 RPS on one server, we can handle 1000 RPS on 10 servers" does not work here. Why? At the very least because of the CAP theorem. An entire separate article could be written about this, but here we will just accept it as a fact: distributed systems do not scale linearly.
  2. A lot of hardware is needed, which is economically inefficient as it practically requires copying the production environment, which must also be maintained in the same way. So, staging is not a panacea.
  3. It is impossible to test geo-distribution.

Isolation

You need to create a framework that hooks into the service code and ensures that everything going out and coming in is emulated. That is, the real load comes in, but everything is closed off at a certain perimeter and does not go any further.

Formally, we describe the business logic in test cases, and ServiceX loads itself. A cleaner then wipes the databases according to a certain contract: it is essential that the test load actually writes to the databases.

Unfortunately, there is no alternative to this, as the database can very often become the bottleneck. At the same time, this should not be a separate for_tests database: that is simply not representative.
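As a rough illustration, here is a minimal sketch of such a perimeter, assuming a hypothetical outbound HTTP client through which ServiceX talks to the outside world; the class name and URL are made up.

```python
# A minimal sketch of an "isolation" perimeter for outbound calls.
import requests

PARTNER_URL = "https://partner.example.com/api/orders"  # hypothetical

class OutboundClient:
    """Everything that leaves the service goes through this perimeter."""

    def __init__(self, isolation_enabled: bool):
        self.isolation_enabled = isolation_enabled

    def post(self, url: str, payload: dict) -> dict:
        if self.isolation_enabled:
            # Requests are closed at the perimeter: nothing leaves the
            # service, and the response is emulated with a canned payload.
            return {"status": "ok", "emulated": True}
        resp = requests.post(url, json=payload, timeout=5)
        resp.raise_for_status()
        return resp.json()

# During a load test the client is constructed with isolation_enabled=True;
# database writes still happen for real and are wiped by the cleaner afterwards.
client = OutboundClient(isolation_enabled=True)
print(client.post(PARTNER_URL, {"order_id": 1}))
```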

Pros of such a solution

  • Can test write load
  • Isolated testing of each system

Cons of such a solution

  • Weak representativeness
  • Difficult to maintain in the code

However, of course, there are options for developing this path

  • What is the mocks layer? Is it a library/framework (something like reflection or runkit) that does not let test requests leave the service and substitutes the responses instead?

  • Is it some service discovery and/or load-balancing solution that sends test requests to special "mock" pods?

Probably, each business needs to answer this question independently, as this discussion thread depends on the specific architecture.

Emulation

In this method, ServiceX receives marked traffic (from ServiceA) and does not close it off in any way: the traffic travels further down the chain of systems, still marked. In this scenario, a common emulation layer carries such marked traffic and monitors it at the middleware level, similar to trace-context forwarding. The cleaner wipes the databases according to a certain contract, as in the example above.
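Here is a minimal sketch of how such a marker can be carried through a service, in the spirit of trace-context forwarding; the header name X-Load-Test and the function names are hypothetical.

```python
# A minimal sketch of propagating a test-traffic marker through a service.
import contextvars
from typing import Optional

TEST_MARKER_HEADER = "X-Load-Test"
_is_test_traffic = contextvars.ContextVar("is_test_traffic", default=False)

def on_incoming_request(headers: dict) -> None:
    # Middleware on the way in: remember whether this request is marked.
    _is_test_traffic.set(headers.get(TEST_MARKER_HEADER) == "1")

def outgoing_headers(base: Optional[dict] = None) -> dict:
    # Middleware on the way out: re-attach the marker to every downstream
    # call (HTTP request, Kafka message, etc.) so it survives the whole chain.
    headers = dict(base or {})
    if _is_test_traffic.get():
        headers[TEST_MARKER_HEADER] = "1"
    return headers

on_incoming_request({TEST_MARKER_HEADER: "1"})
print(outgoing_headers({"Content-Type": "application/json"}))
# -> {'Content-Type': 'application/json', 'X-Load-Test': '1'}
```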

Pros of such a solution

  • Can test write load
  • High representativeness

Cons of such a solution

  • Hard to maintain (although not as hard as Isolation)
  • All services must support it

However, there are options for developing this path here as well

  • What is the emulation layer? Is it a service discovery and/or load-balancing solution plus middleware that marks all test requests and checks that the markup is preserved, like tracing?

Most likely, the answer to this question also depends on the specific business.


Technical details

Our ServiceX receives most events through Kafka. What options are available for testing asynchronous interactions?

The simplest method

  1. There is a time window within which we can still guarantee the business process.
  2. Create a metric that tracks the volume of incoming messages from Kafka.
  3. Find out from the business stakeholders how long we can stop delivering these messages.
  4. Stop consuming from Kafka, let the lag accumulate, and measure how quickly the backlog is processed (see the sketch below).
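Measuring the backlog drain speed boils down to watching consumer lag over time. A minimal sketch using the kafka-python client is shown below; the broker address, topic, and consumer group are placeholders, and in practice you would read the same numbers from your Kafka monitoring.

```python
# A minimal sketch of sampling consumer lag while the backlog is drained.
import time
from kafka import KafkaConsumer, TopicPartition

BOOTSTRAP = "localhost:9092"   # hypothetical broker
TOPIC = "servicex-events"      # hypothetical topic
GROUP_ID = "servicex"          # hypothetical consumer group of ServiceX

consumer = KafkaConsumer(bootstrap_servers=BOOTSTRAP, group_id=GROUP_ID,
                         enable_auto_commit=False)

def total_lag() -> int:
    partitions = [TopicPartition(TOPIC, p)
                  for p in consumer.partitions_for_topic(TOPIC) or []]
    end_offsets = consumer.end_offsets(partitions)
    lag = 0
    for tp in partitions:
        committed = consumer.committed(tp) or 0
        lag += max(end_offsets[tp] - committed, 0)
    return lag

# The slope of this series is the processing speed of the accumulated backlog.
while True:
    print(time.strftime("%H:%M:%S"), "lag =", total_lag())
    time.sleep(1)
```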

Pros

  • Highly pertinent to the real data structure
  • Always tests maximum throughput

Cons

  • Always tests maximum throughput (an arbitrary load cannot be applied)
  • Can only be executed manually
  • May harm connected services
  • Not suitable for regular testing
  • SLA suffers for critical systems
  • Speed can be assessed using lag only

The second simple method

  1. Use the Kafka REST Proxy
  2. Or create a regular proxy, or a specially designed service with asynchronous interaction
  3. Any load can then be applied
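For example, with the Confluent Kafka REST Proxy (v2 API) the load generator becomes an ordinary HTTP client; the proxy address, topic, and payload below are placeholders.

```python
# A minimal sketch of pushing load through the Kafka REST Proxy (v2 API).
import requests

REST_PROXY = "http://localhost:8082"   # hypothetical REST proxy address
TOPIC = "servicex-events"              # hypothetical topic

def send_batch(records: list) -> None:
    resp = requests.post(
        f"{REST_PROXY}/topics/{TOPIC}",
        headers={"Content-Type": "application/vnd.kafka.json.v2+json"},
        json={"records": [{"value": r} for r in records]},
        timeout=5,
    )
    resp.raise_for_status()

# Any load profile can now be generated by an ordinary HTTP load tool or a
# simple loop, at the cost of the extra REST->Kafka hop.
send_batch([{"order_id": i, "is_test": True} for i in range(100)])
```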

Pros

  • Can apply any load
  • Can test on Production
  • Speed can be assessed via lag and USE metrics
  • The load is visible in the generator's RED metrics

Cons

  • High overhead (REST→Kafka)
  • The extra service layer may distort the results
  • Analysis complexity increases

Unified Kafka load generator

Neither of these options satisfied us, so we decided to implement our own solution, which we call a unified Kafka load generator.

What can this service do?

  1. Can pour any data into any topic at a given intensity

  2. Automatically sets test flags in the message headers

  3. Can consume statistics from a special topic and mix them into the results

  4. Allows loading an entire topology

  5. The result of the load run is visible in the generator's metrics

  6. Does not require special solutions for statistics aggregation
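The sketch below shows the core of such a generator in a few dozen lines, using the kafka-python producer; the broker address, topic, header name, and payload are illustrative only, not our actual implementation.

```python
# A minimal sketch of a unified Kafka load generator.
import json
import time
from kafka import KafkaProducer

BOOTSTRAP = "localhost:9092"         # hypothetical broker
TOPIC = "servicex-events"            # hypothetical topic
TEST_HEADER = ("X-Load-Test", b"1")  # test flag set on every generated message

producer = KafkaProducer(
    bootstrap_servers=BOOTSTRAP,
    value_serializer=lambda v: json.dumps(v).encode("utf-8"),
)

def generate(rate_per_sec: int, duration_sec: int) -> None:
    """Pour synthetic messages into the topic at a given intensity."""
    interval = 1.0 / rate_per_sec
    deadline = time.time() + duration_sec
    i = 0
    while time.time() < deadline:
        producer.send(TOPIC, value={"order_id": i, "synthetic": True},
                      headers=[TEST_HEADER])
        i += 1
        time.sleep(interval)
    producer.flush()

# e.g. 100 messages per second for one minute
generate(rate_per_sec=100, duration_sec=60)
```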


Conclusion

In this article, we have thoroughly examined various methods and strategies for performance testing in distributed systems, focusing on asynchronous interactions and the use of Kafka. We have described the pros and cons of staging, isolation, emulation, and other approaches, allowing specialists to choose the most suitable method for their projects.

We have also discussed the technical aspects and presented two simple ways to test asynchronous interactions, as well as our own solution: a unified load generator for Kafka, which offers flexible capabilities and advantages for testing.

It is important to remember that building and maintaining a performance testing system is a complex and ongoing process, and there is no universal solution that suits everyone. It is crucial to analyze the specifics of a particular business and system architecture in order to choose the optimal approach to performance testing.


Written by boomeer | Technical Project Manager
Published by HackerNoon on 2023/04/20