Load Testing for High-Load Distributed Systems

Written by boomeer | Published 2023/04/20
Tech Story Tags: software-testing | distributed-systems | kafka | software-development | staging-environments | load-test | highload | scalability

TL;DR: This article examines various load testing methods for high-load distributed systems using Kafka, discussing the pros and cons of staging, isolation, and emulation, and offering guidance for choosing the optimal approach.

We have a service that interacts with the external world and is under very high load (we estimated and prepared for 100 RPS). Let's call this service ServiceX, a kind of proxy-orchestrator. It communicates with a set of our internal services and also faces the external world.

We tested this service and thought we were ready for high loads, but that turned out not to be the case. Yes, we managed to save the service while under fire, and it took us less than two days to do so.

However, as a result, we recognized the need to be better prepared for such problems in advance, not when the service is already being bombarded with 100 RPS and the business is suffering from the malfunction.


The cornerstone of the problem

As it turned out, the write load we emulated accounts for only 5% of the load on ServiceX, and the rest of the write load comes later from other systems.

For better understanding, let's illustrate this schematically. In the diagram:

  • If an RO arrow enters - ServiceX reads from somewhere, i.e. the load falls on another service

  • If an RO arrow exits - it is a load on ServiceX itself

    • Green - emulated well
    • Yellow - "somehow emulated": covered on the side of other systems' tests
    • Red - not emulated at all

Why? Because after our tests, nothing actually went out into the external world from a business perspective. The service wrote to its database, a cron job cleared it, and nothing happened afterward.

Ultimately, the issue represented by the red arrow cost us dearly. As I already wrote, we fixed everything, but if we had known about this problem beforehand, we would have fixed it earlier and there would have been no incident at all. I will not go into detail in this article about how we fixed the service by changing the architecture and adding sharded PostgreSQL; this article is not about that. Instead, it is about how we can now be sure that we have actually become stronger, have fixed this bottleneck, and understand our available options.

A general list of reasons and problems

  1. For RW load, some "mocks" need to be created, diverting the write traffic somewhere "to the side", which is difficult.

  2. The code becomes overgrown with if-statements, making it more challenging to maintain.

  3. Representativeness is lost.

  4. As an alternative to the three points above, test data gets recorded next to real data, spreading across systems, breaking analytics, consuming space, etc.

As soon as the is_test_user() function appears in the code, most of your tests become insufficiently representative in one way or another. That's why, in a period of high load on ServiceX, we were not adequately prepared.
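To make the problem concrete, here is a minimal sketch of the kind of branching we mean. Every name in it (is_test_user, send_to_partner, publish, the topic names) is hypothetical and only illustrates the pattern, not our actual code.

```python
# Hypothetical illustration of the is_test_user() anti-pattern.

ORDERS_TOPIC = "orders"
TEST_SINK_TOPIC = "orders-test-sink"

def is_test_user(user: dict) -> bool:
    # The moment this predicate exists, production and test traffic diverge.
    return user.get("is_test", False)

def send_to_partner(order: dict) -> None:
    print(f"calling the external partner with order {order['id']}")

def publish(topic: str, order: dict) -> None:
    print(f"publishing order {order['id']} to {topic}")

def handle_order(order: dict, user: dict) -> None:
    if is_test_user(user):
        # Test traffic is diverted "to the side": the external call and the
        # downstream write load are never exercised, so the test no longer
        # represents the real path.
        publish(TEST_SINK_TOPIC, order)
        return
    send_to_partner(order)
    publish(ORDERS_TOPIC, order)

if __name__ == "__main__":
    handle_order({"id": 1}, {"is_test": True})
```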

Testing Strategies for Distributed Systems

Staging

  1. Firstly, staging is not representative. The formula "if we can handle 100 RPS on one server, we can handle 1000 RPS on 10 servers" does not work here. Why? At the very least because of the CAP theorem. An entire separate article could be written about this, but here we will just accept it as a fact: distributed systems do not scale linearly.
  2. A lot of hardware is needed, which is economically inefficient as it practically requires copying the production environment, which must also be maintained in the same way. So, staging is not a panacea.
  3. It is impossible to test geo-distribution.

Isolation

You need to create a framework that hooks into the service code and ensures that everything going out and coming in is emulated. That is, the real load comes in, but everything is closed off at a certain perimeter and does not go any further.

Formally, we describe the business logic in test cases, and ServiceX loads itself. A cleaner then wipes the databases according to a certain contract: it is essential that the test load actually writes to the databases.

Unfortunately, there is no alternative to this, as the database can very often become the bottleneck. At the same time, this should not be a separate for_tests database: that is simply not representative.
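As a rough illustration, here is a minimal sketch of such a perimeter, assuming a hypothetical outbound HTTP client through which ServiceX talks to the outside world; the class name and URL are made up.

```python
# A minimal sketch of an "isolation" perimeter for outbound calls.
import requests

PARTNER_URL = "https://partner.example.com/api/orders"  # hypothetical

class OutboundClient:
    """Everything that leaves the service goes through this perimeter."""

    def __init__(self, isolation_enabled: bool):
        self.isolation_enabled = isolation_enabled

    def post(self, url: str, payload: dict) -> dict:
        if self.isolation_enabled:
            # Requests are closed at the perimeter: nothing leaves the
            # service, and the response is emulated with a canned payload.
            return {"status": "ok", "emulated": True}
        resp = requests.post(url, json=payload, timeout=5)
        resp.raise_for_status()
        return resp.json()

# During a load test the client is constructed with isolation_enabled=True;
# database writes still happen for real and are wiped by the cleaner afterwards.
client = OutboundClient(isolation_enabled=True)
print(client.post(PARTNER_URL, {"order_id": 1}))
```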

Pros of such a solution

  • Can test write load
  • Isolated testing of each system

Cons of such a solution

  • Weak representativeness
  • Difficult to maintain in the code

However, of course, there are options for developing this path

  • What is the mocks layer? Is it a library/framework (something like reflection or runkit) that does not let test requests leave the service and substitutes the responses instead?

  • Is it some service discovery and/or load-balancing solution that sends test requests to special "mock" pods?

Probably, each business needs to answer this question independently, as this discussion thread depends on the specific architecture.

Emulation

In this method, ServiceX receives marked traffic (from ServiceA) and does not close it off in any way: the traffic travels further down the chain of systems, still marked. In this scenario, a common emulation layer carries such marked traffic and monitors it at the middleware level, similar to trace-context forwarding. The cleaner wipes the databases according to a certain contract, as in the example above.
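Here is a minimal sketch of how such a marker can be carried through a service, in the spirit of trace-context forwarding; the header name X-Load-Test and the function names are hypothetical.

```python
# A minimal sketch of propagating a test-traffic marker through a service.
import contextvars
from typing import Optional

TEST_MARKER_HEADER = "X-Load-Test"
_is_test_traffic = contextvars.ContextVar("is_test_traffic", default=False)

def on_incoming_request(headers: dict) -> None:
    # Middleware on the way in: remember whether this request is marked.
    _is_test_traffic.set(headers.get(TEST_MARKER_HEADER) == "1")

def outgoing_headers(base: Optional[dict] = None) -> dict:
    # Middleware on the way out: re-attach the marker to every downstream
    # call (HTTP request, Kafka message, etc.) so it survives the whole chain.
    headers = dict(base or {})
    if _is_test_traffic.get():
        headers[TEST_MARKER_HEADER] = "1"
    return headers

on_incoming_request({TEST_MARKER_HEADER: "1"})
print(outgoing_headers({"Content-Type": "application/json"}))
# -> {'Content-Type': 'application/json', 'X-Load-Test': '1'}
```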

Pros of such a solution

  • Can test write load
  • High representativeness

Cons of such a solution

  • Hard to maintain (although not as hard as Isolation)
  • All services must support it

However, there are options for developing this path here as well

  • What is the emulation layer? Is it a service discovery and/or load-balancing solution plus middleware that marks all test requests and checks that the markup is preserved, like tracing?

Most likely, the answer to this question also depends on the specific business.


Technical details

Our ServiceX receives most events through Kafka. What options are available for testing asynchronous interactions?

The simplest method

  1. There is a time window within which we can still guarantee the business process.
  2. Create a metric that tracks the volume of incoming messages from Kafka.
  3. Find out from the business stakeholders how long we can stop delivering these messages.
  4. Stop consuming from Kafka, let the lag accumulate, and measure how quickly the backlog is processed (see the sketch below).
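Measuring the backlog drain speed boils down to watching consumer lag over time. A minimal sketch using the kafka-python client is shown below; the broker address, topic, and consumer group are placeholders, and in practice you would read the same numbers from your Kafka monitoring.

```python
# A minimal sketch of sampling consumer lag while the backlog is drained.
import time
from kafka import KafkaConsumer, TopicPartition

BOOTSTRAP = "localhost:9092"   # hypothetical broker
TOPIC = "servicex-events"      # hypothetical topic
GROUP_ID = "servicex"          # hypothetical consumer group of ServiceX

consumer = KafkaConsumer(bootstrap_servers=BOOTSTRAP, group_id=GROUP_ID,
                         enable_auto_commit=False)

def total_lag() -> int:
    partitions = [TopicPartition(TOPIC, p)
                  for p in consumer.partitions_for_topic(TOPIC) or []]
    end_offsets = consumer.end_offsets(partitions)
    lag = 0
    for tp in partitions:
        committed = consumer.committed(tp) or 0
        lag += max(end_offsets[tp] - committed, 0)
    return lag

# The slope of this series is the processing speed of the accumulated backlog.
while True:
    print(time.strftime("%H:%M:%S"), "lag =", total_lag())
    time.sleep(1)
```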

Pros

  • Highly pertinent to the real data structure
  • Always tests maximum throughput

Cons

  • Always tests maximum throughput (an arbitrary load cannot be applied)
  • Can only be executed manually
  • May harm connected services
  • Not suitable for regular testing
  • SLA suffers for critical systems
  • Speed can be assessed using lag only

The second simple method

  1. Use the Kafka REST Proxy
  2. Or create a regular proxy, or a specially designed service with asynchronous interaction
  3. Any load can then be applied
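For example, with the Confluent Kafka REST Proxy (v2 API) the load generator becomes an ordinary HTTP client; the proxy address, topic, and payload below are placeholders.

```python
# A minimal sketch of pushing load through the Kafka REST Proxy (v2 API).
import requests

REST_PROXY = "http://localhost:8082"   # hypothetical REST proxy address
TOPIC = "servicex-events"              # hypothetical topic

def send_batch(records: list) -> None:
    resp = requests.post(
        f"{REST_PROXY}/topics/{TOPIC}",
        headers={"Content-Type": "application/vnd.kafka.json.v2+json"},
        json={"records": [{"value": r} for r in records]},
        timeout=5,
    )
    resp.raise_for_status()

# Any load profile can now be generated by an ordinary HTTP load tool or a
# simple loop, at the cost of the extra REST->Kafka hop.
send_batch([{"order_id": i, "is_test": True} for i in range(100)])
```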

Pros

  • Can apply any load
  • Can test on Production
  • Speed can be assessed via lag and USE metrics
  • The load is visible in the generator's RED metrics

Cons

  • High overhead (REST→Kafka)
  • The extra service layer may distort the results
  • Analysis complexity increases

Unified Kafka load generator

Neither of these options satisfied us, so we decided to implement our own solution, which we call a unified Kafka load generator.

What can this service do?

  1. Can pour any data into any topic at a given intensity

  2. Automatically sets test flags in the message headers

  3. Can consume statistics from a special topic and mix them into the results

  4. Allows loading an entire topology

  5. The result of the load run is visible in the generator's metrics

  6. Does not require special solutions for statistics aggregation
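The sketch below shows the core of such a generator in a few dozen lines, using the kafka-python producer; the broker address, topic, header name, and payload are illustrative only, not our actual implementation.

```python
# A minimal sketch of a unified Kafka load generator.
import json
import time
from kafka import KafkaProducer

BOOTSTRAP = "localhost:9092"         # hypothetical broker
TOPIC = "servicex-events"            # hypothetical topic
TEST_HEADER = ("X-Load-Test", b"1")  # test flag set on every generated message

producer = KafkaProducer(
    bootstrap_servers=BOOTSTRAP,
    value_serializer=lambda v: json.dumps(v).encode("utf-8"),
)

def generate(rate_per_sec: int, duration_sec: int) -> None:
    """Pour synthetic messages into the topic at a given intensity."""
    interval = 1.0 / rate_per_sec
    deadline = time.time() + duration_sec
    i = 0
    while time.time() < deadline:
        producer.send(TOPIC, value={"order_id": i, "synthetic": True},
                      headers=[TEST_HEADER])
        i += 1
        time.sleep(interval)
    producer.flush()

# e.g. 100 messages per second for one minute
generate(rate_per_sec=100, duration_sec=60)
```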


Conclusion

In this article, we have thoroughly examined various methods and strategies for performance testing in distributed systems, focusing on asynchronous interactions and the use of Kafka. We have described the pros and cons of staging, isolation, emulation, and other approaches, allowing specialists to choose the most suitable method for their projects.

We have also discussed the technical aspects and presented two simple ways to test asynchronous interactions, as well as our own solution: a unified load generator for Kafka, which offers flexible capabilities and advantages for testing.

It is important to remember that building and maintaining a performance testing system is a complex and ongoing process, and there is no universal solution that suits everyone. It is crucial to analyze the specifics of a particular business and system architecture in order to choose the optimal approach to performance testing.


Written by boomeer | Technical Project Manager
Published by HackerNoon on 2023/04/20