We have a service that interacts with the external world and carries a heavy load (we estimated and prepared for 100 RPS). Let's call it ServiceX: a kind of proxy-orchestrator that talks to a set of our internal services and also looks out into the external world.
We load-tested this service and thought we were ready for high traffic, but that turned out not to be the case. We did manage to save the service while it was under fire, and it took us less than two days to do so.
However, as a result, we recognized the need to prepare for such problems in advance, not at the moment when the service is being bombarded with 100 RPS and the business is suffering from the outage.
As it turned out, the write load we emulated accounts for only 5% of the load on ServiceX, and the rest of the write load comes later from other systems.
For a better understanding, let's illustrate it schematically:
- If an RO arrow enters, ServiceX reads from somewhere, so the load falls on another service.
- If an RO arrow exits, it is a load on ServiceX itself.
Why? Because after our tests nothing, from a business perspective, actually went out into the external world: the service wrote to its database, a cron job cleared it, and nothing happened beyond that.
Ultimately, the issue represented by the red arrow cost us dearly. As I already wrote, we fixed everything, but had we known about this bottleneck beforehand, we would have fixed it earlier and there would have been no incident at all. I will not go into how we fixed the service by changing the architecture and adding sharded PostgreSQL – this article is not about that. Instead, it is about how we can now be sure that we have actually become stronger, have fixed this bottleneck, and understand what options are available to us.
- For RW load, some "mocks" need to be created that divert the write traffic somewhere "to the side", which is difficult.
- The code becomes overgrown with if-statements, making it harder to maintain.
- Representativeness is lost.

As an alternative to the three scenarios above, test data gets recorded next to real data, spreading across systems, breaking analytics, consuming space, and so on.
As soon as an `is_test_user()` function appears in the code, most of your tests become insufficiently representative in one way or another. That is why, when the high load hit ServiceX, we were not adequately prepared.
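To make the problem concrete, here is a deliberately simplified sketch (all function, table, and ID names are hypothetical, not taken from our codebase) of how a test-user check spreads through business logic and creates paths that real traffic never exercises:

```python
# Hypothetical sketch: a test-user flag scattered through business logic.
# Every branch guarded by is_test_user() is a path that real traffic never
# takes, which is exactly what erodes representativeness.

TEST_USER_IDS = {900_000_001, 900_000_002}  # assumption: a known set of test accounts

def is_test_user(user_id: int) -> bool:
    return user_id in TEST_USER_IDS

def place_order(user_id: int, order: dict) -> None:
    if is_test_user(user_id):
        print("order stored in a shadow table:", order)   # writes diverted "to the side"
    else:
        print("order stored in the main table:", order)
        print("external partner notified")                # only real traffic leaves the perimeter

    if not is_test_user(user_id):
        print("analytics event emitted")                  # yet another if-statement

if __name__ == "__main__":
    place_order(900_000_001, {"sku": "A-1"})  # test traffic: half the code path is skipped
    place_order(42, {"sku": "A-1"})           # real traffic
```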
You need to create a framework that embeds into the service code and ensures that everything going out (and coming back in) is emulated. That is, the real load comes in, but it is contained within a certain perimeter and travels no further.
Formally, we describe the business logic in test cases, and ServiceX loads itself. The cleaner then purges the databases according to a defined contract – it is essential that the tests actually write to the databases.
Unfortunately, there is no alternative to this, as the database very often becomes the bottleneck. At the same time, this should not be a separate `for_tests` database – that is simply not representative.
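One possible shape of such a cleaner contract – purely a sketch, assuming a hypothetical `is_load_test` marker column and a one-hour retention window, not our actual schema – is a periodic job that deletes marked rows from every table that receives test writes:

```python
# Sketch of a "cleaner" that removes load-test data from the real database
# by contract. Assumptions: ServiceX stores data in PostgreSQL, every table
# that receives test writes has an is_load_test boolean column, and test
# rows may live for at most one hour.
import psycopg2  # pip install psycopg2-binary

CLEAN_CONTRACT = {
    "orders":   "created_at < now() - interval '1 hour'",
    "payments": "created_at < now() - interval '1 hour'",
}

def clean(dsn: str) -> None:
    with psycopg2.connect(dsn) as conn:
        with conn.cursor() as cur:
            for table, age_condition in CLEAN_CONTRACT.items():
                cur.execute(f"DELETE FROM {table} WHERE is_load_test AND {age_condition}")
                print(f"{table}: removed {cur.rowcount} test rows")

if __name__ == "__main__":
    clean("postgresql://servicex@localhost:5432/servicex")  # hypothetical DSN
```

The important property is that test data is written to the same tables as real data and only then removed, so the write path stays representative.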
| Pros of such a solution | Cons of such a solution |
|---|---|
| Can test write load | Weak representativeness |
| Isolated testing of each system | Difficult to maintain in the code |
However, there are, of course, options for developing this path:
- What is the mocks layer? Is it a library/framework (something like `reflection` or `runkit`) that does not let test requests leave the service and substitutes the responses instead?
- Is it a service discovery and/or load-balancing solution that sends test requests to special "mock" pods?
Probably, each business needs to answer this question for itself, since the answer depends on the specific architecture.
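As an illustration of the first option – a minimal sketch with assumed names (`PartnerClient`, an `is_load_test` flag), not our actual implementation – the outbound client can return a canned response whenever the incoming request was marked as test traffic, so nothing leaves the perimeter:

```python
# Sketch of a "mocks layer" inside the service: outbound calls are
# short-circuited for test traffic instead of reaching the external world.
import requests

class PartnerClient:
    def __init__(self, base_url: str):
        self.base_url = base_url

    def send_order(self, order: dict, is_load_test: bool) -> dict:
        if is_load_test:
            # The request never leaves the service: substitute a canned,
            # realistically shaped response.
            return {"status": "accepted", "partner_id": "mock", "order": order}
        resp = requests.post(f"{self.base_url}/orders", json=order, timeout=5)
        resp.raise_for_status()
        return resp.json()
```

The second option moves the same idea out of the code: the client always makes a real network call, and routing (service discovery or the load balancer) sends marked requests to mock pods instead.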
In this method, ServiceX receives marked traffic (from ServiceA), does not close it off in any way, and passes it on – still marked – further down to the other systems. A common emulation layer carries this marked traffic and monitors it at the middleware level, much like trace propagation. The cleaner purges the databases according to a defined contract, as in the example above.
| Pros of such a solution | Cons of such a solution |
|---|---|
| Can test write load | Hard to maintain (though not as with Isolation) |
| High representativeness | All services must support it |
However, there are options for developing this path here as well:
What is the emulation layer? Is it a service discovery and/or load-balancing solution plus middleware that marks all test requests and verifies that the marking is preserved, much like tracing?
Most likely, the answer to this question also depends on the specific business.
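Whatever form it takes, the core of such a layer is marker propagation. Below is a minimal sketch, assuming a hypothetical `X-Load-Test` header, of middleware-style code that copies the marker from an incoming request onto every outgoing one, the same way trace context is propagated:

```python
# Sketch of an emulation-layer middleware: an incoming test marker is
# detected and copied onto every outgoing request, so downstream systems
# can recognize and clean up test traffic. The header name is an assumption.
import requests

LOAD_TEST_HEADER = "X-Load-Test"

def is_marked(incoming_headers: dict) -> bool:
    return incoming_headers.get(LOAD_TEST_HEADER) == "1"

def forward(url: str, payload: dict, incoming_headers: dict) -> requests.Response:
    outgoing_headers = {}
    if is_marked(incoming_headers):
        # Preserve the marker so every downstream system can apply its own
        # cleaner and separate test traffic in its dashboards.
        outgoing_headers[LOAD_TEST_HEADER] = "1"
    return requests.post(url, json=payload, headers=outgoing_headers, timeout=5)
```

Monitoring then boils down to checking, at the middleware level, that a request marked on entry is still marked on exit – exactly what tracing systems already do with trace IDs.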
Our ServiceX receives most events through Kafka. What options are available for testing asynchronous interactions?
| Pros | Cons |
|---|---|
| Highly pertinent to data structure | Always testing maximum throughput |
| Always testing maximum throughput | Can only be executed manually |
| | May harm connected services |
| | Not suitable for regular testing |
| | SLA suffers for critical systems |
| | Assess speed using lag only |
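The last con in the table above – speed can only be assessed through consumer lag – can be made concrete with a small sketch. Assuming a hypothetical topic and consumer group, and the kafka-python client, the drain rate is estimated by sampling total lag over time:

```python
# Sketch: assess consumer speed using lag only. We periodically compute the
# total lag of a consumer group and derive the drain rate from its change.
# Topic name, group id, and broker address are hypothetical.
import time
from kafka import KafkaConsumer, TopicPartition  # pip install kafka-python

def total_lag(topic: str, group_id: str, bootstrap: str = "localhost:9092") -> int:
    consumer = KafkaConsumer(bootstrap_servers=bootstrap, group_id=group_id)
    partitions = [TopicPartition(topic, p) for p in consumer.partitions_for_topic(topic)]
    end_offsets = consumer.end_offsets(partitions)
    lag = 0
    for tp in partitions:
        committed = consumer.committed(tp) or 0
        lag += end_offsets[tp] - committed
    consumer.close()
    return lag

if __name__ == "__main__":
    previous = total_lag("orders.created", "servicex-consumers")
    for _ in range(6):  # sample for one minute
        time.sleep(10)
        current = total_lag("orders.created", "servicex-consumers")
        print(f"lag={current}, drain rate ~ {(previous - current) / 10:.1f} msg/s")
        previous = current
```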
| Pros | Cons |
|---|---|
| Can apply any load | High overhead (REST→Kafka) |
| Can test on Production | Extra service layer may harm outcomes |
| | Assess speed via lag and USE metrics |
| | View load on RED metrics by generator |
| | Analysis complexity increases |
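The cons above describe an extra service layer that converts REST into Kafka so that a conventional HTTP load generator can drive the topic. A minimal sketch of such a bridge (the endpoint path, header name, and broker address are assumptions) might look like this:

```python
# Sketch of a REST-to-Kafka bridge: an ordinary HTTP load generator applies
# any load profile to this endpoint, and each request is republished to the
# target topic with a test-flag header.
from flask import Flask, request
from kafka import KafkaProducer  # pip install flask kafka-python

app = Flask(__name__)
producer = KafkaProducer(bootstrap_servers="localhost:9092")

@app.route("/produce/<topic>", methods=["POST"])
def produce(topic: str):
    producer.send(
        topic,
        value=request.get_data(),
        headers=[("x-load-test", b"1")],  # mark the event as test traffic
    )
    return "", 202

if __name__ == "__main__":
    app.run(port=8080)
```

The bridge itself is the source of the REST→Kafka overhead listed in the cons, so its own latency and RED metrics have to be factored out when interpreting results.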
Neither of these options satisfied us, so we decided to implement our own solution, which we call the unified Kafka load generator.
What can this service do?
- Pour any data into any topic with a given intensity (see the sketch after this list)
- Automatically set test flags in the message header
- Consume statistics from a special topic and mix them with the results
- Load the topology
- Show the result of the load in the generator metrics
- Require no special solutions for statistics aggregation
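The generator itself is out of scope for this article, but its core loop is simple. Here is a minimal sketch – not the actual implementation – of pouring data into a topic at a given intensity with a test flag in the header; the flag name and the naive fixed-interval pacing are assumptions:

```python
# Minimal sketch of the core loop of a Kafka load generator: publish a
# payload to a topic at a target rate and mark every message as test traffic.
import time
from kafka import KafkaProducer  # pip install kafka-python

def run(topic: str, payload: bytes, rps: int, duration_s: int,
        bootstrap: str = "localhost:9092") -> None:
    producer = KafkaProducer(bootstrap_servers=bootstrap)
    interval = 1.0 / rps
    deadline = time.monotonic() + duration_s
    sent = 0
    while time.monotonic() < deadline:
        producer.send(topic, value=payload, headers=[("x-load-test", b"1")])
        sent += 1
        time.sleep(interval)  # naive pacing; a real generator corrects for drift
    producer.flush()
    print(f"sent {sent} messages to {topic} at a target of {rps} RPS")

if __name__ == "__main__":
    run("orders.created", b'{"sku": "A-1"}', rps=100, duration_s=10)
```

Consuming statistics from a special topic and loading a topology of several interdependent topics are left out of this sketch.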
In this article, we have thoroughly examined various methods and strategies for performance testing in distributed systems, focusing on asynchronous interactions and the use of Kafka. We have described the pros and cons of staging, isolation, emulation, and other approaches, allowing specialists to choose the most suitable method for their projects.
We have also discussed the technical aspects and presented two simple ways to test asynchronous interactions, as well as our own solution – a unified load generator for Kafka – which offers flexible capabilities and advantages for testing.
It is important to remember that building and maintaining a performance-testing system is a complex and ongoing process, and there is no universal solution that suits everyone. It is crucial to analyze the specifics of a particular business and its system architecture in order to choose the most suitable approach to performance testing.