We have a service that interacts with the external world and is under very high load (we estimated and prepared for 100 RPS). Let's call this service `ServiceX`: a kind of proxy-orchestrator that communicates with a set of our internal services and looks out into the external world.

We tested this service and thought we were ready for high loads, but that turned out not to be the case. Yes, we managed to save the service while it was under fire, and it took us less than two days to do so. Still, the incident made it clear that we need to be prepared for such problems in advance, not at the moment when the service is being bombarded with 100 RPS and the business is suffering from the malfunction.

## The cornerstone of the problem

As it turned out, the write load we had emulated accounts for only 5% of the load on `ServiceX`; the rest of the write load arrives later from other systems.

For a better understanding, let's illustrate it schematically:

- If an RO arrow enters `ServiceX`, the service reads from somewhere, i.e. the load falls on another service.
- If an RO arrow exits, it is load on `ServiceX` itself.
- Green: we emulated this well.
- Yellow: "somehow emulated", on the side of the other systems' tests.
- Red: not emulated at all.

Why? Because after our tests, from a business perspective, nothing actually went out into the external world: the service wrote to its database, a cron job cleared it, and nothing happened afterward. Ultimately, the flow represented by the red arrow cost us dearly. As I already wrote, we fixed everything, but if we had known about this problem beforehand, there would have been no incident at all, because we would have fixed it earlier.

In this article I will not go into the details of how we fixed the service by changing the architecture and adding sharded PostgreSQL; this article is not about that. It is about how we can now be sure that we have actually become stronger, have fixed this bottleneck, and understand our available options.
## A general list of reasons and problems

- For RW load, some "mocks" need to be created to divert the write traffic somewhere "to the side", which is difficult.
- The code becomes overgrown with if-statements, making it harder to maintain.
- Representativeness is lost.
- As an alternative to the three scenarios above, test data ends up being recorded next to real data, spreading across systems, breaking analytics, consuming space, and so on.

As soon as an `is_test_user()` function appears in the code, most of your tests become insufficiently representative in one way or another. That is why, in a period of high load on `ServiceX`, we were not adequately prepared.

## Testing Strategies for Distributed Systems

### Staging

First of all, staging is not indicative. The formula "if we can handle 100 RPS on one server, we can handle 1000 RPS on ten servers" does not work here, at the very least because of the CAP theorem. An entire separate article could be written about this, but here we will simply accept it as a fact: distributed systems do not scale linearly.

Furthermore, a lot of hardware is needed, which is economically inefficient: it practically requires copying the production environment, and the copy must then be maintained in the same way. Finally, it is impossible to test geo-distribution. So staging is not a panacea.

### Isolation

You need to create a framework that can be embedded into the service code and ensures that everything going out (and coming back in) is emulated. That is, the real load comes in, but everything is closed off at a certain perimeter and goes no further. Formally, we describe the business logic in test cases, and `ServiceX` loads itself. A cleaner then purges the databases according to a certain contract: it is essential to actually write to the databases, and unfortunately there is no alternative to this, since the database is very often the bottleneck. At the same time, this should not be a separate `for_tests` database; that is simply not representative.
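To make the "perimeter" idea more concrete, here is a minimal sketch of how an outbound-call wrapper might short-circuit test traffic inside the service. All names here (`PerimeterClient`, the stub registry, the `is_test` flag) are illustrative assumptions, not the framework we actually built:

```python
# Hypothetical sketch of the isolation "perimeter": real load enters the
# service, but any outbound call made while handling a test-marked request
# is answered by a local stub instead of going to the network.

from dataclasses import dataclass, field
from typing import Callable, Dict, Optional

@dataclass
class PerimeterClient:
    # Maps an endpoint to a canned response used for test traffic.
    stubs: Dict[str, dict] = field(default_factory=dict)
    # Real transport (HTTP/RPC), used only for non-test traffic.
    transport: Optional[Callable[[str, dict], dict]] = None

    def call(self, endpoint: str, payload: dict, *, is_test: bool) -> dict:
        if is_test:
            # Test traffic never leaves the perimeter.
            return self.stubs.get(endpoint, {"status": "stubbed"})
        return self.transport(endpoint, payload)
```

A real implementation would hook this into the HTTP client or RPC layer once, so that application code stays free of the test-related if-statements criticized above.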
| Pros of such a solution | Cons of such a solution |
| --- | --- |
| Can test write load | Weak representativeness |
| Isolated testing of each system | Difficult to maintain in the code |

However, there are of course options for developing this path. What is the mocks layer?

- Is it a library/framework (something like `reflection` or `runkit`) that does not let test requests leave the service and substitutes the responses?
- Is it a service discovery and/or load-balancing solution that sends test requests to special "mock" pods?

Each business probably needs to answer this question independently, as the discussion depends on the specific architecture.

### Emulation

In this method, `ServiceX` receives marked traffic (from `ServiceA`), does not close it off in any way, and passes it on, still marked, further down the chain of systems. A common emulation layer conducts this marked traffic and monitors it at the middleware level, similar to trace-context forwarding. The cleaner purges the databases according to a certain contract, as in the example above.

| Pros of such a solution | Cons of such a solution |
| --- | --- |
| Can test write load | Hard to maintain (though not as hard as with Isolation) |
| High representativeness | All services must support it |

However, there are options for developing this path here as well. What is the emulation layer? Is it a service discovery and/or load-balancing solution plus middleware that marks all test requests and tracks the preservation of the marking, like tracing? Most likely, the answer to this question also depends on the specific business.

## Technical details

Our `ServiceX` receives most events through Kafka. What options are available for load-testing asynchronous interactions?

### The simplest method

There is a time window within which we can still guarantee the business process, so:

1. Create a metric that tracks the volume of incoming messages from Kafka.
2. Find out from the business stakeholders how long we may stop delivering these messages.
3. Stop consuming from Kafka, let the lag accumulate, then resume and measure the speed at which these messages are processed.
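The measurement step of this method boils down to simple arithmetic on lag samples. The sketch below only illustrates that calculation, with hypothetical names; in practice the lag figures would come from consumer-group offset metrics:

```python
# Illustrative sketch: once consumption resumes, sample the consumer-group
# lag periodically and estimate processing throughput from how fast the
# backlog drains. If producers keep writing during the drain, their rate
# must be added on top of the observed drain rate.

def processing_rate(samples: list[tuple[float, int]],
                    produce_rate: float = 0.0) -> float:
    """samples: (timestamp_seconds, total_lag_in_messages), oldest first.
    Returns the estimated messages processed per second while draining."""
    t0, lag0 = samples[0]
    t1, lag1 = samples[-1]
    if t1 <= t0:
        raise ValueError("need samples spanning a positive time interval")
    # Messages drained per second, plus whatever was produced meanwhile.
    return (lag0 - lag1) / (t1 - t0) + produce_rate
```

For example, a lag that drops from 100,000 to 40,000 messages over 10 seconds indicates a processing rate of 6,000 messages per second (plus the concurrent produce rate, if any).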
| Pros | Cons |
| --- | --- |
| Highly pertinent to the data structure | Always tests only the maximum throughput |
| | Can only be executed manually |
| | May harm connected services |
| | Not suitable for regular testing |
| | SLA suffers for critical systems |
| | Speed can be assessed using lag only |

### The second simple method

Use a Kafka REST proxy, or create a regular proxy or a specially designed service for asynchronous interaction, and apply the load through it.

| Pros | Cons |
| --- | --- |
| Can apply any load | High overhead (REST→Kafka) |
| Can test on production | The extra service layer may distort the outcomes |
| Speed can be assessed via lag and USE metrics | Analysis complexity increases |
| Load can be viewed via the generator's RED metrics | |

### Unified Kafka load generator

Neither of these options satisfied us, so we decided to implement our own solution, which we called a unified Kafka load generator. What can this service do?

- Pour any data into any topic at a given intensity
- Automatically set test flags in the message headers
- Consume statistics from a special topic and merge them with the results
- Allow loading an entire topology
- Show the results of a load run in the generator's own metrics
- Require no special solutions for statistics aggregation

## Conclusion

In this article, we have examined various methods and strategies for performance testing in distributed systems, focusing on asynchronous interactions and the use of Kafka. We have described the pros and cons of staging, isolation, emulation, and other approaches, so that practitioners can choose the method best suited to their projects. We have also covered the technical details, presented two simple ways to test asynchronous interactions, and introduced our own solution, a unified load generator for Kafka, which offers flexible capabilities for testing. It is important to remember that building and maintaining a performance-testing system is a complex and ongoing process, and there is no universal solution that suits everyone.
It is crucial to analyze the specifics of your particular business and system architecture in order to choose the best approach to performance testing.
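As a closing illustration, the pacing-and-marking core of a generator like the one described above might look like the following sketch. The `send` callback, the header name, and all other names are assumptions made for illustration; a real implementation would wrap an actual Kafka producer (for example, `confluent_kafka.Producer.produce`, which accepts a `headers` argument):

```python
# Sketch of the core of a "unified Kafka load generator": pour records into
# a topic at a fixed intensity, stamping each one with a test-flag header.
# The `send` callback stands in for a real Kafka producer.

import time
from typing import Callable

TEST_FLAG_HEADER = ("x-load-test", b"1")  # assumed header name

def generate_load(send: Callable[[str, bytes, list], None],
                  topic: str, rps: float, total: int,
                  sleep=time.sleep, clock=time.monotonic) -> int:
    """Send `total` messages to `topic` at roughly `rps` messages/sec."""
    interval = 1.0 / rps
    start = clock()
    for i in range(total):
        payload = f'{{"seq": {i}}}'.encode()
        send(topic, payload, [TEST_FLAG_HEADER])
        # Pace against the overall schedule, not the previous send,
        # so that slow sends do not accumulate drift.
        delay = start + (i + 1) * interval - clock()
        if delay > 0:
            sleep(delay)
    return total
```

Because the test flag travels in the message headers, downstream consumers and the statistics pipeline can recognize and segregate test traffic without any `is_test_user()`-style branches in business code.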