In 1967, Melvin Conway advanced a famous thesis. It gets a mention in almost every guide to building a microservice architecture, and for good reason: more than one generation of developers has seen it confirmed.
But suppose a company's communication structure changes as the business grows and enters new international markets. In that case, the product itself must change to meet users' needs in those markets.
In the summer of 2020, our team was faced with preparing a product for entry into international markets. At that time, we had an extensive set of microservices designed and run in line with the company's outdated organizational structure. Maintaining this ragbag of services was difficult and expensive. What's more, many of the services bore no connection to the business requirements or technological trends of the day.
Confronting these new realities, we knew we needed to:
● simplify the current architecture by getting rid of redundant data transfer between services
● improve the transit time of data to the end user
● reformat teams to develop and maintain the new solution
Our Solution
Over the years, .NET became the company's main back-end stack: we built up a significant code base, libraries for common functionality, and expertise in C# development. So, we wanted to make the most of that experience during the development phase and not switch to a new stack unnecessarily.
We started writing a service using .NET Core 3.1 and later migrated to .NET 5 quite quickly and painlessly. All our services are hosted using Kubernetes. The correct configuration of CI/CD falls only partially on the shoulders of the development team—the main approaches and templates for this are developed and supported by a separate DevOps team.
Using Kafka, we transfer data between our services as messages containing a complete state snapshot of a particular business entity. At the same time, snapshots change at very different rates. For example, taxonomy (data on championships, sports events, players) changes every few seconds, while operational data (market coefficients) changes several times per second.
Based on this, one of the main requirements for the new service was the ability to support high loads when delivering data to clients. It was designed from the start to avoid the overhead of external caches or databases. On startup, the service reads data from Kafka topics into RAM, performs aggregation, and distributes updates (in fact, the difference between already-sent and newly received data) reactively using Rx.NET, pushing them over sockets to the front ends after pre-serializing the data into a binary format with MessagePack. SignalR provides the framework for working with sockets.
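A condensed sketch of that pipeline is shown below. It is illustrative rather than our production code: the MarketSnapshot model, topic, group, and hub names are invented, the payload is assumed to be MessagePack-encoded, and error handling, subscription management, and diff computation (covered later) are omitted.

```csharp
using System;
using System.Reactive.Linq;
using System.Reactive.Subjects;
using System.Threading;
using Confluent.Kafka;
using MessagePack;
using Microsoft.AspNetCore.SignalR;

[MessagePackObject]
public class MarketSnapshot
{
    [Key(0)] public string MarketId { get; set; }
    [Key(1)] public decimal Coefficient { get; set; }
}

public class MarketsHub : Hub { }

public class SnapshotPump
{
    private readonly Subject<MarketSnapshot> _snapshots = new Subject<MarketSnapshot>();
    private readonly IHubContext<MarketsHub> _hub;

    public SnapshotPump(IHubContext<MarketsHub> hub) => _hub = hub;

    public void Start(CancellationToken ct)
    {
        var config = new ConsumerConfig
        {
            BootstrapServers = "kafka:9092",
            GroupId = "markets-reader",
            AutoOffsetReset = AutoOffsetReset.Earliest // read the compacted topic from the start
        };

        // Read Kafka into RAM and push every snapshot into an Rx stream.
        new Thread(() =>
        {
            using var consumer = new ConsumerBuilder<string, byte[]>(config).Build();
            consumer.Subscribe("markets");
            while (!ct.IsCancellationRequested)
            {
                var result = consumer.Consume(ct);
                if (result.Message.Value is null) continue; // tombstone, see Issue №1
                _snapshots.OnNext(MessagePackSerializer.Deserialize<MarketSnapshot>(result.Message.Value));
            }
        }) { IsBackground = true }.Start();

        // Fan updates out to subscribed clients as MessagePack-serialized batches.
        _snapshots
            .Buffer(TimeSpan.FromMilliseconds(100), 50) // batch by time or by count
            .Where(batch => batch.Count > 0)
            .Subscribe(batch =>
            {
                var payload = MessagePackSerializer.Serialize(batch);
                // Fire-and-forget for brevity; production code needs error handling.
                _ = _hub.Clients.Group("markets").SendAsync("Update", payload, ct);
            });
    }
}
```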
Issue №1: Memory
Outdated data in Kafka has to be cleared so that only the data the service actually needs at startup (the current taxonomy and trading data) is read. Otherwise, we would be forced to go through each topic from the earliest offsets, deciding for every offset whether the data is still relevant and whether it should be loaded into memory, which is unnecessary overhead.
However, as Kafka is essentially a commit log and not a random-access database, simply writing "Delete * From" won't work (without additional services from Confluent). Therefore, the service that delivers data into Kafka should, once an entity loses relevance, publish a tombstone: a message with the entity's key and a null value, telling Kafka the data is no longer relevant and can be "buried" when retention policies run. Current data should also be compacted, because when operating on snapshots we are only interested in the last state of each entity. To achieve this, specify "compact" as the cleanup (retention) policy in the topic settings.
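Sketched below with the Confluent.Kafka client: creating a compacted topic and publishing a tombstone (a message whose value is null) for an entity that has lost relevance. Topic names, partition counts, and keys are illustrative.

```csharp
using System.Collections.Generic;
using System.Threading.Tasks;
using Confluent.Kafka;
using Confluent.Kafka.Admin;

public static class KafkaRetention
{
    public static async Task CreateCompactedTopicAsync(string bootstrapServers)
    {
        using var admin = new AdminClientBuilder(
            new AdminClientConfig { BootstrapServers = bootstrapServers }).Build();

        await admin.CreateTopicsAsync(new[]
        {
            new TopicSpecification
            {
                Name = "markets",
                NumPartitions = 12,
                ReplicationFactor = 3,
                // Keep only the latest snapshot per key; tombstoned keys are removed entirely.
                Configs = new Dictionary<string, string> { ["cleanup.policy"] = "compact" }
            }
        });
    }

    public static Task PublishTombstoneAsync(IProducer<string, byte[]> producer, string entityKey)
    {
        // A message with a key and a null value marks the entity for removal during compaction.
        return producer.ProduceAsync("markets",
            new Message<string, byte[]> { Key = entityKey, Value = null });
    }
}
```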
Another memory issue is the suboptimal use of data aggregations. For example, to perform a join in Rx.NET, the data on both the left and the right sides should be pre-filtered to avoid an unnecessary waste of resources, as in the sketch below.
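A minimal illustration of the pattern with invented stream shapes: cheap Where() filters on both sides keep irrelevant items out of the join windows, so fewer objects are held in memory while the join is open.

```csharp
using System;
using System.Reactive.Linq;

public static class JoinExample
{
    public static IObservable<(string EventId, decimal Coefficient)> LiveMarketsWithTaxonomy(
        IObservable<(string EventId, bool IsLive)> taxonomy,
        IObservable<(string EventId, decimal Coefficient, bool IsActive)> markets)
    {
        // Filter both sides before joining so the join only ever buffers relevant items.
        var liveEvents = taxonomy.Where(t => t.IsLive);
        var activeOdds = markets.Where(m => m.IsActive);

        return liveEvents
            .Join(activeOdds,
                _ => Observable.Timer(TimeSpan.FromSeconds(5)),  // how long a taxonomy item stays joinable
                _ => Observable.Timer(TimeSpan.FromSeconds(1)),  // how long a market update stays joinable
                (t, m) => (Taxonomy: t, Market: m))
            .Where(x => x.Taxonomy.EventId == x.Market.EventId)   // Rx joins by time window, so match keys here
            .Select(x => (x.Taxonomy.EventId, x.Market.Coefficient));
    }
}
```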
It's also worth noting the benefits of keeping aggregates rather than raw data in memory. The LOH is used more efficiently: fewer objects get there, and the smaller they are, the lower the chance of excessive heap fragmentation and an OutOfMemory exception when allocating a large array. Fragmentation can bite when executing ToList() on an IEnumerable: it looks as though there is still enough memory, yet due to LOH fragmentation the system cannot allocate a contiguous block of the required size. To check heap fragmentation since the last garbage collection, call GC.GetGCMemoryInfo() and look at the FragmentedBytes field. It shows the number of fragmented bytes in the heap, for example "fragmentedBytes": 1669556304.
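For example, a quick way to inspect this at runtime:

```csharp
using System;

public static class HeapDiagnostics
{
    public static void LogFragmentation()
    {
        // Snapshot of the GC state since the last collection.
        GCMemoryInfo info = GC.GetGCMemoryInfo();

        // A large FragmentedBytes value can explain OutOfMemoryException
        // even when total free memory looks sufficient.
        Console.WriteLine($"fragmentedBytes: {info.FragmentedBytes}, heapSizeBytes: {info.HeapSizeBytes}");
    }
}
```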
The axiomatic rule is to avoid superfluous allocations altogether: passing a single enumerator of a thread-safe collection to each consumer is cheaper than making copies of the collection for multithreaded processing.
Issue №2: Performance
As noted, performance and the ability to distribute data to clients with an average latency under 200 ms are essential for the service. To achieve this, only updates need to be sent to clients: the difference between the previous and current state of the data. We use a structure of the Diff<T> type, where T is the ViewModel, in which each field is filled only with the changed data. This approach saves on serialization and channel bandwidth but requires more CPU to compute the difference for each object sent. In addition, updates are combined into batches, so the client receives not every minor update but a buffered batch, flushed either when it reaches a certain size or after a certain amount of time (see the sketch below).
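A minimal sketch of the idea, assuming a hypothetical MarketView model; the real Diff<T> contract is richer, but the principle is the same: nullable fields carry only what changed, and diffs are buffered by time or count before being sent.

```csharp
using System;
using System.Collections.Generic;
using System.Reactive.Linq;

// ViewModel with nullable fields so that "unchanged" can be expressed as null.
public class MarketView
{
    public string MarketId { get; set; }
    public decimal? Coefficient { get; set; }
    public bool? Suspended { get; set; }
}

// Diff<T>: only the fields of T that changed since the last sent state are populated.
public class Diff<T>
{
    public string Id { get; set; }
    public T Changes { get; set; }
}

public static class DiffCalculator
{
    public static Diff<MarketView> Between(MarketView previous, MarketView current) =>
        new Diff<MarketView>
        {
            Id = current.MarketId,
            Changes = new MarketView
            {
                Coefficient = current.Coefficient != previous.Coefficient ? current.Coefficient : null,
                Suspended   = current.Suspended   != previous.Suspended   ? current.Suspended   : null
            }
        };

    // Flush a batch every 100 ms or every 50 diffs, whichever comes first.
    public static IObservable<IList<Diff<MarketView>>> Batch(IObservable<Diff<MarketView>> diffs) =>
        diffs.Buffer(TimeSpan.FromMilliseconds(100), 50).Where(b => b.Count > 0);
}
```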
Garbage collector settings have a very significant impact on performance. To optimize memory usage, strive to use the LOH as little as possible and, where it cannot be avoided, minimize the size of the objects that end up there.
By default, objects of 85,000 bytes or more go to the LOH, but you can experiment with raising this threshold. For example, if you are sure your objects don't exceed 120,000 bytes and you can't make them any smaller, you can try increasing the System.GC.LOHThreshold parameter to an appropriate value.
Server garbage collection in concurrent mode also provides a significant advantage when optimizing data latency. Enabling the corresponding parameters, <ConcurrentGarbageCollection>true</ConcurrentGarbageCollection> and <ServerGarbageCollection>true</ServerGarbageCollection>, improved latency by about 20% compared to ServerGarbageCollection = false, albeit at the cost of a slight increase in CPU usage.
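For reference, a minimal sketch of the project-file settings discussed above; the values shown are examples, not a recommendation:

```xml
<!-- GC flags from the csproj; both are discussed in the text above. -->
<PropertyGroup>
  <ServerGarbageCollection>true</ServerGarbageCollection>
  <ConcurrentGarbageCollection>true</ConcurrentGarbageCollection>
</PropertyGroup>
```

The LOH threshold itself is not an MSBuild property; it can be raised through runtimeconfig.template.json, for example "configProperties": { "System.GC.LOHThreshold": 120000 }.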
The transition from .NET Core 3.1 to .NET 5.0, which was relatively easy (an update of assemblies and some serialization edits), improved service performance by about 15% without any code changes. Accordingly, periodically updating the frameworks and libraries in use brings benefits at minimal developer cost, although there may be exceptions. Yes, writing tests and keeping them up to date requires team resources. Still, tests let you estimate how new features or fresh optimizations will affect the service's behavior in production and how much its memory and CPU requirements will change.
Issue №3: Compatibility
Front-ends (Android/iOS/Web) subscribe to changes to operational data over a set of identifiers using web sockets. We use SignalR on the back end as a "wrapper" over web sockets. However, at that time we didn't find any SignalR client libraries for Android and iOS that were optimized and efficient enough for our requirements.
Therefore, the front ends had to fully support the features of SignalR's binary protocol: account for keep-alive heartbeats (ping messages), decode several entities packed into one message by reading the length of each data subset (payload) encoded as a VarInt, and so on. This added some work, but thanks to the well-written SignalR documentation, our teams were able to implement and fully support a custom transport library on the front-end side.
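As an illustration of the framing, here is a minimal VarInt length-prefix decoder of the kind such a custom client transport needs. It is a sketch, not the SignalR implementation itself.

```csharp
using System;

public static class VarIntFrame
{
    // Reads a length prefix encoded 7 bits per byte, with the most significant bit
    // set while more bytes follow; advances 'offset' past the prefix.
    public static int ReadLength(ReadOnlySpan<byte> buffer, ref int offset)
    {
        int length = 0, shift = 0;
        byte current;
        do
        {
            current = buffer[offset++];
            length |= (current & 0x7F) << shift;
            shift += 7;
        } while ((current & 0x80) != 0);
        return length;
    }
}
```

A client loops over the received buffer: read a length, slice that many bytes as one payload, deserialize it, and repeat until the buffer is exhausted.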
To describe contracts and the types being passed, we had to generate a custom JSON scheme. It contains additional information about whether a field is optional, the list of valid values for enums that are sent to the front end as ints, and other descriptors that help the front ends create subscriptions and parse the received data. When contracts need to change, a new subscription version is created that supports the changes. Until all of our client applications migrate to the new subscriptions, we have to maintain both subscription versions simultaneously.
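A purely illustrative fragment of what such a scheme can look like; the field names, enum values, and descriptor keys below are invented for the example and are not our production format:

```json
{
  "MarketUpdate": {
    "coefficient": { "type": "number", "optional": false },
    "status": {
      "type": "int",
      "optional": true,
      "enum": { "0": "Open", "1": "Suspended", "2": "Closed" }
    }
  }
}
```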
Issue №4: Monitoring
Effective monitoring of the service requires logs and metrics, of course. Everything here is more or less standard: ELK for logs and Prometheus/Grafana for collecting metrics and monitoring them in near real time. The more metrics you measure, the better; that's the rule. Even a metric such as the absence of new data for some time can indicate that something is wrong: problems with data mirroring between Kafka clusters, for example, or the service stalling due to an unhandled error in an Rx.NET subscription.
We also consider the time it took for a message to reach our service, across all previous hops, an essential parameter. It allows us to evaluate the message's latency both within our service and throughout the entire system. The metric showing the difference between batches generated for sending to clients and batches already disposed lets us spot memory leaks and evaluate the efficiency of the mechanism that fans data out to different subscribers.
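A sketch of those two metrics, assuming the prometheus-net client library; the metric names and the ProducedAtUtc field are illustrative:

```csharp
using Prometheus;

public static class ServiceMetrics
{
    // Time from the moment a message was produced upstream until it was handled here.
    public static readonly Histogram MessageLatency = Metrics.CreateHistogram(
        "feed_message_latency_seconds",
        "End-to-end latency of a message before it reached this service.");

    // Created vs disposed batches should track each other; a growing gap hints at a leak.
    public static readonly Gauge BatchesInFlight = Metrics.CreateGauge(
        "feed_batches_in_flight",
        "Batches generated for clients that have not been disposed yet.");
}

// Usage inside the pipeline (illustrative):
// ServiceMetrics.MessageLatency.Observe((DateTime.UtcNow - message.ProducedAtUtc).TotalSeconds);
// ServiceMetrics.BatchesInFlight.Inc();  // when a batch is created
// ServiceMetrics.BatchesInFlight.Dec();  // when it is disposed
```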
Issue №5: Scaling and Hosting
To use hosting resources efficiently while avoiding performance drops when a large number of connections arrive at prime time, autoscaling needs to be as flexible as possible. It should respond to metrics that let you balance resource utilization against good data latency for the end user. Candidate scaling metrics include CPU load, the ratio of the signal flow rate from Kafka at the input to the socket throughput at the output, and the amount of memory consumed. Degradation in any of these can signal to the orchestrator that it's time to bring up another service instance.
Moreover, the memory metric is an inflexible indicator due to the non-deterministic nature of GC memory cleaning. You can experiment with the System.GC.HighMemoryPercent parameter to make the GC work more aggressively, triggering not at 90% of occupied memory (the default) but at 80%. This allows autoscaling to use resources more efficiently.
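A minimal sketch of that setting, placed in runtimeconfig.template.json (its contents are merged into the generated runtimeconfig.json at build time); 80 is just the example value mentioned above:

```json
{
  "configProperties": {
    "System.GC.HighMemoryPercent": 80
  }
}
```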
We continue to face some difficulties, but we can already draw several conclusions from the work on memory, performance, compatibility, monitoring, and scaling described above.
We hope that our experience will help you to prepare for the challenges of building high-load services. Good luck!