
Switching from Elastic Stack to Grafana: A Cost-Cutting Success Story

by Andrii Chepik, August 2nd, 2023

Too Long; Didn't Read

An experienced DevOps engineer shares the story of migrating logging and tracing services from Elastic Stack to Grafana Stack. The team faced challenges with the previous setup due to growing data volume and resource consumption, so they switched to Grafana Loki for logging and Tempo for tracing. Benefits included reduced resource consumption, no license costs, simpler customization, and unified monitoring. Challenges included ensuring performance, optimizing caching, and achieving the desired query speeds. Overall, the migration brought a significant cost reduction and a better-organized system, though improvements are still ongoing.


Hi! I am an experienced DevOps engineer, and I decided to participate in the contest by HackerNoon and Aptible.


I want to share the story of migrating logging and tracing services from Elastic Stack to Grafana Stack and what came out of it. Before the migration, my team used a fairly classic setup:


  • Logstash → Elasticsearch → Kibana for logs
  • Jaeger → Elasticsearch → Kibana (Jaeger UI) for traces


This is a common setup among projects, and it suited us for the first year and a half of the project's life. But as time passed, microservices proliferated like mushrooms after rain, and the volume of client requests grew. We had to expand the resources of the logging and tracing systems more and more often: ever more storage and computing power was required. On top of that, the X-Pack license pushed the price even higher. When problems with licensing and access to Elastic's products began to appear, it became clear that we couldn't go on living like this.


While searching for the best solution, we tried different combinations of components, wrote Kubernetes operators, and ran into our fair share of trouble along the way. In the end, the schemes took the following form:


  • Vector → Loki → Grafana
  • Jaeger → Tempo → Grafana


This is how we managed to unite the three most important aspects of monitoring (metrics, logs, and traces) in one Grafana workspace and get several benefits from it. The main ones are:


  • Reduced storage footprint for the same amount of data
  • Reduced computational resources needed to run the system
  • No need to purchase a license
  • Free access to the product
  • Straightforward customization of the autoscaling mechanism


I hope this article will be helpful both for those who are just choosing a logging/tracing system and for those facing similar difficulties.


Migration preconditions: logging

It was 2022, and as mentioned above, we were using a fairly standard centralized logging scheme.




About 1 TB of logs accumulated per day. The cluster consisted of about 10 Elasticsearch data nodes, and an X-Pack license was purchased (mainly for domain authorization and alerting). Applications were mostly deployed in a Kubernetes cluster. Fluent Bit was used to ship their logs, with Kafka as a buffer and a Logstash pool for each namespace.


While operating the system, we encountered various problems. Some were solved quite easily, some could only be worked around, and some could not be solved at all. The second and third groups together prompted us to search for another solution. Let me first list the four most significant of these problems.


Loss of logs during collection

Strangely enough, the first component that started having problems was Fluent Bit. From time to time, it stopped sending logs for individual Kubernetes pods. Analyzing debug logs, tuning buffers, and updating the version did not produce the desired effect. Vector was chosen as a replacement. As it turned out later, it had similar problems, but they were fixed in version 0.21.0.


Workarounds with DLQ

The next annoyance was having to resort to workarounds when enabling the DLQ (dead letter queue) on Logstash. Logstash does not know how to rotate logs that end up in this queue. The almost official workaround is to simply restart the instance once the queue reaches a threshold volume. This did not affect the system negatively, since Kafka was used as the input and the service shut down gracefully, but it was painful to watch the ever-growing pod restart counters, and the restarts sometimes masked other problems.
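For context, the DLQ is controlled by a couple of settings in logstash.yml; a short sketch with illustrative values:


dead_letter_queue.enable: true
# Once the queue reaches this size, new entries are no longer written to it;
# there is no built-in rotation, hence the restart workaround described above.
dead_letter_queue.max_bytes: 1024mb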


Painful alert writing

Describing alerting rules was not very convenient. You can click through the Kibana web interface, but defining them as code, as in Prometheus, is far more convenient. The syntax is rather non-obvious and unpleasant to work with.


Resource consumption

But the main problems were related to the ever-increasing resource consumption and, as a result, the cost of the system. The greediest components turned out to be Logstash and Elasticsearch, since the JVM is notoriously hungry for memory.


Tracing

Jaeger was used for centralized trace collection. It sent data via Kafka to a separate Elasticsearch cluster. The traces were not getting any smaller: we had to scale the system to accommodate hundreds of gigabytes of traces daily.


The scheme looked like this:


Another common inconvenience was using different web interfaces for different aspects of monitoring:


  • Grafana - for metrics
  • Kibana - for logs
  • Jaeger UI - for traces


Of course, Grafana allows both Elasticsearch and Jaeger to be connected as data sources, but the need to juggle different query syntaxes remains.


With these premises, our search for new logging and tracing solutions began. We did not conduct any comparative analysis of different systems; most serious products on the market require licensing. So we decided to go with the open-source project Grafana Loki, which had already been successfully adopted by a number of companies.


Migration to Loki

So, here is why Loki was chosen:


  • It is an open-source product that lets us implement all the features we need
  • Loki components consume significantly fewer resources under the same loads
  • All components can be run in a Kubernetes cluster, and you can use HPA or KEDA to scale them automatically (see the sketch after this list)
  • Data takes up several times less space because it is stored in compressed form
  • Query construction is very similar to PromQL
  • Alerts are described in much the same way as Prometheus alerts
  • The system integrates perfectly with Grafana for message visualization and graphing
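
To illustrate the autoscaling point, here is a rough KEDA ScaledObject sketch that scales the Loki distributor on incoming log volume. The deployment name, namespace, Prometheus address, metric, and threshold are illustrative assumptions rather than our exact production values:


apiVersion: keda.sh/v1alpha1
kind: ScaledObject
metadata:
  name: loki-distributor
  namespace: loki
spec:
  scaleTargetRef:
    name: loki-distributor   # deployment created by the distributed chart
  minReplicaCount: 2
  maxReplicaCount: 10
  triggers:
    - type: prometheus
      metadata:
        serverAddress: http://prometheus.monitoring.svc:9090
        # Scale on the rate of bytes received by the distributors
        query: sum(rate(loki_distributor_bytes_received_total[2m]))
        threshold: "10000000"


A plain HPA on CPU also works, but scaling on the distributors' received-bytes rate tracks the actual log volume more directly.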



The distributed Helm chart was chosen as the deployment method, with object storage for the chunks. The more lightweight Vector replaced Logstash. The system is quite functional out of the box at low volumes (several hundred messages per second): logs are ingested almost in real time, and searching fresh data works almost as fast as in Kibana. The resulting scheme looks as follows.




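For the shipping side, a minimal Vector configuration sketch (YAML format) that tails Kubernetes pod logs and pushes them to Loki might look like this; the endpoint and label are placeholders, not our production settings:


sources:
  k8s_logs:
    type: kubernetes_logs

sinks:
  loki:
    type: loki
    inputs:
      - k8s_logs
    endpoint: http://loki-distributor.loki.svc:3100
    encoding:
      codec: json
    labels:
      # Label streams by namespace so LogQL selectors stay cheap
      namespace: "{{ kubernetes.pod_namespace }}"
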
The best settings in Loki

As the load increases, the write path starts to suffer: both the ingester and the distributor start dropping logs, returning timeouts, and so on. And when querying data that is not in the ingester's cache, the response time starts to approach a minute, or the request simply times out. Just in case, here is a diagram of what a Loki installation via the distributed chart looks like.



In case of write problems, it is worth paying attention to these parameters:


limits_config.ingestion_burst_size_mb
limits_config.ingestion_rate_mb


They control ingestion throughput: when traffic approaches these thresholds, messages start to be discarded.
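
For reference, a sketch of how these limits might be raised in the Loki configuration; the values are purely illustrative:


limits_config:
  # Sustained ingestion rate allowed per tenant, in MB/s
  ingestion_rate_mb: 20
  # Short bursts above the sustained rate, in MB
  ingestion_burst_size_mb: 40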


To increase search speed, you should use Memcached or Redis caching. You can also tune the parameters:


limits_config.split_queries_by_interval
frontend_worker.parallelism


Loki can cache four types of data:


  • Chunks
  • Indexes
  • Responses to previous queries
  • Data caching for deduplication needs


It is worth using at least the first three.
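
As a rough illustration of what enabling the chunk, index, and query-results caches can look like in a Loki 2.x-style configuration (the Memcached hostnames are placeholders):


chunk_store_config:
  chunk_cache_config:
    memcached_client:
      host: memcached-chunks.loki.svc
      service: memcache

storage_config:
  index_queries_cache_config:
    memcached_client:
      host: memcached-index.loki.svc
      service: memcache

query_range:
  cache_results: true
  results_cache:
    cache:
      memcached_client:
        host: memcached-results.loki.svc
        service: memcache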


After tuning component settings, scaling, and enabling the cache, logging delays disappeared, and the search began to work in an acceptable time (within 10 seconds).


As for alerting rules, they are written similarly to Prometheus rules:


- alert: low_log_rate_common
  expr: sum(count_over_time({namespace="common"}[15m])) < 50
  for: 5m
  labels:
    severity: warning
  annotations:
    summary: Count is less than 50 from {{ $labels.namespace }}. Current VALUE = {{ $value }}


The LogQL query language is used to access messages; it is similar in syntax to PromQL. In terms of visualization, everything looks much like it does in Kibana.


A very convenient feature is setting up links from log messages to the corresponding traces. In this case, clicking the link opens the trace in the right half of the workspace.
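
In Grafana, such links are configured as derived fields on the Loki data source. A minimal provisioning sketch, assuming log lines carry a trace_id=<id> token and the Tempo data source has the UID tempo:


apiVersion: 1
datasources:
  - name: Loki
    type: loki
    url: http://loki-query-frontend:3100
    jsonData:
      derivedFields:
        # Extract the trace ID from log lines and link it to Tempo
        - name: TraceID
          matcherRegex: 'trace_id=(\w+)'
          url: '$${__value.raw}'
          datasourceUid: tempo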


Migration to Tempo

Tempo is a younger product (introduced in 2020) with an architecture very similar to Loki's. The distributed Helm chart was again chosen as the deployment method.


Distributors can consume from Kafka directly, but in that setup, it was impossible to achieve read speeds commensurate with the write speeds. Putting Grafana Agent in front of the distributors solved this problem.
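
For illustration, a Grafana Agent (static mode) traces section that accepts Jaeger spans and forwards them to Tempo could look roughly like this; the endpoint is a placeholder and the exact fields depend on the agent version:


traces:
  configs:
    - name: default
      receivers:
        # Accept spans from Jaeger clients/agents over gRPC
        jaeger:
          protocols:
            grpc:
      remote_write:
        # Forward to the Tempo distributor
        - endpoint: tempo-distributor.tempo.svc:4317
          insecure: true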




The best settings in Tempo

Pay attention to the following points during configuration:


  • If Jaeger is used, spans from Grafana Agent to the distributor will be sent via gRPC. In a Kubernetes cluster, evenly spreading that traffic across distributors requires configuring a balancer such as Envoy (optionally via Istio).
  • In case of write speed issues, it is worth increasing the parameters:
    • overrides.ingestion_burst_size_bytes
    • overrides.ingestion_rate_limit_bytes
    • overrides.max_bytes_per_trace
  • You can also increase the timeout for waiting for a response from the ingesters (ingester_client.remote_timeout); they don't always manage to respond within 5 seconds. A sketch of these settings follows this list.
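
A rough sketch of where these knobs live in the Tempo configuration; the values are illustrative only:


overrides:
  # Allow larger ingestion bursts and sustained rates (bytes)
  ingestion_burst_size_bytes: 40000000
  ingestion_rate_limit_bytes: 30000000
  # Raise the cap on a single trace's total size (bytes)
  max_bytes_per_trace: 10000000

ingester_client:
  # Give ingesters more time to respond than the 5s default
  remote_timeout: 10s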


In addition to the trace visualization, you can also get a connected graph showing all the components involved in processing a request.



Conclusion

It took about two months from the initial installation of Loki in the dev environment to running it in production. Tempo took about the same amount of time.


As a result, we were able to:


  • Reduce the cost of the systems by about 7 times
  • Combine visualization of metrics, logs, and traces in a single system
  • Organize alerting in a similar way to Prometheus
  • Set up autoscaling of the system depending on the amount of incoming data


It is worth mentioning what we have not managed to achieve yet:


  • Log retrieval speeds comparable to Elastic when the data is not in the cache. This is partially solved by adding the most frequently used fields to the index.
  • Obtaining aggregated values over long periods (more than a day), for example, counting the number of log messages with the count_over_time function.
  • Span-write performance on par with the Elasticsearch-based scheme. So far, we have managed to compensate with horizontal scaling and by increasing the number of partitions in Kafka.


In general, I can say that the switch to the Grafana Stack was a success, but the work of improving our logging and tracing schemes is not over yet.