Running systems in production involves requirements for high availability, resilience and recovery from failure. When running cloud-native applications this becomes even more critical, as the base assumption in such environments is that compute nodes will suffer outages, Kubernetes nodes will go down and microservice instances are likely to fail, yet the service is expected to remain up and running.
In a recent post, I presented the different Jaeger components and best practices for deploying Jaeger in production. In that post, I mentioned that Jaeger uses external services, such as Elasticsearch, Cassandra and Kafka, for ingesting and persisting the span data. This is because the Jaeger Collector is a stateless service: you need to point it to some sort of storage to which it will forward the span data.
In this post, I’d like to discuss how to ingest and persist Jaeger trace data in production to ensure resilience and high availability, and the external services you need to set up for that.
Deploying Jaeger with Elasticsearch, Kafka or other External Services
Jaeger deployments may involve additional services such as Elasticsearch, Cassandra and Kafka. But do these services come as part of Jaeger’s installation, and how are they deployed?
The Jaeger Operator and Jaeger’s Helm chart (see Jaeger’s deployment tools in this post) offer the option of a self-provisioned Elasticsearch/Cassandra/Kafka cluster (in which the Jaeger deployment also deploys these clusters), as well as the option of connecting to an existing cluster.
The self-provisioned option offers a good starting point, but you may prefer to deploy these services independently for better flexibility and control over the way these clusters are deployed, managed, monitored, upgraded and secured, in accordance with your team’s DevOps practices.
In particular, if you are already running a Kafka or Elasticsearch cluster, it may make more sense to re-use these infrastructure components rather than maintain a separate cluster.
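For example, if you already run Elasticsearch, a Helm-based installation can skip the self-provisioned datastore and point Jaeger at the existing cluster. The following is only a sketch: the host name is a placeholder and the exact value names (provisionDataStore.*, storage.*) may differ between chart versions.

helm repo add jaegertracing https://jaegertracing.github.io/helm-charts
# connect to an existing Elasticsearch cluster instead of self-provisioning one
helm install jaeger jaegertracing/jaeger \
  --set provisionDataStore.cassandra=false \
  --set storage.type=elasticsearch \
  --set storage.elasticsearch.host=elasticsearch.monitoring.svc \
  --set storage.elasticsearch.port=9200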
For production deployments, Jaeger currently provides built-in support for two storage solutions, both of which are very popular open-source NoSQL databases: Elasticsearch and Cassandra. The Jaeger collector and query service need to be configured with the storage solution of choice so they can write to it and query it. You can pass the desired storage type and the database endpoint via environment variables. For example, a basic Elasticsearch setup will define the following environment variables:
SPAN_STORAGE_TYPE=elasticsearch
ES_SERVER_URLS=<...>
Caption: Illustration of direct-to-storage architecture. Source: jaegertracing.io
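To make this concrete, here is a minimal sketch of passing these variables to the collector and query containers; the Elasticsearch URL is a placeholder for your own endpoint.

# collector writes spans to the external Elasticsearch cluster
docker run -d \
  -e SPAN_STORAGE_TYPE=elasticsearch \
  -e ES_SERVER_URLS=http://elasticsearch:9200 \
  jaegertracing/jaeger-collector

# query service reads from the same storage to serve the UI and API
docker run -d \
  -e SPAN_STORAGE_TYPE=elasticsearch \
  -e ES_SERVER_URLS=http://elasticsearch:9200 \
  jaegertracing/jaeger-query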
So which storage backend should you use: Elasticsearch or Cassandra?
The Jaeger team clearly recommends Elasticsearch as the storage backend over Cassandra, and they have good reasons for it.
That said, one benefit of the Cassandra backend is simplified maintenance, thanks to its native support for data TTL. In Elasticsearch, data expiration is managed through index rotation, which requires additional setup (see Elasticsearch Rollover).
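As a rough sketch of what that additional setup involves, the Jaeger project ships helper images for rollover and index cleanup; the rollover conditions, retention period and Elasticsearch URL below are placeholders, so verify the exact invocations against the Elasticsearch Rollover documentation.

# one-time initialization of the write alias and index templates
docker run --rm --net=host jaegertracing/jaeger-es-rollover:latest init http://elasticsearch:9200

# run periodically (e.g. as a cron job) to roll over to a new index and prune old ones
docker run --rm --net=host -e CONDITIONS='{"max_age": "1d"}' jaegertracing/jaeger-es-rollover:latest rollover http://elasticsearch:9200
docker run --rm --net=host jaegertracing/jaeger-es-index-cleaner:latest 7 http://elasticsearch:9200

# the collector and query services should then read and write through the rollover alias (ES_USE_ALIASES=true)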
In addition to Jaeger’s built-in support for Elasticsearch and Cassandra, Jaeger supports a gRPC plugin (SPAN_STORAGE_TYPE=grpc-plugin), which enables developing custom plugins for other storage types. The Jaeger community currently offers integrations with about a dozen persistent storage types, four of which are defined as ‘available’ at present: ScyllaDB, InfluxDB, Couchbase and Logz.io (disclaimer: I work at Logz.io). Other integrations, which are not yet available, include NoSQL data stores from the big cloud vendors, such as Amazon DynamoDB, Azure CosmosDB and Google BigTable, as well as the popular SQL databases MySQL and PostgreSQL. You can check out the list of additional storage backends and their current status in this Jaeger GitHub issue.
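As an illustration, wiring in such a plugin typically means pointing the Jaeger services at the plugin binary and its configuration file. The paths and names below are hypothetical placeholders for whichever plugin you choose.

SPAN_STORAGE_TYPE=grpc-plugin
GRPC_STORAGE_PLUGIN_BINARY=/plugins/my-storage-plugin
GRPC_STORAGE_PLUGIN_CONFIGURATION_FILE=/plugins/my-storage-plugin-config.yaml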
If you monitor many microservices, have a high volume of span data, or your system generates data bursts on occasion, then your external backend storage may not be able to handle the load and may become a bottleneck, impacting overall performance. In such cases, you should employ the streaming deployment strategy that I mentioned in the previous post, which places Kafka between the Collector and the storage to buffer the span data.
Caption: Illustration of architecture with Kafka as intermediate buffer. Source: jaegertracing.io
In this case, you configure Kafka as the target for the Jaeger Collector (SPAN_STORAGE_TYPE=kafka), as well as the relevant Kafka brokers, topic and other parameters.
I’d like to stress that Kafka is not an alternative backend storage (although the SPAN_STORAGE_TYPE=kafka setting may be confusing). Your Jaeger backend still needs a storage backend as described in the previous sections, with Kafka serving as a buffer to take the pressure off.
To support the streaming deployment, the Jaeger project also offers the Jaeger Ingester service, which asynchronously reads from the Kafka topic and writes to the storage backend (Elasticsearch or Cassandra). Of course, you can choose to implement your own service to do the same if you need a particular target storage or ingestion strategy.
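Here is a minimal sketch of such a streaming setup, assuming Elasticsearch as the final storage; the broker addresses and topic name are placeholders, and the variable names map from the collector and ingester CLI flags.

# Jaeger Collector: write spans to Kafka instead of directly to storage
SPAN_STORAGE_TYPE=kafka
KAFKA_PRODUCER_BROKERS=kafka-1:9092,kafka-2:9092
KAFKA_PRODUCER_TOPIC=jaeger-spans

# Jaeger Ingester: consume from the same topic and write to Elasticsearch
SPAN_STORAGE_TYPE=elasticsearch
ES_SERVER_URLS=http://elasticsearch:9200
KAFKA_CONSUMER_BROKERS=kafka-1:9092,kafka-2:9092
KAFKA_CONSUMER_TOPIC=jaeger-spans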
Until now I've discussed production deployment. However, if you are exploring Jaeger or are doing a small PoC or development, then you are probably using Jaeger’s All-in-One installation, and you may be wondering how this is applicable to you.
All-in-one is a single-node installation, in which you don’t trouble yourself with non-functional requirements such as resilience or scalability. In an all-in-one deployment, Jaeger uses in-memory persistence by default. Alternatively, you can choose to use Badger, an embedded local key-value store which, by default, provides ephemeral storage based on a temporary filesystem. You can find more details on using Badger here.
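Selecting Badger in all-in-one is a matter of setting the storage type; optionally, you can also turn off the ephemeral mode and point it at local directories so the data survives restarts. This is a sketch only, and the directories are placeholders.

SPAN_STORAGE_TYPE=badger
BADGER_EPHEMERAL=false
BADGER_DIRECTORY_KEY=/badger/key
BADGER_DIRECTORY_VALUE=/badger/data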
Bear in mind that both in-memory and Badger are meant for all-in-one deployments only, and are not suitable for production deployments.
When deploying Jaeger in production, you need to address data persistence, high availability and scalability concerns. In order to address these concerns, you need to deploy additional services.
First of all, you should deploy and configure external persistent storage for your span data. The recommended storage backend for Jaeger in production is Elasticsearch.
Secondly, when dealing with a high load of span data, you should deploy Kafka in front of the storage to buffer the ingestion and relieve the pressure on the backend.
Running in production entails many other considerations not covered in this post, such as upgrading the Jaeger components as well as Elasticsearch, Kafka or any other service in the deployment, monitoring the different services, and securing access to them.
There’s another option: using Jaeger as a managed service, so that you can leverage the best open source for distributed tracing without having to deal with its deployment and maintenance overhead. We at Logz.io did that with Log Analytics, taking the ELK Stack and offering it as a fully managed service, and then with open source Grafana for infrastructure monitoring. Now we offer the same with Jaeger, which comes with alerting and logs-traces correlation for full observability. Join our Beta program and try it out.
Previously published at https://logz.io/blog/jaeger-persistence/