If you are new to “Service Mesh” and “Envoy”, I have a post explaining both of them here.
This is the second post in the Observability with Envoy service mesh series; you can read the first post, about Distributed Tracing, here.
With microservices, you cannot afford to be in the dark when it comes to monitoring; you need to at least know when something is going wrong.
Let us look into how Envoy can help us get some insight into what is going on with our services. With a service mesh, all the traffic goes through the mesh: no service talks to another service directly. Instead, a service makes a call to its Envoy, and Envoy routes the call to the destination service, so Envoy has context about both incoming and outgoing traffic. Envoy generally provides metrics about the incoming requests, the outgoing requests, and the state of the Envoy instance itself.
Here is an overview of what we are trying to build
overall setup
Envoy supports publishing metrics in a few formats (statsd and dogstatsd among them), but for this post we will use the statsd format.
So, with that said, the flow will be: Envoy pushes the metrics to statsd, from statsd we pull the metrics using Prometheus (a time series database), and then we visualise the metrics using Grafana.
In our setup diagram I have mentioned a statsd exporter instead of statsd for a reason: we are not going to run statsd as such, but a converter (service) which accepts data in statsd format and exposes it in Prometheus format. Gets the job done for us.
Envoy’s metrics can be majorly classified into two:
1. Counters: values that only ever go up, like total requests or total connection failures.
2. Gauges: values that can go up and down, like currently active connections.
Let us look at an Envoy configuration with stats sink
Lines 8–13 tell Envoy that we need metrics in statsd format, what the prefix for our stats is (usually your service name), and where our statsd sink lives.
Lines 55–63 configure the statsd sink in our environment.
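Since the full file is linked above, here is a minimal sketch of what those two pieces look like (the envoy.statsd sink name is real; the prefix, cluster name, and exporter address are assumptions to illustrate the shape):

```yaml
# 1. the stats sink: push metrics in statsd format, tagged with a prefix
stats_sinks:
  - name: envoy.statsd
    config:
      prefix: service_a           # usually your service name
      tcp_cluster_name: statsd    # the cluster defined below

# 2. the cluster that points at the statsd exporter container
static_resources:
  clusters:
    - name: statsd
      connect_timeout: 0.25s
      type: strict_dns
      lb_policy: round_robin
      hosts:
        - socket_address:
            address: statsd_exporter
            port_value: 9125
```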
That is all the configuration that is needed to get stats out of Envoy. Now, if you look at lines 2–7 of the config, there are two things happening.
You need to add the same stats configuration to the sidecar Envoys of the other services in our system (yes, every service has its own Envoy sidecar).
The services themselves are written in Go, and they do not do much except call other services through Envoy. You can look at the service and Envoy configurations here.
So right now we only have the statsd exporter in the picture. With this, if we run the docker containers (docker-compose build && docker-compose up) and send some traffic to the Front Envoy (localhost:8080), Envoy will start sending metrics about the traffic to our statsd exporter, which will convert the metrics to Prometheus format and expose them on port 9102.
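For reference, a docker-compose entry for the exporter could look something like this (prom/statsd-exporter listens on 9125 for statsd input and serves metrics on 9102 by default):

```yaml
statsd_exporter:
  image: prom/statsd-exporter
  ports:
    - "9125:9125"   # statsd ingest, where Envoy pushes its metrics
    - "9102:9102"   # metrics endpoint, where Prometheus will scrape
```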
This is how the stats look in the statsd exporter:
metrics from statsd exporter in prometheus format
There would be hundreds of stats; in the above screenshot, we are seeing the latency metrics for communication between Service A and Service B. The metrics in the image are in the Prometheus format:
metric_name ["{" label_name "=" `"` label_value `"` { "," label_name "=" `"` label_value `"` } [ "," ] "}"] value [ timestamp ]
You can read more about it here.
We are going to use Prometheus as our time series database to store our metrics. Prometheus is not just a time series database; it is a monitoring system in itself, but in our setup we will use it as a datastore for our metrics. An important thing to note is that Prometheus is a pull-based system, which means you have to tell Prometheus where to scrape the metrics from; in our case, that is our statsd exporter.
Adding prometheus to the equation is very straightforward, we just need to pass the scrape targets(statsd exporter) as a configuration file to prometheus. Here is what the configuration will look like
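A minimal version of that file could look like this (the target host name assumes the exporter's docker-compose service is called statsd_exporter):

```yaml
global:
  scrape_interval: 5s            # how often to pull from the targets
scrape_configs:
  - job_name: statsd-exporter
    static_configs:
      - targets: ['statsd_exporter:9102']   # container name : metrics port
```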
scrape_interval is the frequency at which Prometheus will pull metrics from the targets.
So now we should have Prometheus up, and some data in it as well. Let's fire up localhost:9090 and see what it has:
prometheus query page
As we can see, our metrics are available. You can do a lot more than just selecting existing metrics; you can read about the Prometheus query language here. It can also plot graphs based on our queries, and it has an alerting system as well.
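As a taste of the query language, wrapping a counter in rate() like this (metric name made up for illustration) gives the per-second request rate over the last minute:

```
rate(envoy_cluster_upstream_rq_total[1m])
```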
If we load up the targets page in Prometheus, we see all the scrape targets and the health of those targets:
prometheus targets
Grafana is an awesome visualisation and monitoring solution which supports a lot of backends like Prometheus, Graphite, InfluxDB, ElasticSearch, etc.
Grafana has two major components that we need to configure:
1. Datasource: the backend from which Grafana pulls the metrics; for us it will be Prometheus (a provisioning sketch follows this list).
2. Dashboard: This is where you visualise the metrics from your data source. Grafana supports a wide variety of visual elements like Graphs, Single Stats, Heatmaps, etc., and you can extend this and build your own using plugins.
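As promised above, a minimal datasource provisioning file might look like this (the file path and names are assumptions; Grafana loads anything under its provisioning directory):

```yaml
# e.g. grafana/provisioning/datasources/prometheus.yaml (path assumed)
apiVersion: 1
datasources:
  - name: prometheus
    type: prometheus
    access: proxy                  # Grafana proxies queries server-side
    url: http://prometheus:9090    # the prometheus container from compose
    isDefault: true
```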
The only problem I have with Grafana is that there is no standard way of developing these dashboards as code. There are some third-party libraries which support this, and we will use the one from Weaveworks called grafanalib.
Here is the dashboard that we are trying to build expressed as python code
We are building graphs for 2xx, 5xx and latency. Lines 5–22 are the important part: they extract the service names available in our setup as Grafana variables, which makes our dashboard dynamic, meaning we will be able to select the source and destination service for which we want to view these statistics. More about variables here.
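Since the full file lives in the repo, here is a condensed grafanalib sketch of the idea; the datasource name and the metric and label names are assumptions for illustration, not the exact ones from the repo:

```python
# A condensed sketch, NOT the full dashboard from the repo.
from grafanalib.core import (
    Dashboard, Graph, Row, Target, Template, Templating,
)

# A template variable: pulls service names out of Prometheus labels,
# giving the dashboard a dropdown instead of hard-coded service names.
source = Template(
    name="source",
    dataSource="prometheus",
    query="label_values(envoy_cluster_upstream_rq_total, envoy_cluster_name)",
)

# generate-dashboard looks for a module-level `dashboard` object
dashboard = Dashboard(
    title="Service Dashboard",
    templating=Templating(list=[source]),
    rows=[
        Row(panels=[
            Graph(
                title="2xx",
                dataSource="prometheus",
                targets=[Target(
                    # "$source" is substituted from the dropdown above
                    expr='sum(rate(envoy_cluster_upstream_rq_xx'
                         '{envoy_response_code_class="2",'
                         'envoy_cluster_name="$source"}[1m]))',
                    legendFormat="2xx",
                )],
            ),
        ]),
    ],
).auto_panel_ids()
```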
You have to use grafanalib's generate-dashboard command to generate the dashboard JSON from the above python file:
generate-dashboard -o dashboard.json service-dashboard.py
Beware: the generated dashboard.json is not easy to read.
So we just need to pass the dashboard and the datasource while starting up Grafana.
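In docker-compose terms, that could look something like this (the mount paths are assumptions; they have to match where your provisioning files and the generated dashboard.json actually live):

```yaml
grafana:
  image: grafana/grafana
  volumes:
    # provisioning files (datasources, dashboard providers)
    - ./grafana/provisioning:/etc/grafana/provisioning
    # the dashboard.json generated by grafanalib
    - ./grafana/dashboards:/var/lib/grafana/dashboards
  ports:
    - "3000:3000"
```

With that in place, when you visit http://localhost:3000, you will be greeted with: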
grafana dashboard
There you go: you have your 2xx, 5xx and latency charts, and you can also see the dropdowns where you can select the source and destination services. There is more to Grafana than what we have discussed: there is a powerful query editor and an alerting system, and, more importantly, everything is extensible using plugins and apps; check out an example here. If you are visualising metrics of common services like Redis, RabbitMQ, etc., Grafana has a repository of public dashboards from which you can just import them and use them. Another good thing about Grafana is that you can create and manage everything with configuration files and code, without dabbling much with the UI.
I would urge you to play with Prometheus and Grafana to figure out more. Thanks for your time. Please leave your feedback in the comments.
You can find all the code and configuration files here:
dnivra26/envoy_monitoring: Demo for monitoring microservices with Envoy service mesh, Prometheus & Grafana (github.com)